Help with regex to create Gramps case-sensitive filter

GRAMPS: AIO64-5.1.4-1 on W10pro. I need a bit of help taming the Python implementation of regex within Gramps.

I am trying to set up a filter to find (in notes) particular data in title case, i.e. lower case with an initial capital. For certain instances, I want to reformat these to all upper case. I do not want to do it globally, but at the moment I cannot get it to work for any single string. I want to set up a custom filter into which I enter my target string, and then work through the results to tidy up the data so the case-formatting is consistent.

It seems that the routines called by Gramps parse all text on a case-insensitive basis. If I search for e.g. “\bSmith” with “use regular expressions” checked, I still get results containing all the instances of “SMITH”. Those are already the vast majority of instances within my data, which is why I want a filter that finds the much smaller number of title-case instances so they can be fixed.

Even if I construct a regex using \u0000 to define each of the characters I am looking for, in its appropriate case, Gramps still returns all instances on a case-insensitive basis.

Is there some way of forcing a Gramps regex to be case-sensitive?

It is perfectly understandable that some Gramps filtering (e.g. on names of people) needs to be case insensitive.

In the case of names of people, there are preferences to display names in a variety of formats. The available name formats not only consist of various name elements presented in a user-editable sequence; the elements can also be set to appear in uppercase, or [according to the Display name editor] otherwise appear literally as entered into the database. So it is very likely that many databases will contain name data in a variety of input case formats.

So the name display options enable consistent output of names, whatever the case formatting of the name elements at the time of entry of data for different people. Which for that purpose is a good thing.

But it is not at all clear why there would be no option for case sensitive filtering within the strings of descriptive fields (such as an event description), or of the text in notes.

If nothing else, Gramps needs to provide a practical mechanism for a user to format the names that appear within event descriptions and notes consistent with whatever name format they have selected for display purposes.

I do not know my way around Python code, and I certainly have not worked out which bits of code do the regex filtering (from the sidebar view) for the Events or Notes views. But I do notice that some .py files within my Gramps installation, apparently linked to filtering, include keywords such as “case insensitive” or “ignorecase”, and they contain a number of variables such as case_sensitive which can (presumably) be set true or false (e.g. in C:\Program Files\GrampsAIO64-5.1.4\gramps\gen\lib\baseobj.py ).

It is not obvious to me that Gramps was deliberately engineered to be completely case-insensitive for all filtering. It occurs to me there might be a bug, but it is perhaps more likely that inadvertent choices made for the majority of filtering instances have had the effect of disabling case sensitivity elsewhere (perhaps everywhere)?

Unless someone who is more adept at understanding Python code than I am (I am not setting the bar very high!) can verify there is a bug somewhere in the Gramps code that forces all filtering to case-insensitivity, in which case I will file a bug report, I propose to file an enhancement request to enable case-sensitivity for at least the description field associated with events, and for notes. Any other comments?

There is another thread started in February that is trying to explore which dialect of RegEx is supported by Gramps:

And this forum thread was mentioned there. But maybe you can enlist a developer’s help to try using the 3rd Party regex library instead of Python’s native re library?

It would need some performance testing. And verification that it actually expands the case-sensitivity controls for Unicode pattern matching.

The SuperTool add-on might also be an alternative for a deeper level of control than the re library for Python offered.

Thanks Brian, I have already looked at the pypi regex package, which does look promising. I have started working out how I might test it. But I need to proceed with great caution.

I only noticed your earlier post about which flavour of regex Gramps was currently using after I had posted my original request. As an aside, the Discourse suggestion mechanism about existing “similar” posts is not at all impressive – despite the explicit keyword regex in both our posts, yours was not offered as being similar, but a list of completely and inexplicably irrelevant ones were pushed at me nevertheless, which at the time was a major distraction.

It seems that despite advice to the contrary, the Gramps regex implementation is different from most recent Python ones, in that Gramps appears to default to case insensitivity, which as far as I can tell is the reverse of the usual default.

Knowing about the perhaps idiosyncratic Gramps behaviour does help, but unfortunately invoking case sensitivity in Gramps currently seems to be completely inaccessible.

I will also undertake some more testing of the Supertool addon, but I still have a big learning curve ahead of me on that front. I doubt that it will help much, since as far as I can tell it does not assemble a list from which the Gramps event or note editor can be invoked iteratively. The nature of the changes I need to make to the case-formatting of various parts of the text data in descriptions or notes is such that it is not likely the process could be automated (even if I had the regex experience to code for replacement and then to commit changes). Rather, I need a mechanism to locate objects in which at least one (and often many more than one) instance of wrongly-formatted names exist. Locating the object is the first step, then all the many (likely different) names in that text object can be edited in a single though inevitably manual pass. A typical example is a transcript of a long and potentially complicated newspaper article which was entered into Gramps at a time when I was using a different Gramps name display format, and with data entry formatting matching the name display format of name objects at that time. But I have subsequently changed my preferred name display format. Now that I have settled on a preferred name format, I need to update the formatting of text within the descriptions and the notes so they are consistent with the display formatting of names of people in the db.

At the very least it would be helpful if the Gramps documentation stated that some “standard” Python regex behaviour such as switching case sensitivity on or off is NOT accessible in Gramps! I have wasted a huge amount of time fruitlessly trying to get it to work – I just hope others don’t have to go through all that again. When I have a bit more clarity about the current limits, I will have a go at updating that part of the wiki to at least give people a warning.

I suspect an enhancement request might be needed to flag this, as it doesn’t look like case-sensitivity will be enabled very quickly.

Improving the RegEx documentation is why the thread was started. I lack the interest to explore that feature in Gramps. (There are MANY other features that beckon more stridently.) But I wanted the Wiki to provide better leads on its use. So questions were asked instead of exploring aimlessly.

@SNoiraud pointed out that enabling the Regular Expression option converts the Filter Gramplet’s Name search into looking for a ‘phrase’ instead of the much slower ‘all words’. This is not something the typical user would suspect.

This was related to PR #674, “In personsidebarfilter, search on each part of name” by rtclay (gramps-project/gramps on GitHub), followed by PR #900, “Using regex in the sidebar gives different result” by SNoiraud.
The first one was related to bug 0007950: “Name filter should work even when omitting middle names” (Gramps Bugtracker).

When we use regexp, if you use for example (a|b),
a means only a and not A.
If you want to match a or A, you must use: ([aA]|[bB])
This is how lexeme searching works.
If you are looking for Axel: axel and AXEL do not match and this is normal.

If you really want this, the only solution would be to use re.IGNORECASE but this must be an additional option because not everyone wants to use it.
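A minimal plain-Python sketch of that default behaviour, run outside Gramps; the sample names are made up for illustration. Without flags, `re` matches case exactly, and it is `re.IGNORECASE` that lets `axel` and `AXEL` match as well.

```python
import re

samples = ["Axel", "axel", "AXEL"]  # illustrative data, not from any Gramps tree

# Default compile: matching is case-sensitive.
strict = re.compile(r"Axel")
strict_hits = [s for s in samples if strict.search(s)]

# Same pattern with re.IGNORECASE (alias re.I): case is ignored.
loose = re.compile(r"Axel", re.IGNORECASE)
loose_hits = [s for s in samples if loose.search(s)]

print(strict_hits)
print(loose_hits)
```

Only the first list is restricted to the exact-case spelling; the second contains every sample.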

But this is not what Gramps implements; it does the opposite.

I look for the final lowercase m alone and yet it returns both.

Same thing in notes:

This is because we use re.IGNORECASE in all requests.
See requestprepare in _rule.py,
where we have re.I:

        try:
            self.regex[index] = re.compile(self.list[index], re.I)
        except re.error:

So the only solution for this problem is to add an option alongside “use regexp”, which could be “Respect the case”.

But Serge, at least on my install (AIO64-5.1.4-1 on W10pro), that is NOT how Gramps actually behaves!

Filtering for Axel or AXEL or axel all produce identical results!

If (in the People view sidebar) I enter “axel” or “AXEL”, with “use regular expressions” enabled, all the result instances I get are as “Axel” (and “Axelina”), and in my data, I do not find a single instance of “axel” or “AXEL” (but all filters using different case patterns give me the same “Axel” & “Axelina” results).

In the people view, this is potentially complicated by the existence of name display formatting which might override the input case of the entries.

In the description field of events, or in the text of notes, exactly the same behaviour is evident, so for the same filter as for the people view, and again with “use regular expressions” enabled, all result instances are as “Axel” (and in my particular Event.Description data, also of “WAXELL”), but again none of my results is as “axel” or “AXEL”, regardless of the case of the filter pattern.

So the evidence appears to be that Gramps does something like an upper() on both the pattern and the target before it conducts a regex, or else the regex is always case-insensitive.

I would love to know how a filter in either the Events Description field, or the Notes text, can be made actually selective as to case(s) of the target string.

Ignorecase is what I suspected, but where is this documented?

Why is it apparently global within Gramps when the Python default is the opposite?

The version I see in my install is –

        try:
            self.regex[i] = re.compile(self.list[i], re.I)
        except re.error:

Is that functionally the same, or does it have a different meaning [vis-a-vis the “index” version in your snippet below]?
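As far as plain Python goes, `i` and `index` are only loop-variable names, so the two snippets should behave the same. A small sketch using a hypothetical stand-alone function (`compile_all` is invented for this example and is not the real Gramps method; the fallback in the `except` branch is an assumption, since the quoted snippets stop before it):

```python
import re

def compile_all(patterns):
    """Compile each pattern with re.I, loosely mirroring the quoted snippet."""
    compiled = [None] * len(patterns)
    for i in range(len(patterns)):  # renaming 'i' to 'index' changes nothing
        try:
            compiled[i] = re.compile(patterns[i], re.I)
        except re.error:
            # Assumed fallback: an empty pattern, which matches anything.
            compiled[i] = re.compile("")
    return compiled

# The second pattern is deliberately invalid to exercise the except branch.
regexes = compile_all([r"\bSmith", r"("])
print(regexes[0].search("JOHN SMITH") is not None)
```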

Exactly what edit do we need to enable case sensitivity? Is _rule.py the only file that needs to be altered?

I can trial this before filing an enhancement request … I am happy to force it to case sensitivity for ALL filtering while I trial it.

Do you mean that if you select Axel, you only want Axel, axel, AXEL but not WAXELL and AXELLINA?
You only want Axel?

One of the things mentioned in the 3rd-party regex library documentation is the improvement in CASE handling for Unicode.

Is handling of the extended set of Unicode glyph cases related to why Gramps uses the IGNORECASE switch by default?

Maybe switching libraries would create another option?

Note: Found an interesting Python bug report about “It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE”

Also interesting from StackOverflow: How do I match all unicode lowercase characters in Python with a regular expression?

No – if my pattern is Axel, I want only “Axel” on a case-sensitive basis – not AXEL, not axel, not aXEL, not aXel etc

I can exclude e.g. WAxell or Axellina – which I also do not want – if I use a word boundary “\b” before & after my pattern string – as in “\bAxel\b”.

But my main problem is that I cannot currently filter on a case-sensitive basis to distinguish between “Axel” and “AXEL”.
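One possibility that may be worth testing: since Python 3.6, the `re` module accepts scoped inline flags, and `(?-i:...)` switches off ignore-case for just that group even when the pattern is compiled with `re.I`. The sketch below is plain Python only, with made-up sample text. Whether such a pattern passes through the Gramps filter unchanged, and whether the Python bundled with a given Gramps install is new enough, has not been verified here.

```python
import re

# Made-up sample strings covering the cases discussed in this thread.
texts = ["Axel Berg", "AXEL BERG", "axel berg", "WAXELL", "Axelina"]

# Compiled with re.I, as in the quoted requestprepare snippet, but the
# (?-i:...) group turns case-insensitivity off for 'Axel' itself.
# The \b anchors exclude longer words such as WAXELL and Axelina.
pattern = re.compile(r"\b(?-i:Axel)\b", re.I)

matches = [t for t in texts if pattern.search(t)]
print(matches)
```

If this does work inside Gramps, the string to type into the filter field would be `\b(?-i:Axel)\b` with “use regular expressions” checked.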

I think this is a feature request. As I said previously we need a new option to select a “strict search”.

Gentlemen;
Gramps is using regex and, as Serge points out, is using the case-insensitive option “re.I”. Deleting that bit in the requestprepare method of _rule.py would make every filter that uses rule.requestprepare without overriding it case-sensitive. I for one would not want that, as for most uses the searches should return results regardless of case. And I think we already have too many options…

If you want to create a specialized filter that is case sensitive, you could create an addon filter that overrides the requestprepare method, leaving out the “re.I” option. Maybe something like CaseSensRegExpName.
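A rough sketch of that idea, assuming the base class behaves like the quoted snippet. `BaseRule` below is a simplified stand-in written for this example, not the real Gramps `Rule` class, and `match_text` and `CaseSensRegExp` are hypothetical names. A real addon would subclass the Gramps rule, keep its actual `requestprepare` signature, and be registered as a plugin.

```python
import re

class BaseRule:
    """Simplified stand-in for a Gramps-style rule (an assumption, not the real class)."""

    def __init__(self, patterns):
        self.list = patterns
        self.regex = [None] * len(patterns)

    def requestprepare(self):
        # Mirrors the quoted snippet: every pattern is compiled with re.I.
        for index in range(len(self.list)):
            try:
                self.regex[index] = re.compile(self.list[index], re.I)
            except re.error:
                self.regex[index] = re.compile("")  # assumed fallback

    def match_text(self, index, text):
        # Hypothetical helper: True if pattern number 'index' is found in text.
        return self.regex[index].search(text) is not None

class CaseSensRegExp(BaseRule):
    """Same rule, but compiled without re.I so that case matters."""

    def requestprepare(self):
        for index in range(len(self.list)):
            try:
                self.regex[index] = re.compile(self.list[index])
            except re.error:
                self.regex[index] = re.compile("")  # assumed fallback

default_rule = BaseRule([r"\bSmith\b"])
default_rule.requestprepare()
strict_rule = CaseSensRegExp([r"\bSmith\b"])
strict_rule.requestprepare()

print(default_rule.match_text(0, "John SMITH"))
print(strict_rule.match_text(0, "John SMITH"))
```

Keeping the override in a separate rule leaves the default case-insensitive behaviour untouched for every other filter.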

As to why we do it this way, you would have to ask the authors from 11 years ago…

Maybe a separate case-sensitive RegEx Filter tool that (quickly) Tags? Maybe based on the Addon:AddRemoveTagTool?

(If the option was a Gramplet, like a modified RegEx version of the Filter Gramplet, it could filter the View results. But its extra functionality could not be extended to filters in the export/reports/etc. unless the view results are Tagged. But tagging an extended selection in the results of a view has excess refreshes, making the tagging process incredibly slow. The Addon:AddRemoveTagTool bypasses the extraneous refreshes.)

Paul, I agree that the default insensitive behaviour should not be removed.

You overlook how you & Serge have already contributed to better documentation of Gramps, because your expertise was needed to confirm that indeed the current Gramps behaviour is always case insensitive. I will update the wiki so at least other people are made aware of that, and don’t waste time attempting case-sensitive filtering!

I will also file a feature request.

Thank you! This was a wiki section that I was uncomfortable working on.

I have a large database, and I can see that this might sometimes be helpful.
I wrote a patch for that on Master. I’m gonna do a PR for that.
