I’m running Gramps v5.1.5 on Win10. My database has some 100,000 persons with some 250,000 events and so on. I export my Gramps data with the SQLite exporter into a SQLite database where I do a lot of SQL stuff to modify data and to feed them into some other downstream applications I’m using for e g presenting the data, calculating historical demography statistics and so on.
So it was obvious for me to try to re-import the data from the SQLite database back into Gramps. This works probably quite well for small databases but will slow down with my large database more and more until nothing appears to happen anymore. Consequently, I’m forced to stop the import after staring hours on the virtually frozen progress bar (or after going to sleep and checking the next day ). The task manager shows that the import is still running and consuming a lot of the machine ressources but the import is obviously lost in increasingly time consuming tasks and steps without coming to an end.
The work-around I’m currently using is “tweaking” the XML export (the .gramps file) and re-importing that modified XML file back into a new Gramps database. This process works quite well from a speed perspective (the .gramps import will run for a few minutes) but the real challenge is of course to “tweak” the XML file based on the SQLite data without breaking it.
Has anybody a solution for this problem? Thanks for your comments and ideas on this topic!
Maybe you could improve things by creating some indexes on the exported database and/or modifying the import queries? If nothing else, try using EXPLAIN QUERY PLAN to find out what’s going on.
In ExportSQL.py, only two indexes are created in the exported database, one for the surname table and one for the link table:
db.query("""CREATE INDEX idx_surname_handle ON
db.query("""CREATE INDEX idx_link_to ON
link(from_type, from_handle, to_type);""")
In ImportSQL.py, every SELECT statement has a WHERE clause which references each table’s “handle” column. Although the handle is defined as the primary key, it is not an integer primary key like rowid, and there are no indexes.
Using EXPLAIN QUERY PLAN can show you whether or not SQLITE will create an automatic index for the handle column (for example, the explain output might show “SEARCH TABLE person USING INDEX sqlite_autoindex_person_1 (handle=?)”), or whether it will do a table scan. Scanning the entire person table 100,000 times would probably be slow.
I have a small C# program called Gramps.Net that reads compressed Gramps XML into an XML Document, does some tweaks, and then write it back under a new name. It works quite fast on my tree. which is about 12,000 persons now, but I haven’t tested it on bigger ones.
My latest version runs on Linux, with dotnet, and Visual Studio Code, and it’s a command line program. It will probably also work with the Visual Studio Community Edition on Windows 10, and I can put it on GitHub if you want.
Interesting. I see both indexes in the SQLite database, but I also see a sqlite_autoindex_… index in each table. So I assumed (without thinking too much, I have to confess) that the tables in the SQLite database are indexed. Reading what you say about rowid and primary keys, I guess this assumption was quite unfounded and simply wrong.
I don’t understand how I would issue an EXPLAIN QUERY PLAN command when the re-import from SQLite into Gramps is running. Could you please add a few more details to your proposal?
thank you very much for your kind offer but I already have some similar programs that modify the XML file directly. You don’t happen to have a software that uses the SQLite database as input and builds a XML file (.gramps File) from the SQLite database? That XML file could then be imported into Gramps and voila, all my problems would disappear
Ok, I see your point. If I understand your approach correctly, I do not see how you could do “advanced SQL stuff” (e g a JOIN of tables) in the XML file. Is this understanding correct or did I miss something?
You would do this before running the import, within your SQLite environment, for example:
sqlite> explain query plan select * from person where handle = ‘x’;
You could also see if generating statistics via ANALYZE makes any difference.
And only if necessary, use CREATE INDEX commands.
Maybe the import queries are already performing as well as possible, and the slowness is somewhere else. I was just trying to think of ways that the SQL import would be different from the XML import and that seemed like a possibility.
Dotnet has a built in query language called LINQ, which works on collections, which allows joins too, and there may be similar functions for XML documents too, but most of my hacks don’t need such things, so I haven’t tested much in that area.
You can do a lot of things with XPath expressions, if you want.
I just ran a small test with Visual Studio running on Windows 10 inside a virtual machine, and it could load a test tree with more than 300,000 persons in a few seconds. And once the XML document is loaded, you can do all sorts of weird things in memory.
This is of course not what you’d normally do, but with the current amounts of RAM, it is too easy to ignore.
Ok, I see your point. And your thinking is very much appreciated . I have never thought of SQLite and the tables there being the problem and was always focused on Gramps and the tables there as the bottleneck, but your arguments make very much sense.
Most of the “tweaking” is “string stuff” as I call it, i e searching for strings or part of strings in the XML export, replacing with another string, and re-importing the modified XML file. In SQL lingo, this would be analogous to UPDATE commands in a SQL database.
Another “tweak” is the mass creation of objects. For some reason I need a note of a specific type attached to some 2300 places. So I wrote a SQL script for the SQLite export that identifies those 2k places in need of a note. Currently I’m developing a short R script that will use the data from the SQL script, read the .gramps XML export and create the notes in the XML file. In principle this is quite straight forward but debugging a XML file with more than 3 mio lines is really a pain in the ass.
Most of what I do is the same, like finding given names that end with ‘von’ or ‘von der’, and moving those parts to the surname. I can do that by writing a foreach statement that loops through the names of all person nodes in the XML tree, and writes the changes to those same nodes.
You could do the same for the places, but you’d need to do some reverse engineering to figure out how to create a new note XML object, with a unique ID and handle.
I did this in C#, because I had to work with that language professionally, and I didn’t like working with different languages at home and at work.
Do you have an idea about what makes the SQL import so slow? My instinct says that it’s the part that writes new objects to the database, but it may also be the random reads that are required to pick rows from the normalized database, before they are re-assembled into the Python objects that are then pickled (nice term) and written to Gramps’ own database.
Right, but as I said: this is quite straight forward. The structure of the XML file is much simpler than the schema of the SQLite database (even for somebody like me who has only a very vague understanding of XML files). Since R has some fantastic packages I’m using, my own work are basically 20 or 30 lines of code and that’s it. So I now have the 2k new notes linked to the places.
Not really. As I said my thoughts were similar to yours, i e that the Gramps part is somehow the bottleneck. But George is right to point out that the very first step is pulling the data from the SQLite database and that may be the problem if there are no indexes. I’ll play around a bit over the next few days.
No, sorry, I do not “speak” Python. As you said: I don’t like working with different languages, so I have yet to find a really convincing reason to switch to Python. I’m a hobby programmer, so my portfolio of Visual Basic, SQL (with the help of a graphical interface to develop code), and R must be enough
Right! It’s a good idea to stick with the tools that you know, and I just found that there are also toolkits to let R process XML.
Backend wise, my personal wish is that we move to a relational model that can interface with SQLite, or any other backend you like, so that Gramps stays as simple as possible. RootsMagic has such a relational model, implemented on SQLite, and it’s one of the fastest programs around, although GeneWeb is even faster. Geneanet runs on that.
I have to confess that I played around a bit with the R packages for XML but got nowhere. So I decided to stick with the string tools I know. Kind of a brute force approach, of course, and not very sophisticated, but a least this approach produces results.
Backend wise, I fully agree with you. Gramps desperately needs a good database backend for really large databases. I feel it’s debatable if SQLite could be such a backend. My preference would be PostgreSQL but I’m really no database specialist.
Technically speaking, I think that other backends can perform better, because they don’t store all data in a single big file, like SQLite does, but I like that single file approach, because users don’t need to install anything, and don’t need to configure things either.
And my experience says that SQLite can work pretty fast. I have a Charlemagne GEDCOM with more than 600,000 persons in it, which imports much faster into RootsMagic than it does in Gramps, and opens faster in that too. And an old program like PAF is also much faster with that, using a format that I don’t know much about, but which seems to be custom. And that’s also a single file. The fun part of that is that I can still open that database on a laptop with 64 MB (yes, MegaBytes) of RAM.
And this suggests that the whole backends discussion is nice for developers, but is probably quite meaningless for users, who may be more interested in speed.
The main problem that we have now is that the main database has pickled Python objects in it, which are quite hard to access with other languages.
Ok, the single file argument has something to it, but if you for example install PostgreSQL on Win10 as a fast LOCAL database, you won’t notice any difference. The configuration issues start only if you want to access the database over your local network or over the Internet, but let’s be fair, those issues are networking issues and not real database issues. So I’d clearly prefer a faster and more sophisticated database backend. But having said this, SQLite will probably be more than appropriate.
Again I agree. The pickles are a classical IT strategic dead end and the faster they can be dropped, the better for the whole community. I guess they made a lot of sense 10 years or so ago but today, I do not see much benefit using them. But ok, I’m not a Python programmer – may be the world looks different from their perspective.