Sure. Who’d that be and how can I discuss the issue with him?
I see you are already talking to @DaveSch on the bug tracker.
Would you be willing to share your database schema, if it’s significantly different from the one created by the SQLite export? I’ve used the latter and created some views to hide the complexity of the joins, but am always interested in alternatives.
It’s very different! The SQLite export schema is highly normalized, which makes it necessary to work with many joins to achieve anything. From a theoretical IT perspective this is a very good thing, but it does not make life easy for a developer. My PostgreSQL schema mimics the objects in the XML file and the different views in Gramps itself (e.g. the person view, the event view and so on), which makes life a lot easier, in particular if views are added on top of the tables.
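To illustrate the idea, here is a minimal sketch of such a view against a toy sqlite3 database. The table and column names are invented; both the real export schema and my PostgreSQL schema look different.

```python
import sqlite3

# Hypothetical, heavily simplified tables -- just to show how a view can hide the join.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, given TEXT);
CREATE TABLE event  (handle TEXT PRIMARY KEY, person_handle TEXT,
                     type TEXT, date TEXT, place TEXT);

-- The view reads like the Person view in Gramps; the join stays out of sight.
CREATE VIEW person_view AS
SELECT p.handle, p.surname, p.given,
       e.date AS birth_date, e.place AS birth_place
FROM person p
LEFT JOIN event e ON e.person_handle = p.handle AND e.type = 'Birth';
""")

con.execute("INSERT INTO person VALUES ('H1', 'Smith', 'Anna')")
con.execute("INSERT INTO event VALUES ('E1', 'H1', 'Birth', '1850-03-01', 'Bristol')")
print(con.execute("SELECT * FROM person_view").fetchall())
# [('H1', 'Smith', 'Anna', '1850-03-01', 'Bristol')]
```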
In which form would you like to get the schema? As SQL code or just as printout/screenshot?
SQL code would be great, thanks. I’ll message you with my email address.
Oh, interesting discussion going on here that I missed so far.
@UlrichDemlehner, @Nick-Hall, @StoltHD - a lot of the poor scaling behaviour that shows up when switching from SQLite to Postgres has nothing to do with the way the Gramps database is structured or whether we use BLOBs or JSON. It is simply due to the fact that there is a lot of code in Gramps that loops over individual database calls for every single object in the Gramps database, which can lead to several thousand individual SELECT statements, so even a very small network lag (since Postgres uses a client/server architecture while SQLite is just a file) can add up to a delay of minutes.
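To make the pattern concrete, here is a toy sqlite3 sketch (not actual Gramps code; the table and column names are made up) of the difference between one SELECT per object and a single batched SELECT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [(f"H{i}", f"Surname{i}") for i in range(10_000)])

handles = [f"H{i}" for i in range(0, 10_000, 20)]   # the 500 objects we actually need

# Pattern 1: one SELECT per object -- hundreds or thousands of round trips.
# Cheap against a local SQLite file, but every round trip pays the network
# latency once the backend is a PostgreSQL server.
rows = [con.execute("SELECT surname FROM person WHERE handle = ?", (h,)).fetchone()
        for h in handles]

# Pattern 2: one statement, one round trip -- the server does the work.
placeholders = ",".join("?" * len(handles))
rows = con.execute(
    f"SELECT handle, surname FROM person WHERE handle IN ({placeholders})",
    handles).fetchall()
```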
See for example my comment here: Should the Plugin Registration fail gracefully for missing Prerequisites? - #14 by DavidMStraub
… as well as this Gramps Web API issue, which was making filtering/sorting on Postgres unbearably slow:
@UlrichDemlehner was the problem with the “can’t decode start byte” actually solved?
That’s exactly what I keep saying (and we both discussed this a few months ago in the German Gramps forum). The existing frontend code was written to match the database backend technology available with BSDDB (e.g. with BLOBs) some 15 or so years ago. Maybe it was not very efficient even at that time and with BSDDB, since it didn’t use any indices in BSDDB (at least that is what I’ve been told). But that’s history and not my point. My point is that database backend technology has improved tremendously if you compare BSDDB with PostgreSQL or SQLite. So if the Gramps team wants to leverage those improvements, it must adapt the frontend code better to the available database backends. This will very probably mean that the code should avoid storing BLOBs in the database, since that doesn’t leverage anything. And if your code analysis shows that there are many loops over many calls for objects stored in the database (those objects probably being serialized Python objects in BLOBs), an obvious step forward would be to figure out how to avoid those “pennies and small change” loops in the frontend code and instead let the database backend do what database backends were invented for.
I haven’t seen any performance differences between PostgreSQL and SQLite, which probably means that latency (on localhost?) is not an issue at all.
Yes, see above Error using PostgreSQL as database backend with Gramps v5.2.2 - #13 by ennoborg. The problem was the SharedPostgreSQL option.
I can’t say anything about indices in Gramps’ BSDDB tables, because I haven’t seen those in years, but in SQLite, we have these:
And here you can see that, apart from the indexes on gramps_id for all tables, we also have some for person and place names, source titles, citation page fields, and media descriptors. That means that some queries can be fast, including finding persons with similar names, but once you need their vitals to avoid false duplicates, you’re stuck.
The consequence is that, in reality, finding duplicate persons is at least a hundred times slower than in programs like PAF or RootsMagic, which rely on fully relational databases stored in a single file.
There are ways around this, like creating dictionaries in memory, but those only work if they fit, and they also need extra maintenance in the code.
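As a rough illustration of that workaround (a toy schema, not the real Gramps tables): one pass over the table builds a dictionary keyed by handle, and the duplicate check then never goes back to the database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, birth_date TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [("H1", "Smith", "1850"), ("H2", "Smith", "1850"), ("H3", "Jones", "1900")])

# One pass over the table builds the cache. The trade-offs are exactly the ones
# mentioned above: it has to fit in memory and has to be kept in sync in code.
vitals = {handle: (surname, birth)
          for handle, surname, birth in
          con.execute("SELECT handle, surname, birth_date FROM person")}

def possible_duplicate(h1, h2):
    # Pure in-memory comparison; no further SELECT per candidate pair.
    return vitals[h1] == vitals[h2]

print(possible_duplicate("H1", "H2"))  # True
print(possible_duplicate("H1", "H3"))  # False
```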
Our frontend code was written before we even had a database backend. The fundamental part of Gramps is its object library and the database acts as an object store. BSDDB fulfilled this function well for many years.
More to the point is that the code should use indexes. This has been discussed in depth in the past. In May 2016 I wrote a prototype to explore using the indexes of SQL backends in particular.
Having said that, you are also convincing me that using JSON instead of BLOBs for storing our objects is a good idea.
That’s my point. If you want to benefit from the capabilities of the database backend, you have to write your frontend in a way that actively uses those capabilities. It doesn’t help very much if you speak perfect English and the person on the phone you’re talking to speaks only German.
At the end of the day, this will probably mean that Gramps needs quite a fundamental rewrite of the code. It was my hope that the Gramps Web project would do this, since they have to rewrite a lot of things anyway …
Yes, and there are many opportunities for optimization. Another example is my patch of PrivateProxyDb that used the private SQL column and sped up some operations by orders of magnitude, see Speed up iter handles in private proxy db by DavidMStraub · Pull Request #418 · gramps-project/gramps-web-api · GitHub. My implementation back then was not architecturally acceptable so wasn’t merged, but an implementation along the lines of what Nick suggested in this PR shouldn’t be difficult. (In Web API, the problem was avoided in the meantime by using a custom subclass, see Speed up iter handles in private proxy db by DavidMStraub · Pull Request #418 · gramps-project/gramps-web-api · GitHub.)
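The underlying idea, sketched against an invented table with a private column (the actual schema and the PR’s implementation differ in the details), is simply to let SQL do the filtering instead of unpacking every object in Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, private INTEGER)")
con.execute("CREATE INDEX idx_person_private ON person(private)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [("H1", 0), ("H2", 1), ("H3", 0)])

# Slow pattern: fetch every row and filter on the Python side.
public_slow = [h for h, private in
               con.execute("SELECT handle, private FROM person") if not private]

# Faster pattern: the WHERE clause on the private column does the filtering,
# so only the public handles ever leave the database.
public_fast = [h for (h,) in
               con.execute("SELECT handle FROM person WHERE private = 0")]

print(sorted(public_slow) == sorted(public_fast) == ["H1", "H3"])  # True
```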
SharedPostgreSQL works perfectly fine if set up correctly (FYI Grampshub runs on SharedPostgreSQL).
I think the real issue you encountered, @UlrichDemlehner, is that there is a bug in psycopg2 which doesn’t show the correct error message when something about the credentials (username, password, database name, port) is wrong.
When we talk about not using BLOBs, we of course mean to utilize the features that a relational database back-end actually provides, and part of that will of course be to write or optimize the front-end code to actually use those features…
One of the most obvious reasons for this is precisely to avoid unnecessary responses with thousands of lines of data when only a few lines or objects are all that is needed.
I completely agree that it is a waste that large amounts of data are constantly being transferred between a relational database and Gramps as a front-end program when it is not necessary, and that this has nothing to do with the type of database being used (if we disregard pure flat-file systems).
Gramps was made based on different “backend technology”.
I used the MongoDB solution as long as the drivers that were in Mantis could be used, and the problem was exactly the same there. The difference was that I could actually go in and create my own queries in the database, even from other software, and those queries, even when I listed the entire dataset, took only a fraction of the time it took to get the equivalent data into Gramps. And this was with a relatively small database: around 3,000 people, about 12,000 events, and 80,000 place names (I was testing a pure Norwegian place name database for sharing, but it took too long to fetch data into Gramps, so I chose not to share the XML file or the database at that time, in the hope that this would change).
And now that BSDDB is no longer in use and the choice has become a relational database, shouldn’t some time be spent on making changes in Gramps that use the strengths of a relational database as a back-end?
It becomes a bit backwards if an import/export to and from XML takes less time than a refresh of the data in the Place view, Event view, or Person view in Gramps?
Or if it takes longer to generate a graphical family tree in Gramps than it takes to export the data to a GEDCOM and open it in, for example, yEd?
And it really shouldn’t be necessary to spend 2-3 days to create a report from Gramps, even if you have a few hundred thousand objects in the database?
I’m just thinking that one of the strengths of both SQLite and PostgreSQL is to generate reports, or data for reports…
And just to be clear, I don’t have enough data and don’t create reports with extreme queries, so I have no problems with the speed as it is today, except for a couple of DOT reports that crash if they are defined with too large a page size in dpi. Or they did; I don’t know if that’s fixed.
And even though we don’t always write it in every single sentence, we KNOW that there is a lot of work to make changes or to get new features to work.
Edit: Sorry for any grammatical errors, my neck and head are not at their best today.
I fully understand that, and therefore I’m not discussing the history. I try to figure out what the future could look like, and therefore I’m pointing out issues that appear important to address. But of course I do not know the Gramps source code and I do not know anything about coding in Python, so I might simply be pointing out the wrong issues.
No. The code should talk to the database backend in a way that allows the backend to use the indexes it has. If the code talks to the backend in terms of BLOBs (give me the BLOB, store that BLOB), the backend will do exactly that, and that’s the end of the story. As Enno pointed out, the frontend should instead ask for persons A, B, C with all their relevant properties (birth date and place, alternative names and so on), because that will allow the backend to leverage the indexes.
I built a really simple frontend application that communicates with the PostgreSQL backend in exactly that way, and that frontend has response times (i.e. time to query plus time to format and display the query result) significantly below a second. At the end of the day, a table with 100k rows is small change for a modern PostgreSQL database.
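For what it’s worth, even a toy sqlite3 database behaves the same way when queried like that. The sketch below uses an invented flat person table with 100k rows; the filter and sort happen in the database, with an index, and only the needed columns come back:

```python
import sqlite3, time

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE person_flat (
    handle TEXT PRIMARY KEY, surname TEXT, given TEXT,
    birth_date TEXT, birth_place TEXT)""")
con.executemany(
    "INSERT INTO person_flat VALUES (?, ?, ?, ?, ?)",
    [(f"H{i}", f"Surname{i % 500}", "Anna",
      str(1800 + i % 200), f"Place{i % 100}") for i in range(100_000)])
con.execute("CREATE INDEX idx_birth_place ON person_flat(birth_place)")

# "Give me the persons I need, with the properties I need" in one statement.
t0 = time.perf_counter()
rows = con.execute("""
    SELECT surname, given, birth_date
    FROM person_flat
    WHERE birth_place = ?
    ORDER BY surname""", ("Place42",)).fetchall()
print(len(rows), f"{time.perf_counter() - t0:.4f}s")  # 1000 rows, a few milliseconds
```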
I think we are talking about the same thing. The database query should use the appropriate indexes. As Enno pointed out, we already create indexes - we just don’t use them. My prototype back in 2016 got its increased performance by using indexes whilst still storing objects in BLOBs.
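That approach can be sketched roughly like this (invented schema, not the prototype itself): the pickled object stays in a BLOB column, and a duplicated, indexed surname column lets the WHERE clause be answered from the index before anything is unpickled.

```python
import pickle, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, blob_data BLOB)")
con.execute("CREATE INDEX idx_person_surname ON person(surname)")

# Stand-in for a serialized Gramps object; note the surname is stored twice --
# once inside the pickle and once in the indexed column.
obj = {"handle": "H1", "surname": "Smith", "birth": "1850"}
con.execute("INSERT INTO person VALUES (?, ?, ?)",
            (obj["handle"], obj["surname"], pickle.dumps(obj)))

# The WHERE clause is answered from the index; only matching rows are unpickled.
matches = [pickle.loads(blob) for (blob,) in
           con.execute("SELECT blob_data FROM person WHERE surname = ?", ("Smith",))]
print(matches)
```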
I am in favour of storing our objects as JSON rather than BLOBs. It would avoid the need to duplicate data in the index columns and it would also make the data accessible without the python pickle library.
Our fundamental object model is not relational and I wouldn’t want to change that. However, we can still utilise the strengths of different backends. For example we could introduce a new method to query data, or perhaps return a view consisting of multiple objects. We should also bear in mind that some people may want to use backends other than SQLite and PostgreSQL. Databases such as ArangoDB have been suggested.
I am in favour of storing our objects as JSON rather than BLOBs
I think the question to answer here is: does the database backend support indexing of JSON? If the answer is no (or perhaps “only limited”) then I cannot see a performance improvement.
Yes. SQLite supports the indexing of JSON objects.
I think the question to answer here is: does the database backend support indexing of JSON? If the answer is no (or perhaps “only limited”) then I cannot see a performance improvement.
Both SQLite and PostgreSQL support indexing of JSON stored in tables; they also support queries against views of JSON objects, as far as I could find…
Multi-model databases like ArangoDB or OrientDB also support views and indexes, although there it would be more natural to store the JSON objects directly; they can be queried with a SQL-like query syntax…
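For the SQLite side, a quick way to see this is an expression index on json_extract(); in PostgreSQL the rough equivalent would be an expression index on a jsonb field or a GIN index. A toy sketch with invented field names:

```python
import json, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")

# An expression index on one element of the JSON object.
con.execute("""CREATE INDEX idx_person_surname
               ON person (json_extract(json_data, '$.surname'))""")

con.execute("INSERT INTO person VALUES (?, ?)",
            ("H1", json.dumps({"surname": "Smith", "birth": "1850"})))

# The query names only that one element, not a fully specified object, and the
# planner can answer the WHERE clause from the expression index.
rows = con.execute("""SELECT handle FROM person
                      WHERE json_extract(json_data, '$.surname') = ?""",
                   ("Smith",)).fetchall()
print(rows)

# EXPLAIN QUERY PLAN shows whether the index is actually used.
print(con.execute("""EXPLAIN QUERY PLAN
                     SELECT handle FROM person
                     WHERE json_extract(json_data, '$.surname') = 'Smith'""").fetchall())
```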
Yes. SQLite supports the indexing of JSON objects.
What does that mean? Does it speed up searches on individual elements inside the JSON object? Or would a query still have to include a fully specified object to work?
And, more importantly: is the advantage big enough to compensate for a delay in the migration to a fully relational model?