Sure. Who’d that be and how can I discuss the issue with him?
I see you are already talking to @DaveSch on the bug tracker.
Would you be willing to share your database schema, if it’s significantly different from the one created by the SQLite export? I’ve used the latter and created some views to hide the complexity of the joins, but am always interested in alternatives.
It’s very different! The SQLite export schema is highly normalized, which makes it necessary to work with many joins to achieve anything. From a theoretical IT perspective this is a very good thing, but it does not make life easy for a developer. My PostgreSQL schema mimics the objects in the XML file and the different views in Gramps itself (e.g. the person view, the event view and so on), which makes life a lot easier, in particular if views are added on top of the tables.
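To illustrate the idea, here is a minimal sketch of such a view against a toy sqlite3 database. The table and column names are invented; both the real export schema and my PostgreSQL schema look different.

```python
import sqlite3

# Hypothetical, heavily simplified tables -- just to show how a view can hide the join.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, given TEXT);
CREATE TABLE event  (handle TEXT PRIMARY KEY, person_handle TEXT,
                     type TEXT, date TEXT, place TEXT);

-- The view reads like the Person view in Gramps; the join stays out of sight.
CREATE VIEW person_view AS
SELECT p.handle, p.surname, p.given,
       e.date AS birth_date, e.place AS birth_place
FROM person p
LEFT JOIN event e ON e.person_handle = p.handle AND e.type = 'Birth';
""")

con.execute("INSERT INTO person VALUES ('H1', 'Smith', 'Anna')")
con.execute("INSERT INTO event VALUES ('E1', 'H1', 'Birth', '1850-03-01', 'Bristol')")
print(con.execute("SELECT * FROM person_view").fetchall())
# [('H1', 'Smith', 'Anna', '1850-03-01', 'Bristol')]
```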
In which form would you like to get the schema? As SQL code or just as printout/screenshot?
SQL code would be great, thanks. I’ll message you with my email address.
Oh, interesting discussion going on here that I missed so far.
@UlrichDemlehner, @Nick-Hall, @StoltHD - a lot of the poor scaling behaviour that shows up when switching from SQLite to Postgres has nothing to do with the way the Gramps database is structured or whether we use BLOBs or JSON. It is simply due to the fact that there is a lot of code in Gramps that loops over individual database calls for every single object in the Gramps database, which can lead to several thousand individual SELECT statements, so even a very small network lag (since Postgres uses a client/server architecture while SQLite is just a file) can add up to a delay of minutes.
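To make the pattern concrete, here is a toy sqlite3 sketch (not actual Gramps code; the table and column names are made up) of the difference between one SELECT per object and a single batched SELECT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [(f"H{i}", f"Surname{i}") for i in range(10_000)])

handles = [f"H{i}" for i in range(0, 10_000, 20)]   # the 500 objects we actually need

# Pattern 1: one SELECT per object -- hundreds or thousands of round trips.
# Cheap against a local SQLite file, but every round trip pays the network
# latency once the backend is a PostgreSQL server.
rows = [con.execute("SELECT surname FROM person WHERE handle = ?", (h,)).fetchone()
        for h in handles]

# Pattern 2: one statement, one round trip -- the server does the work.
placeholders = ",".join("?" * len(handles))
rows = con.execute(
    f"SELECT handle, surname FROM person WHERE handle IN ({placeholders})",
    handles).fetchall()
```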
See for example my comment here: Should the Plugin Registration fail gracefully for missing Prerequisites? - #14 by DavidMStraub
… as well as this Gramps Web API issue, which was making filtering/sorting on Postgres unbearably slow:
@UlrichDemlehner was the problem with the “can’t decode start byte” actually solved?
That’s exactly what I keep saying (and we both discussed this a few months ago in the German Gramps forum). The existing frontend code was written to match the database backend technology available with BSDDB (e.g. with BLOBs) some 15 or so years ago. Maybe it was not very efficient even at that time and with BSDDB, since it didn’t use any indices in BSDDB (at least that is what I’ve been told). But that’s history and not my point. My point is that database backend technology has improved tremendously if you compare BSDDB with PostgreSQL or SQLite. So if the Gramps team wants to leverage those improvements, it must adapt the frontend code better to the available database backends. This will very probably mean that the code should avoid storing BLOBs in the database, since that doesn’t leverage anything. And if your code analysis shows that there are many loops over many calls for objects stored in the database (those objects probably being serialized Python objects in BLOBs), an obvious step forward would be to figure out how to avoid those “pennies and small change” loops in the frontend code and instead let the database backend do what database backends were invented for.
I haven’t seen any performance differences between PostgreSQL and SQLite, which probably means that latency (on localhost?) is not an issue at all.
Yes, see above Error using PostgreSQL as database backend with Gramps v5.2.2 - #13 by ennoborg. The problem was the SharedPostgreSQL option.
I can’t say anything about indices in Gramps’ BSDDB tables, because I haven’t seen those in years, but in SQLite, we have these:
And here you can see that, apart from the indexes on gramps_id for all tables, we also have some for person and place names, source titles, citation page fields, and media descriptors. That means that some queries can be fast, including finding persons with similar names, but once you need their vitals to avoid false duplicates, you’re stuck.
The consequence is that, in reality, finding duplicate persons is at least a hundred times slower than in programs like PAF or RootsMagic, which rely on fully relational databases stored in a single file.
There are ways around this, like creating dictionaries in memory, but those only work if they fit, and they also need extra maintenance in the code.
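As a rough illustration of that workaround (a toy schema, not the real Gramps tables): one pass over the table builds a dictionary keyed by handle, and the duplicate check then never goes back to the database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, birth_date TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [("H1", "Smith", "1850"), ("H2", "Smith", "1850"), ("H3", "Jones", "1900")])

# One pass over the table builds the cache. The trade-offs are exactly the ones
# mentioned above: it has to fit in memory and has to be kept in sync in code.
vitals = {handle: (surname, birth)
          for handle, surname, birth in
          con.execute("SELECT handle, surname, birth_date FROM person")}

def possible_duplicate(h1, h2):
    # Pure in-memory comparison; no further SELECT per candidate pair.
    return vitals[h1] == vitals[h2]

print(possible_duplicate("H1", "H2"))  # True
print(possible_duplicate("H1", "H3"))  # False
```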
Our frontend code was written before we even had a database backend. The fundamental part of Gramps is its object library and the database acts as an object store. BSDDB fulfilled this function well for many years.
More to the point is that the code should use indexes. This has been discussed in depth in the past. In May 2016 I wrote a prototype to explore using the indexes of SQL backends in particular.
Having said that, you are also convincing me that using JSON instead of BLOBs for storing our objects is a good idea.
That’s my point. If you want to benefit from the capabilities of the database backend, you have to write your frontend in a way that actively uses those capabilities. It doesn’t help very much if you speak perfect English and the person on the phone you’re talking to speaks only German.
At the end of the day, this will probably mean that Gramps needs quite a fundamental rewrite of the code. It was my hope that the Gramps Web project would do this, since they have to rewrite a lot of things anyway …
Yes, and there are many opportunities for optimization. Another example is my patch of PrivateProxyDb that used the private SQL column and sped up some operations by orders of magnitude, see Speed up iter handles in private proxy db by DavidMStraub · Pull Request #418 · gramps-project/gramps-web-api · GitHub. My implementation back then was not architecturally acceptable so wasn’t merged, but an implementation along the lines of what Nick suggested in this PR shouldn’t be difficult. (In Web API, the problem was avoided in the meantime by using a custom subclass, see Speed up iter handles in private proxy db by DavidMStraub · Pull Request #418 · gramps-project/gramps-web-api · GitHub.)
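The underlying idea, sketched against an invented table with a private column (the actual schema and the PR’s implementation differ in the details), is simply to let SQL do the filtering instead of unpacking every object in Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, private INTEGER)")
con.execute("CREATE INDEX idx_person_private ON person(private)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [("H1", 0), ("H2", 1), ("H3", 0)])

# Slow pattern: fetch every row and filter on the Python side.
public_slow = [h for h, private in
               con.execute("SELECT handle, private FROM person") if not private]

# Faster pattern: the WHERE clause on the private column does the filtering,
# so only the public handles ever leave the database.
public_fast = [h for (h,) in
               con.execute("SELECT handle FROM person WHERE private = 0")]

print(sorted(public_slow) == sorted(public_fast) == ["H1", "H3"])  # True
```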
SharedPostgreSQL works perfectly fine if set up correctly (FYI Grampshub runs on SharedPostgreSQL).
I think the real issue you encountered, @UlrichDemlehner, is that there is a bug in psycopg2 which doesn’t show the correct error message when something about the credentials (username, password, database name, port) is wrong.
When we talk about not using BLOBs, we of course mean to utilize the features that a relational database back-end actually provides, and part of that will of course be to write or optimize the front-end code to actually use those features…
One of the most obvious reasons for this is precisely to avoid unnecessary responses with thousands of lines of data when only a few lines or objects are all that is needed.
I completely agree that it is a waste that large amounts of data are constantly being transferred between a relational database and Gramps as a front-end program when it is not necessary, and that this has nothing to do with the type of database being used (if we disregard pure flat-file systems).
Gramps was made based on different “backend technology”.
I used the MongoDB solution as long as the drivers that were in Mantis could be used, and the problem was exactly the same there. The difference was that I could actually go in and create my own queries in the database, even from other software, and those queries, even when I listed the entire dataset, took only a fraction of the time it took to get the equivalent data into Gramps. And this was with a relatively small database: around 3,000 people, about 12,000 events, and 80,000 place names (I was testing a pure Norwegian place name database for sharing, but it took too long to fetch data into Gramps, so I chose not to share the XML file or the database at that time, in the hope that this would change).
And now that BSDDB is no longer in use and the choice has become a relational database, shouldn’t some time be spent on making changes in Gramps that use the strengths of a relational database as a back-end?
It becomes a bit backwards if an import/export to and from XML takes less time than a refresh of the data in the Place view, Event view, or Person view in Gramps?
Or if it takes longer to generate a graphical family tree in Gramps than it takes to export the data to a GEDCOM and open it in, for example, yEd?
And it really shouldn’t be necessary to spend 2-3 days to create a report from Gramps, even if you have a few hundred thousand objects in the database?
I’m just thinking that one of the strengths of both SQLite and PostgreSQL is to generate reports, or data for reports…
And just to be clear, I don’t have enough data and don’t create reports with extreme queries, so I have no problems with the speed as it is today, except for a couple of DOT reports that crash if they are defined with too large a page size in dpi. Or they did; I don’t know if that’s fixed.
And even though we don’t always write it in every single sentence, we KNOW that there is a lot of work to make changes or to get new features to work.
Edit: Sorry for any grammatical errors, my neck and head are not at their best today.
I fully understand that, and therefore I’m not discussing the history. I try to figure out what the future could look like, and therefore I’m pointing out issues that appear important to address. But of course I do not know the Gramps source code and I do not know anything about coding in Python, so I might simply be pointing out the wrong issues.
No. The code should talk to the database backend in a way that allows the backend to use the indexes it has. If the code talks to the backend in terms of BLOBs (give me the BLOB, store that BLOB), the backend will do exactly that, and that’s the end of the story. As Enno pointed out, the frontend should instead ask for persons A, B, C with all their relevant properties (birth date and place, alternative names and so on), because that will allow the backend to leverage the indexes.
I built a really simple frontend application that communicates with the PostgreSQL backend in exactly that way, and that frontend has response times (i.e. time to query plus time to format and display the query result) significantly below a second. At the end of the day, a table with 100k rows is small change for a modern PostgreSQL database.
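For what it’s worth, even a toy sqlite3 database behaves the same way when queried like that. The sketch below uses an invented flat person table with 100k rows; the filter and sort happen in the database, with an index, and only the needed columns come back:

```python
import sqlite3, time

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE person_flat (
    handle TEXT PRIMARY KEY, surname TEXT, given TEXT,
    birth_date TEXT, birth_place TEXT)""")
con.executemany(
    "INSERT INTO person_flat VALUES (?, ?, ?, ?, ?)",
    [(f"H{i}", f"Surname{i % 500}", "Anna",
      str(1800 + i % 200), f"Place{i % 100}") for i in range(100_000)])
con.execute("CREATE INDEX idx_birth_place ON person_flat(birth_place)")

# "Give me the persons I need, with the properties I need" in one statement.
t0 = time.perf_counter()
rows = con.execute("""
    SELECT surname, given, birth_date
    FROM person_flat
    WHERE birth_place = ?
    ORDER BY surname""", ("Place42",)).fetchall()
print(len(rows), f"{time.perf_counter() - t0:.4f}s")  # 1000 rows, a few milliseconds
```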
I think we are talking about the same thing. The database query should use the appropriate indexes. As Enno pointed out, we already create indexes - we just don’t use them. My prototype back in 2016 got its increased performance by using indexes whilst still storing objects in BLOBs.
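That approach can be sketched roughly like this (invented schema, not the prototype itself): the pickled object stays in a BLOB column, and a duplicated, indexed surname column lets the WHERE clause be answered from the index before anything is unpickled.

```python
import pickle, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT, blob_data BLOB)")
con.execute("CREATE INDEX idx_person_surname ON person(surname)")

# Stand-in for a serialized Gramps object; note the surname is stored twice --
# once inside the pickle and once in the indexed column.
obj = {"handle": "H1", "surname": "Smith", "birth": "1850"}
con.execute("INSERT INTO person VALUES (?, ?, ?)",
            (obj["handle"], obj["surname"], pickle.dumps(obj)))

# The WHERE clause is answered from the index; only matching rows are unpickled.
matches = [pickle.loads(blob) for (blob,) in
           con.execute("SELECT blob_data FROM person WHERE surname = ?", ("Smith",))]
print(matches)
```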
I am in favour of storing our objects as JSON rather than BLOBs. It would avoid the need to duplicate data in the index columns and it would also make the data accessible without the python pickle library.
Our fundamental object model is not relational and I wouldn’t want to change that. However, we can still utilise the strengths of different backends. For example we could introduce a new method to query data, or perhaps return a view consisting of multiple objects. We should also bear in mind that some people may want to use backends other than SQLite and PostgreSQL. Databases such as ArangoDB have been suggested.
I am in favour of storing our objects as JSON rather than BLOBs
I think the question to answer here is: does the database backend support indexing of JSON? If the answer is no (or perhaps “only limited”) then I cannot see a performance improvement.
Yes. SQLite supports the indexing of JSON objects.
I think the question to answer here is: does the database backend support indexing of JSON? If the answer is no (or perhaps “only limited”) then I cannot see a performance improvement.
Both SQLite and PostgreSQL support indexing of JSON stored in tables; they also support queries against views of JSON objects, as far as I could find…
Multi-model databases like ArangoDB or OrientDB also support views and indexes, although there it would be more natural to store the JSON objects directly; they can be queried with a SQL-like query syntax…
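For the SQLite side, a quick way to see this is an expression index on json_extract(); in PostgreSQL the rough equivalent would be an expression index on a jsonb field or a GIN index. A toy sketch with invented field names:

```python
import json, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")

# An expression index on one element of the JSON object.
con.execute("""CREATE INDEX idx_person_surname
               ON person (json_extract(json_data, '$.surname'))""")

con.execute("INSERT INTO person VALUES (?, ?)",
            ("H1", json.dumps({"surname": "Smith", "birth": "1850"})))

# The query names only that one element, not a fully specified object, and the
# planner can answer the WHERE clause from the expression index.
rows = con.execute("""SELECT handle FROM person
                      WHERE json_extract(json_data, '$.surname') = ?""",
                   ("Smith",)).fetchall()
print(rows)

# EXPLAIN QUERY PLAN shows whether the index is actually used.
print(con.execute("""EXPLAIN QUERY PLAN
                     SELECT handle FROM person
                     WHERE json_extract(json_data, '$.surname') = 'Smith'""").fetchall())
```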
Yes. SQLite supports the indexing of JSON objects.
What does that mean? Does it speed up searches on individual elements inside the JSON object? Or would a query still have to include a fully specified object to work?
And, more importantly: is the advantage big enough to compensate for a delay in the migration to a fully relational model?