Fixing Proxies properly

@Nick-Hall @SteveY @DavidMStraub @kulath : just letting you know that I am planning on fixing the root of our type issues (and fixing lots of issues along the way). Currently, we have an issue with proxies that depend on get_ITEM_from_handle() as a filter and possibly returning None. I’ll post a proposed fix here (below).

The plan (in order) is to:

  1. Fix proxies (no returning None allowed on getting objects); then:
  2. https://github.com/gramps-project/gramps/pull/1934 - Minimal code changes to support type hinting of db methods; then:
  3. https://github.com/gramps-project/gramps/pull/1919 - Add type hints to db classes

Gramps Proxy Refactor Plan

Problem

Gramps supports “proxy” databases that wrap a real database (or another proxy) and
filter out certain items. For example:

  • PrivateProxyDb hides objects marked private
  • LivingProxyDb hides living people
  • FilterProxyDb hides objects not matching a filter

The current implementation has a fundamental API flaw: when you request a filtered
object by handle, the proxy returns None. This means every caller must defensively
check:

person = db.get_person_from_handle(handle)
if person is not None:
    ...

Worse, a caller may obtain a handle from a cross-reference list (e.g., iterating
family.child_ref_list) and then look it up only to receive None, because the
reference list was never cleaned of filtered handles. This is the root cause: the
proxy filters at lookup time but not at reference-list return time, so filtered
handles leak into the caller’s hands.


Solution Overview

Two cleanly separated layers of filtering:

  1. include_*(handle) methods — binary gate: should this entire object be
    visible? Used to raise HandleNotFoundError on direct filtered lookups, and to
    strip filtered handles from cross-reference lists in returned objects.

  2. sanitize_*(data) methods — attribute-level cleanup on a DataDict: given a
    visible object’s raw data, strip sub-attributes that should not be shown (e.g.,
    private notes on a public person, private citations, private associations).

Together these guarantee: a caller who receives an object from the proxy will never
hold a handle that resolves to None or raises unexpectedly.
Every handle in every
reference list has already been validated as included.


Working Medium: DataDict instead of reconstructed objects

The key implementation insight is to work on DataDicts (returned by
get_raw_*_data()) rather than reconstructed full objects (like Person()).

The underlying real database’s get_raw_person_data(handle) returns a DataDict
a dict subclass that supports direct attribute access (.family_list,
.child_ref_list, .father_handle, etc.) and is much cheaper to create than a
full Person object.

Revised call chain

Instead of the current chain where get_raw_person_data calls
get_person_from_handle:

(current) get_raw_person_data → get_person_from_handle → object_to_data

The proxy reverses this: the core filtering and sanitization happen on the DataDict,
and get_person_from_handle is just a thin wrapper:

(new) get_raw_person_data(handle):
        check include_person(handle) → raise if filtered
        data = self.db.get_raw_person_data(handle)   # DataDict from inner db
        strip filtered cross-refs directly on data
        sanitize_person(data)                         # strip private sub-attrs
        return data

    get_person_from_handle(handle):
        return self.get_raw_person_data(handle)       # DataDict is the object

This means proxies override get_raw_*_data() as the primary method. No full object
reconstruction is ever needed. sanitize_* methods are simple list-filter operations
on the DataDict.

Example: get_raw_person_data in ProxyDbBase

def get_raw_person_data(self, handle):
    if not self.include_person(handle):
        raise HandleNotFoundError(handle)
    data = self.db.get_raw_person_data(handle)
    # Layer 1: strip cross-refs to filtered objects using include_* predicates
    data.family_list = [
        h for h in data.family_list if self.include_family(h)
    ]
    data.parent_family_list = [
        h for h in data.parent_family_list if self.include_family(h)
    ]
    data.event_ref_list = [
        ref for ref in data.event_ref_list if self.include_event(ref.ref)
    ]
    data.person_ref_list = [
        ref for ref in data.person_ref_list if self.include_person(ref.ref)
    ]
    data.note_list = [h for h in data.note_list if self.include_note(h)]
    data.citation_list = [h for h in data.citation_list if self.include_citation(h)]
    data.media_list = [
        ref for ref in data.media_list if self.include_media(ref.ref)
    ]
    # Layer 2: strip private sub-attributes (no-op in base class)
    return self.sanitize_person(data)

Example: get_raw_family_data in ProxyDbBase

def get_raw_family_data(self, handle):
    if not self.include_family(handle):
        raise HandleNotFoundError(handle)
    data = self.db.get_raw_family_data(handle)
    if not self.include_person(data.father_handle):
        data.father_handle = None
    if not self.include_person(data.mother_handle):
        data.mother_handle = None
    data.child_ref_list = [
        ref for ref in data.child_ref_list if self.include_person(ref.ref)
    ]
    data.event_ref_list = [
        ref for ref in data.event_ref_list if self.include_event(ref.ref)
    ]
    data.note_list = [h for h in data.note_list if self.include_note(h)]
    data.citation_list = [h for h in data.citation_list if self.include_citation(h)]
    data.media_list = [
        ref for ref in data.media_list if self.include_media(ref.ref)
    ]
    return self.sanitize_family(data)

Direct attribute access (.family_list, .child_ref_list, .father_handle) is
used throughout, compatible with DataDict objects.


include_* — base class defaults

The base ProxyDbBase defaults to including everything (passthrough):

def include_person(self, handle):     return handle is not None
def include_family(self, handle):     return handle is not None
def include_event(self, handle):      return handle is not None
def include_note(self, handle):       return handle is not None
def include_citation(self, handle):   return handle is not None
def include_source(self, handle):     return handle is not None
def include_media(self, handle):      return handle is not None
def include_place(self, handle):      return handle is not None
def include_tag(self, handle):        return handle is not None
def include_repository(self, handle): return handle is not None

Subclasses override only the types they filter.


sanitize_* — base class defaults

Base class sanitize methods are no-ops on the DataDict, returning it unchanged:

def sanitize_person(self, data):  return data
def sanitize_family(self, data):  return data
# ... etc.

Subclasses override these for attribute-level filtering. Because the working medium
is a DataDict, no object reconstruction is needed — just filter list fields:

# Example: PrivateProxyDb.sanitize_person
def sanitize_person(self, data):
    # Strip private attributes from the person's own attribute list
    data.attribute_list = [
        a for a in data.attribute_list if not a.private
    ]
    # Strip private names
    data.alternate_names = [
        n for n in data.alternate_names if not n.private
    ]
    # Strip private addresses
    data.address_list = [
        a for a in data.address_list if not a.private
    ]
    # Strip private LDS ordinances
    data.lds_ord_list = [
        o for o in data.lds_ord_list if not o.private
    ]
    return data

This replaces the existing large sanitize_person(db, person) standalone function
in private.py which manually reconstructed a new Person() object — that
reconstruction is no longer needed.


Performance: Precomputed Handle Sets

The include_*(handle) predicate is called many times during reference list cleanup.
It must be fast — O(1) set membership.

Database interface additions

Two new methods on the database interface:

  • db.is_filter_override(table, filter_name) — returns True if the database has
    a native SQL implementation for this named filter on this table. Collapses into a
    single call the questions of: does the DB support SQL, and does it have SQL for
    this specific filter? Returns False for proxy databases, non-SQL backends, or
    filters that cannot be expressed in SQL.

  • db.apply_filter(table, filter_name) — runs the filter natively and returns a
    list of matching handles. Convenience wrapper around filter.apply(db) so callers
    don’t need to hold a reference to the filter object. Only call when
    is_filter_override returns True.

Two paths for building the included-handle set

Path 1 — Native SQL override (fast, eager)

When db.is_filter_override('person', filter_name) is True:

def _build_included_persons(self):
    if self.db.is_filter_override('person', self.filter_name):
        self._included_persons = set(self.db.apply_filter('person', self.filter_name))
    else:
        self._included_persons = {
            h for h in self.db.iter_person_handles()
            if self._check_person(h)
        }

include_person(handle) is then always O(1):

def include_person(self, handle):
    return handle in self._included_persons

Path 2 — Python-only predicates (lazy)

For expensive predicates like probably_alive(), build the set lazily on first use:

@property
def _included_persons(self):
    if self.__included_persons is None:
        self.__included_persons = {
            h for h in self.db.iter_person_handles()
            if not self._is_living(h)
        }
    return self.__included_persons

self.db.iter_person_handles() composes naturally with inner proxies — only handles
already passing the inner proxy’s filter are iterated.

Summary by proxy

Proxy Filter is_filter_override? Set-build strategy
PrivateProxyDb 'private' Yes (if wrapping real SQL DB) Eager via apply_filter
FilterProxyDb user-defined Yes (if SQL-expressible and real DB) Eager via apply_filter
LivingProxyDb 'living' No (Python-only) Lazy via iter_person_handles
ProxyDbBase n/a n/a handle is not None — no set needed

Cross-Reference Map

Every get_raw_*_data() method strips filtered handles from the DataDict using
include_* predicates:

Object Cross-references to strip
Person family_list, parent_family_list, event_ref_list[*].ref, person_ref_list[*].ref, media_list[*].ref, citation_list, note_list
Family father_handle, mother_handle, child_ref_list[*].ref, event_ref_list[*].ref, media_list[*].ref, citation_list, note_list
Event place_handle, media_list[*].ref, citation_list, note_list
Citation source_handle, media_list[*].ref, note_list
Source reporef_list[*].ref, media_list[*].ref, note_list
Place placeref_list[*].ref, media_list[*].ref, citation_list, note_list
Media citation_list, note_list
Repository note_list

Subclass Responsibilities After Refactor

PrivateProxyDb

  • Override include_*: build precomputed sets via apply_filter (SQL) or iteration
    (fallback). Each set contains handles of non-private objects.
  • Override sanitize_*: strip private sub-attributes (private attribute_list items,
    private alternate_names, private addresses, private LDS ordinances) directly on
    the DataDict. No object reconstruction needed.
  • Remove existing standalone sanitize_person(db, person) etc. functions —
    replaced by the instance methods operating on DataDicts.

LivingProxyDb

  • Override include_person: lazy set built from probably_alive().
  • Override sanitize_person: for modes 1–3 (name replacement), modify the DataDict’s
    name fields directly (e.g., data.primary_name.first_name = "[Living]") instead
    of constructing a new Person().
  • Remove __remove_living_from_family — now handled by base class get_raw_family_data.
  • Remove __restrict_person — replaced by sanitize_person on DataDict.

FilterProxyDb

  • Override include_*: use existing precomputed sets (self.plist, self.flist,
    self.elist, self.nlist) — keep this pattern.
  • Override sanitize_* for note filtering: replace sanitize_notebase calls with
    direct DataDict note_list filtering.
  • Remove all the per-type get_*_from_handle overrides — now handled by base class.

Files to Change

File Change
gramps/gen/proxy/proxybase.py Add default include_* and sanitize_*; rewrite all get_raw_*_data() to raise + clean refs on DataDict + call sanitize_*; make get_*_from_handle delegate to get_raw_*_data
gramps/gen/proxy/private.py Override include_* (SQL-built sets); override sanitize_* (DataDict field filtering); remove old standalone sanitize_* functions and all get_*_from_handle overrides
gramps/gen/proxy/living.py Override include_person (lazy set); override sanitize_person (modify DataDict name fields); remove __restrict_person, __remove_living_from_family
gramps/gen/proxy/filter.py Override include_* (existing sets); override sanitize_* for note filtering on DataDicts; remove per-type get_*_from_handle overrides
gramps/gen/db/base.py Add is_filter_override(table, filter_name) and apply_filter(table, filter_name) to interface (default returns False / raises)
gramps/plugins/db/dbapi/dbapi.py Implement is_filter_override and apply_filter for SQL backends
gramps/gen/proxy/test/proxies_test.py Verify no cross-reference list contains a filtered handle; test DataDict attribute-level sanitization; test chained proxies

Invariant Guaranteed After This Refactor

For any DataDict returned by a proxy’s get_raw_*_data() (or get_*_from_handle()),
every handle appearing in any cross-reference list will pass the proxy’s
include_* predicate. No caller will ever receive a handle that resolves to
None or raises HandleNotFoundError as a result of following a cross-reference.

Direct lookup of a filtered handle raises HandleNotFoundError immediately and
clearly, rather than silently returning None.

Thank you for giving me the opportunity to actually read this in an easy way :slight_smile:

@dsblank Thanks for all the work in designing this.

I am still going through this to understand it, but meanwhile I have a couple of questions to help my understanding.

(1) What about get_*_from_gramps_id?

(2) Trying to understand how DataDict can be used: can ALL places where a Person object is used equally be passed a DataDict instead?

Tim.

Trying to lookup a person by some random string isn’t guaranteed to give back a person, so the return type will be something like PersonLike | None and callers will still have to check to see if they got one.

Yes, DataDict is a drop in replacement (given that it isn’t a Person type). @DavidMStraub has developed the idea of a PersonLike protocol that would allow either Person or a person DataDIct.

In the future, we can probably just have get_person_from_handle() return a PersonLike DataDict, but there are too many moving pieces to do that right now.

If you want to see this plan turned into code, checkout:

Refactor proxy system: two-layer include_*/sanitize_* filtering by dsblank · Pull Request #2195 · gramps-project/gramps · GitHub

Thanks a lot for this initiative. I think it goes in the right direction, however, I strongly believe this is the wrong order of approaching the problem.

Yes, one of the huge benefits of adding type hints is that it exposes bad design decisions (such as the Proxies returning None instead of raising HandleError).

But I think we should resist the temptation to correct these design decisions prematurely, just to make type hinting easier.

On the contrary, I strongly believe we should type hint our entire codebase before a major refactoring. Yes, it will be annoying because we have to document all those quirks (like returning Person | None), but without it, I honestly think the refactoring is too dangerous and will generate a huge maintenance burden after the fact. I see a flood of bug reports of crashes caused by addons that make assumptions that are no longer valid, etc…

As you note we’d have to add protections everywhere get_XXXX_from handle() is used.

I’m not fixing the proxies to allow type hints, but that is a nice side effect of fixing the bug that exists in the proxies. We fix the root, and then make sure all type hints are respected. All plugins, addons, and core code will need to pass those tests.

@dsblank @DavidMStraub

I don’t understand how a Person object and a person DataDict can be replacements for each other. Surely there must be places where a parameter is passed in to a subroutine, and this is then tested for being a Person object (for example, where an object is passed in, and it may be one of several different types, and the subroutine checks, is this a Person object or a Family object, etc). If a DataDict is passed in, instead of an object, then the code will fail.

(sorry if subroutine and type old fashioned, and so are probably the wrong words, here, but you know what I mean)

Indeed, if there are places in the code that check for a specific type, then that will have to change. That is what the “protocol” would be used for.

But in all other uses, a person DataDict is a drop-in replacement for Person. As they say in Python, if it acts like a duck, then it is a duck (eg, so-called duck-typing). There is some Python magic under the hood to make this possible.

If a DataDict is passed in, instead of an object, then the code will fail.

That’s exactly the type of error that the protocol type hint will catch in static code analysis: functions that will work with both type of objects will be typed as PersonLike (so any Person method that is not defined on DataDict will lead to a typing error), will functions that rely on Person being Person and will fail with data dict should be typed with Person. Feeding a variable that could be a DataDict to such functions will also lead to an error (in static code analysis).

I’m not fixing the proxies to allow type hints, but that is a nice side effect of fixing the bug that exists in the proxies. We fix the root, and then make sure all type hints are respected. All plugins, addons, and core code will need to pass those tests.

I trust your judgement. If all can be done in one go (including not breaking third-party code like addons or Web API), even better.