Discussion: minimal, anonymized, opt-out telemetry for Gramps Web API

This is a question to existing Gramps Web users:

How would you feel about Gramps Web (API) adding anonymized telemetry that would send a unique identifier of the instance/tree (not based on any of the data in the tree or the installation) plus the current API version number to a statistics endpoint by default, unless you opt out by changing a flag in the configuration?

Here is the rationale for this idea:

  1. We currently have no idea how many users are actively using Gramps Web, but the number is clearly increasing. From docker pull counts, I would estimate we currently have well over 5000 installations, but it’s very hard to tell. For comparison, Webtrees has detailed statistics from their telemetry (which AFAIK is not even opt-out).
  2. Apart from just being nice to know, a strong reason why knowing the rough number of active installations is important is map tiles. I want to keep having free map tiles as a default, but we don’t want to overload a free tile service with a huge number of users. (By the way, Grampshub uses the subscription fees to pay for tiles from MapTiler.)
  3. Opt-out rather than opt-in is purely for statistical reasons: I assume only a small fraction of users would actively enable statistics (out of mere inertia), and the fraction of users that enable it would be impossible to determine, rendering the statistics useless.

The reason for starting this thread is to get an impression of whether users think this is reasonable.

Technically, it would mean that, roughly every 24 hours, a JSON of this form would be sent to a statistics server:

{
    "tree_uuid": "86f9fe91-3d5d-4ba8-acd2-d9ebaedf563e",
    "server_uuid": "dd5a8089-e3d4-4208-a75a-5a7e4afb39eb",
    "api_version": "3.1.0"
}

The UUIDs would be stored in a cache directory, which would make it possible to configure, for example, resetting them every x days. Disabling telemetry completely would also be possible.
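To make the idea concrete, here is a minimal sketch of how such a payload could be built. It is not the actual Gramps Web API implementation: the function names, the file names inside the cache directory, and the `max_age_days` parameter are all assumptions made for illustration.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Optional


def get_or_create_uuid(cache_dir: Path, name: str,
                       max_age_days: Optional[float] = None) -> str:
    """Return the UUID stored in cache_dir/name, creating or rotating it.

    Hypothetical helper: the file layout is an assumption, not part of
    the proposal.
    """
    path = cache_dir / name
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        # Reuse the stored value unless a reset interval is configured
        # and the file is at least that old.
        if max_age_days is None or age_days < max_age_days:
            return path.read_text().strip()
    # Random (version 4) UUID: not derived from tree data or the installation.
    new_id = str(uuid.uuid4())
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(new_id)
    return new_id


def build_payload(cache_dir: Path, api_version: str,
                  max_age_days: Optional[float] = None) -> str:
    """Build the JSON document with the three fields described above."""
    payload = {
        "tree_uuid": get_or_create_uuid(cache_dir, "tree_uuid", max_age_days),
        "server_uuid": get_or_create_uuid(cache_dir, "server_uuid", max_age_days),
        "api_version": api_version,
    }
    return json.dumps(payload)
```

Calling `build_payload` repeatedly returns the same UUIDs until the cached files are deleted or become older than `max_age_days`, at which point fresh random ones are generated.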


My thoughts:

It would be very interesting and useful to get some sort of idea how
Gramps is used. A bit of telemetry should not be a problem. I think
there are a few requirements:

  1. Make the opt-out easy to find.
  2. Run in the background. No impact to performance.
  3. Fail silently.
  4. Log the attempts.
  5. It would be useful to include operating system.
  6. Include whether running as a docker/container/snap/flatpak/appimage
    (etc) or as a native executable.
  7. Perhaps have several levels of reporting, similar to what the KDE
    project has. Default would be the basics. The next level might include
    my items 5 and 6, and another level with additional data such as size of
    the active database.
  8. If opt-out is chosen, then send one last item requesting database purge.

As a hobbyist, I have no concerns about sharing the data you mentioned. I imagine that more usage data may be collected in the future to learn about usage of the product (I understand the value of that), so it would be important to publish the data collection policy on a web site reassuring users of anonymity, privacy and security of the data. Also, do you know whether GDPR or other regulations apply to such data collection?

Finally, would it be better to send data only if one of the three collected values has changed?


My naive understanding is that as long as no personally identifiable information is collected, GDPR does not apply.

I don’t think sending data only on change makes sense. The UUIDs are supposed to be constant, so they will not change by definition. And the version might be constant for a long time if there is no new backend release. Since the point is to report active installations, I think it makes sense to do it at a fixed interval.

My idea to implement it in practice without having to set up a cronjob (or celery beat) would be:

  • On every request, in before_request, check when the last ping was sent. If it was less than 24 hours ago or the disable flag is set → do nothing.
  • If it was more than 24 hours ago, dispatch a telemetry background task to Celery and continue with the request → the request duration is not noticeably increased, as dispatching should only take milliseconds and does not have to wait for a remote server to respond.
  • Telemetry background task: post the above JSON to the telemetry endpoint.
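The steps above could be sketched roughly as follows. This is a framework-agnostic illustration using only the standard library: a daemon thread stands in for the Celery task, the last-ping timestamp is kept in memory rather than in shared storage, and the `TelemetryGate` class and its parameters are hypothetical names. In a Flask app, `on_request` would be called from a before_request hook.

```python
import threading
import time
from typing import Callable, Optional

PING_INTERVAL_SECONDS = 24 * 60 * 60  # "roughly every 24 hours"


class TelemetryGate:
    """Decide on each request whether a telemetry ping is due (hypothetical)."""

    def __init__(self, send: Callable[[], None], disabled: bool = False,
                 interval: float = PING_INTERVAL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._send = send            # posts the JSON to the endpoint
        self._disabled = disabled    # the opt-out flag
        self._interval = interval
        self._clock = clock
        self._last_ping: Optional[float] = None
        self._lock = threading.Lock()

    def on_request(self) -> bool:
        """Call at the start of every request; True if a ping was dispatched."""
        if self._disabled:
            return False
        now = self._clock()
        with self._lock:
            if (self._last_ping is not None
                    and now - self._last_ping < self._interval):
                return False
            self._last_ping = now
        # Dispatch in the background so the request never waits for the
        # remote server (stand-in for a Celery task).
        threading.Thread(target=self._safe_send, daemon=True).start()
        return True

    def _safe_send(self) -> None:
        try:
            self._send()
        except Exception:
            pass  # fail silently; telemetry must never break the app
```

With several worker processes, the in-memory timestamp would have to be replaced by something shared, for example a file in the cache directory, otherwise each worker would send its own ping.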

Any sort of telemetry tends to make me squeamish; as the saying goes,
“it’s not paranoia if they really are out to get you” :).

That being said, this info seems pretty innocuous and I would possibly
even opt-in. Before doing so, however, I think there would need to be
some things done:

  1. Provide a very clear, easily found policy and description of
    exactly how this info is generated, collected, and used.

  2. The opt-in/opt-out question must be clear and asked for upon
    installation, and on subsequent upgrades as a reminder that it is
    happening.

  3. Once a day seems far too often to me. Once a week seems better,
    like Debian popcon, but that’s a personal opinion. It kind of depends
    on the next item, I think …

  4. What does “active” mean in this case? If you’re just trying to
    count the number of installations, sending the JSON once after an
    installation is running (with approval) seems to be sufficient. If
    you’re trying to see how often that particular site is actually being
    used (logins, changes to data, queries to the db, and so on), then I
    don’t see how this will tell you that. Just as an example, my site
    can go for days, even weeks, without any activity as other research
    gets done (sometimes I even shut down the cloud instance). Is that
    “active” or not?

Just my $0.02 worth …


Maybe GDPR applies: telemetry data + the IP address the telemetry data is coming from :thinking:

That’s why I think the IP must not be stored.

I think pinging only once doesn’t work, because (as can be seen on the forum) people often wipe their installs and start again. In fact, being able to do that and being able to port the data is an important feature. If we only ping once after install, the number of installations would rise monotonically, but would be completely unrealistic. To get a realistic estimate of the instances actually running, we need some “keepalive” signal to also detect if instances disappear.

That being said, I am not too attached to the daily rhythm; weekly could indeed work as well.
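As a rough illustration of why a keepalive gives a realistic count, here is a sketch of what the statistics server could do with the pings. Everything in it is an assumption: the in-memory dictionary, the function names, and the 7-day window are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def record_ping(last_seen: Dict[str, datetime], server_uuid: str,
                now: datetime) -> None:
    """Remember when an instance last reported in.

    Only the UUID and a timestamp are kept; the sender's IP address is
    deliberately not stored.
    """
    last_seen[server_uuid] = now


def count_active(last_seen: Dict[str, datetime], now: datetime,
                 window: timedelta = timedelta(days=7)) -> int:
    """Count instances that pinged within the window (hypothetical default)."""
    return sum(1 for ts in last_seen.values() if now - ts <= window)


# Usage: an instance that was wiped a month ago drops out of the count,
# while one that is still running stays in.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
seen: Dict[str, datetime] = {}
record_ping(seen, "wiped-instance", now - timedelta(days=30))
record_ping(seen, "running-instance", now - timedelta(days=1))
active = count_active(seen, now)
```

With a single ping after installation, both instances would be counted forever. With a daily or weekly ping, the window only needs to be somewhat longer than the ping interval.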


Ah, okay. I see what you’re trying to capture now. That makes sense, then. “Active” means “still running” vs “I’ve just installed it recently.”
