Deploying on EC2 instance locks up machine upon starting Docker containers

Hi there! New Gramps user here. I've been playing around with Gramps Web locally for the past few weeks, and it seems like a great piece of software! I'm looking to deploy it to share with my family and ran into a couple of issues when deploying it to an EC2 instance on AWS. For context, I'm a professional full-stack software engineer.

I've followed the steps from the docs here: Docker with Let's Encrypt - Gramps Web. I used the contents of the linked docker-compose.yml and nginx_proxy.conf files, adding in my own values for the VIRTUAL_HOST, LETSENCRYPT_HOST, and LETSENCRYPT_EMAIL environment variables.
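
The relevant part of my docker-compose.yml ends up looking roughly like this (the domain and email are placeholders for my real values; everything else is taken verbatim from the linked file):

services:
  grampsweb:
    # ... image, volumes, etc. exactly as in the linked docker-compose.yml ...
    environment:
      VIRTUAL_HOST: family.example.com       # placeholder for my real domain
      LETSENCRYPT_HOST: family.example.com
      LETSENCRYPT_EMAIL: me@example.com      # placeholder for my real email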

On a new, small EC2 instance (2 vCPUs and 2 GB RAM), I installed Docker, made a new directory, created those two config files, and ran docker compose up -d. Everything starts to run for 5-10 seconds, as seen via the logs (docker compose logs -f), but then the machine locks up. The log output stops, keyboard input stops being received, and eventually the SSH connection drops. I can't log back into the machine without rebooting it through AWS, and sometimes the machine is so deadlocked the reboot command isn't even received; I have to stop and start the instance entirely. Once I regain access to the machine and start the Docker containers again, it locks up again. I've tried this a few times, both reusing the same instance and creating a new EC2 instance, and I've gotten the same result each time.

These are the last logs I see before the output stops:

docker logs output
grampsweb_celery  | [2025-01-12 18:17:26,500: INFO/MainProcess] Connected to redis://grampsweb_redis:6379/0
grampsweb_celery  | [2025-01-12 18:17:26,500: WARNING/MainProcess] /usr/local/lib/python3.11/dist-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
grampsweb_celery  | whether broker connection retries are made during startup in Celery 6.0 and above.
grampsweb_celery  | If you wish to retain the existing behavior for retrying connections on startup,
grampsweb_celery  | you should set broker_connection_retry_on_startup to True.
grampsweb_celery  |   warnings.warn(
grampsweb_celery  |
grampsweb_celery  | [2025-01-12 18:17:26,506: INFO/MainProcess] mingle: searching for neighbors
nginx-proxy-acme  | [Sun Jan 12 18:17:27 UTC 2025] Getting webroot for domain='family.REDACTED.com'
nginx-proxy-acme  | [Sun Jan 12 18:17:27 UTC 2025] Verifying: family.REDACTED.com
grampsweb_celery  | [2025-01-12 18:17:27,517: INFO/MainProcess] mingle: all alone
grampsweb_celery  | [2025-01-12 18:17:27,529: INFO/MainProcess] celery@0db449d1b4f3 ready.
grampsweb         | [2025-01-12 18:17:27 +0000] [12] [INFO] Starting gunicorn 23.0.0
grampsweb         | [2025-01-12 18:17:27 +0000] [12] [INFO] Listening at: http://0.0.0.0:5000 (12)
grampsweb         | [2025-01-12 18:17:27 +0000] [12] [INFO] Using worker: sync
grampsweb         | [2025-01-12 18:17:27 +0000] [13] [INFO] Booting worker with pid: 13
grampsweb         | [2025-01-12 18:17:27 +0000] [14] [INFO] Booting worker with pid: 14
grampsweb         | [2025-01-12 18:17:27 +0000] [15] [INFO] Booting worker with pid: 15
grampsweb         | [2025-01-12 18:17:27 +0000] [16] [INFO] Booting worker with pid: 16
grampsweb         | [2025-01-12 18:17:27 +0000] [17] [INFO] Booting worker with pid: 17
grampsweb         | [2025-01-12 18:17:27 +0000] [18] [INFO] Booting worker with pid: 18
grampsweb         | [2025-01-12 18:17:27 +0000] [19] [INFO] Booting worker with pid: 19
grampsweb         | [2025-01-12 18:17:27 +0000] [20] [INFO] Booting worker with pid: 20
nginx-proxy-acme  | [Sun Jan 12 18:17:27 UTC 2025] Pending. The CA is processing your order, please wait. (1/30)
nginx-proxy-acme  | [Sun Jan 12 18:17:30 UTC 2025] Pending. The CA is processing your order, please wait. (2/30)
nginx-proxy-acme  | [Sun Jan 12 18:17:33 UTC 2025] Pending. The CA is processing your order, please wait. (3/30)

The grampsweb container boots 8 workers, so my hunch is that those workers are running into some sort of resource contention. Why is this happening, and what can I do to prevent it? What other logs or traces could I gather to get more detailed information about what's going on?

Also, about this line in the setup instructions:

On first run, the app will display a first-run wizard

This does not happen. Maybe that's the culprit: something's blocked waiting on terminal input? I'm not sure where that first-run wizard is supposed to appear, since the containers run in the background with the -d option to docker compose up.


Sounds like you could help us improve the deployment docs :laughing:

Could it be you are hitting an OOM condition? Have you tried starting with a smaller number of workers, or alternatively starting with a bigger node, to see if the problem persists?
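
If it is the OOM killer, there should be traces of it in the kernel log once you regain access, e.g. something along these lines (exact wording varies by distro):

# current boot:
sudo dmesg -T | grep -iE "out of memory|oom"
# previous boot, if the systemd journal is persistent:
sudo journalctl -k -b -1 | grep -i oom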

https://www.grampsweb.org/install_setup/cpu-limited/
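
For example (assuming the compose file from the deployment docs), reducing the workers is just one extra environment variable on the grampsweb service:

services:
  grampsweb:
    environment:
      GUNICORN_NUM_WORKERS: 2   # the image defaults to 8 workers, which is a lot for 2 GB of RAM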

Hi @sleuth
Most of your log looks familiar, except for this line below. Have you investigated that?

nginx-proxy-acme  | [Sun Jan 12 18:17:33 UTC 2025] Pending. The CA is processing your order, please wait. (3/30)

I believe you'd have to get the certificate issue resolved before you can get to the wizard, which you would see when you connect to your server via the web interface.

That seems to have done the trick! I set GUNICORN_NUM_WORKERS to 2 and the containers all started up without any deadlocking issues.

I was then confused for a bit, as things seemed to be non-functional despite the machine being responsive. I had to pay attention to these lines in the logs:

nginx-proxy-acme  | [Mon Jan 13 04:01:18 UTC 2025] REDACTED.com: Invalid status. Verification error details: 12.34.56.78: Fetching http://REDACTED.com/.well-known/acme-challenge/PtHQ-__HE1wVpb5rnUSaJlZsr82KzOx01EE_yjP4NIw: Timeout during connect (likely firewall problem)
nginx-proxy-acme  | [Mon Jan 13 04:01:18 UTC 2025] Please check log file for more details: /dev/null
nginx-proxy-acme  | Sleep for 3600s

I forgot to allow inbound traffic on the server :man_facepalming: I'd just gone with the default security group rules, which only allow SSH. After allowing ports 80 and 443 and restarting the Docker containers, everything booted up just fine and I was able to access the UI on my domain :sunglasses:
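
In case anyone else trips over this: the fix is just adding inbound HTTP and HTTPS rules to the instance's security group, for example via the AWS CLI (the security group ID below is a placeholder for your own):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 0.0.0.0/0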


After importing my tree, getting things running, and monitoring the machine for a bit, things definitely seem to be memory bound rather than CPU bound. The CPU is >99% idle, but the machine only has 149 MiB of RAM free out of 1843 MiB total.

I tried increasing the number of workers to 4. The machine didn't lock up entirely, but it became very unresponsive, with CPU usage reported as 92% spent waiting on I/O. That lasted for a couple of minutes, with startup seemingly stuck. Then the deadlock resolved, and the server was able to start up and respond to requests. The percentage spent waiting dropped to ~25% but stayed there, and the grampsweb_celery workers seemed to be stuck in a boot loop: they would start up, print out all of their available tasks a few seconds later, then almost immediately exit with code 0, and another would start a few seconds later. During these loops free memory bounced between 50 and 250 MB. When I tried to load the web UI, the server locked up, jumping back to 80-90% of CPU time spent waiting.

Given all that, it seems like a memory issue rather than a CPU one was causing the deadlock: RAM would run low, the system would start swapping to disk, and the CPU would get stuck waiting on all of the disk I/O.
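
For anyone debugging something similar, the pattern shows up with the usual tools; these are roughly what I was watching (nothing Gramps-specific):

free -m         # total vs. available RAM and swap usage
vmstat 5        # the "wa" column is CPU time stuck waiting on I/O
docker stats    # per-container CPU and memory usage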

Happy to contribute a note about this! Any suggestion for where to add it? Perhaps a troubleshooting section at the bottom of the Deploy with Docker page?

Thanks! Yes, I think that’s a good idea.

By the way, another knob to try out regarding memory usage is gunicorn's --preload option (by overriding the docker command in your compose file and appending --preload to the default command). We should probably make this the default anyway.
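
Something along these lines in the compose file, assuming the image's default gunicorn invocation (this is a sketch, so check the image's Dockerfile for the exact command before copying it):

services:
  grampsweb:
    # sketch only: reproduce the image's actual default command (see its Dockerfile)
    # and append --preload so the app is loaded once in the master process and
    # shared with the workers instead of being imported separately by each one
    command: gunicorn -w 2 -b 0.0.0.0:5000 gramps_webapi.wsgi:app --timeout 120 --preload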