I really appreciate all the help on this! Thank you!
Here is my update for today:
I managed to pull various models (e.g. granite4, smollm2) into the container, but the chat still throws a 504 error.
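For reference, I pulled them roughly like this (just a sketch of what I ran, with the tags mentioned above):

```bash
# pull the models into the running ollama container
docker exec -it ollama ollama pull granite4
docker exec -it ollama ollama pull smollm2

# check that they are listed
docker exec -it ollama ollama list
```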
The logs of the ollama container look like things are at least going in the right direction:
llama_model_loader: - kv 39: tokenizer.chat_template str = {%- set tools_system_message_prefix =…
llama_model_loader: - kv 40: general.quantization_version u32 = 2
llama_model_loader: - kv 41: general.file_type u32 = 15
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 240 tensors
llama_model_loader: - type q6_K: 41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.95 GiB (4.93 BPW)
load: printing all EOG tokens:
load: - 100257 (‘<|end_of_text|>’)
load: - 100261 (‘<|fim_pad|>’)
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granite
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 1.0e+01
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = yes
print_info: model type = 3B
print_info: model params = 3.40 B
print_info: general.name = Granite 4.0 Micro
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.220000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 8192
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 ‘<|end_of_text|>’
print_info: EOS token = 100257 ‘<|end_of_text|>’
print_info: EOT token = 100257 ‘<|end_of_text|>’
print_info: UNK token = 100269 ‘<|unk|>’
print_info: PAD token = 100256 ‘<|pad|>’
print_info: LF token = 198 ‘Ċ’
print_info: FIM PRE token = 100258 ‘<|fim_prefix|>’
print_info: FIM SUF token = 100260 ‘<|fim_suffix|>’
print_info: FIM MID token = 100259 ‘<|fim_middle|>’
print_info: FIM PAD token = 100261 ‘<|fim_pad|>’
print_info: EOG token = 100257 ‘<|end_of_text|>’
print_info: EOG token = 100261 ‘<|fim_pad|>’
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while… (mmap = false)
load_tensors: CPU model buffer size = 1998.84 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.39 MiB
llama_kv_cache: CPU KV buffer size = 320.00 MiB
llama_kv_cache: size = 320.00 MiB ( 4096 cells, 40 layers, 1/1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 201.00 MiB
llama_context: graph nodes = 1329
llama_context: graph splits = 1
time=2026-01-19T19:41:15.562Z level=INFO source=server.go:1376 msg=“llama runner started in 49.21 seconds”
time=2026-01-19T19:41:15.562Z level=INFO source=sched.go:517 msg=“loaded runners” count=1
time=2026-01-19T19:41:15.563Z level=INFO source=server.go:1338 msg=“waiting for llama runner to start responding”
time=2026-01-19T19:41:15.563Z level=INFO source=server.go:1376 msg=“llama runner started in 49.21 seconds”
[GIN] 2026/01/19 - 19:42:21 | 500 | 1m56s | 172.18.0.5 | POST “/v1/chat/completions”
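As a next step I want to rule out Ollama itself by calling its OpenAI-compatible endpoint directly from the host (port 11434 is published in my compose file below), something along these lines:

```bash
# minimal direct test of Ollama's OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite4",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```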
The grampsweb container log, however, makes me think that something is wrong with the workers:
File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/llm/__init__.py", line 162, in answer_with_agent
result = agent.run_sync(prompt, deps=deps, message_history=message_history)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/pydantic_ai/agent/abstract.py”, line 372, in run_sync
return _utils.get_event_loop().run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.11/asyncio/base_events.py”, line 640, in run_until_complete
self.run_forever()
File “/usr/lib/python3.11/asyncio/base_events.py”, line 607, in run_forever
self._run_once()
File “/usr/lib/python3.11/asyncio/base_events.py”, line 1884, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.11/selectors.py”, line 468, in select
fd_event_list = self._selector.poll(timeout, max_ev)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/gunicorn/workers/base.py”, line 204, in handle_abort
sys.exit(1)
SystemExit: 1
[2026-01-19 19:42:20 +0000] [11] [INFO] Worker exiting (pid: 11)
[2026-01-19 19:42:21 +0000] [10] [ERROR] Worker (pid:11) was sent SIGKILL! Perhaps out of memory?
[2026-01-19 19:42:21 +0000] [15] [INFO] Booting worker with pid: 15
(gunicorn:15): Gtk-CRITICAL **: 19:42:27.186: gtk_icon_theme_get_for_screen: assertion ‘GDK_IS_SCREEN (screen)’ failed
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/distiluse-base-multilingual-cased-v2
[2026-01-19 19:45:08 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:15)
[2026-01-19 19:45:08 +0000] [15] [ERROR] Error handling request /api/chat/
Traceback (most recent call last):
File “/usr/local/lib/python3.11/dist-packages/gunicorn/workers/sync.py”, line 134, in handle
self.handle_request(listener, req, client, addr)
File “/usr/local/lib/python3.11/dist-packages/gunicorn/workers/sync.py”, line 177, in handle_request
respiter = self.wsgi(environ, resp.start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 1536, in __call__
return self.wsgi_app(environ, start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/flask/app.py”, line 1511, in wsgi_app
response = self.full_dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/flask/app.py”, line 917, in full_dispatch_request
rv = self.dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/flask/app.py”, line 902, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/auth.py”, line 44, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/flask/views.py”, line 110, in view
return current_app.ensure_sync(self.dispatch_request)(**kwargs) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/flask/views.py”, line 191, in dispatch_request
return current_app.ensure_sync(meth)(**kwargs) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/webargs/core.py”, line 652, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/webargs/core.py”, line 652, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/resources/chat.py”, line 85, in post
result = process_chat(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/celery/local.py", line 182, in __call__
return self._get_current_object()(*a, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/util/celery.py", line 18, in __call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/tasks.py”, line 660, in process_chat
result = answer_with_agent(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/llm/__init__.py", line 162, in answer_with_agent
result = agent.run_sync(prompt, deps=deps, message_history=message_history)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/pydantic_ai/agent/abstract.py”, line 372, in run_sync
return _utils.get_event_loop().run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.11/asyncio/base_events.py”, line 640, in run_until_complete
self.run_forever()
File “/usr/lib/python3.11/asyncio/base_events.py”, line 607, in run_forever
self._run_once()
File “/usr/lib/python3.11/asyncio/base_events.py”, line 1884, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/lib/python3.11/selectors.py”, line 468, in select
fd_event_list = self._selector.poll(timeout, max_ev)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.11/dist-packages/gunicorn/workers/base.py”, line 204, in handle_abort
sys.exit(1)
SystemExit: 1
[2026-01-19 19:45:08 +0000] [15] [INFO] Worker exiting (pid: 15)
[2026-01-19 19:45:09 +0000] [10] [ERROR] Worker (pid:15) was sent SIGKILL! Perhaps out of memory?
[2026-01-19 19:45:09 +0000] [19] [INFO] Booting worker with pid: 19
(gunicorn:19): Gtk-CRITICAL **: 19:45:14.820: gtk_icon_theme_get_for_screen: assertion ‘GDK_IS_SCREEN (screen)’ failed
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/distiluse-base-multilingual-cased-v2
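Given the "Perhaps out of memory?" message, my plan is to watch memory while I send a chat message and to check afterwards whether the kernel OOM killer really fired, roughly like this:

```bash
# live memory/CPU usage of all containers while a chat request is running
docker stats

# afterwards: did the kernel OOM killer actually kill anything?
sudo dmesg -T | grep -iE "out of memory|oom-kill|killed process"
```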
For info: as resources on my RPi are limited, I followed the recommendation to limit CPU and memory usage (see the "Limit CPU & memory usage" page in the Gramps Web docs) and changed the settings to one worker.
Here is the entire docker-compose.yml:
```yaml
services:
  grampsweb: &grampsweb
    image:
    restart: always
    ports:
      - "80:5000"  # host:docker
    environment:
      GRAMPSWEB_TREE: "Gramps Web"  # will create a new tree if not exists
      GRAMPSWEB_CELERY_CONFIG__broker_url: "redis://grampsweb_redis:6379/0"
      GRAMPSWEB_CELERY_CONFIG__result_backend: "redis://grampsweb_redis:6379/0"
      GRAMPSWEB_RATELIMIT_STORAGE_URI: redis://grampsweb_redis:6379/1
      GUNICORN_NUM_WORKERS: 1
      GRAMPSWEB_VECTOR_EMBEDDING_MODEL: sentence-transformers/distiluse-base-multilingual-cased-v2
      GRAMPSWEB_LLM_BASE_URL: http://ollama:11434/v1
      GRAMPSWEB_LLM_MODEL: granite4
      OPENAI_API_KEY: ollama
    depends_on:
      - grampsweb_redis
    volumes:
      - gramps_users:/app/users  # persist user database
      - gramps_index:/app/indexdir  # persist search index
      - gramps_thumb_cache:/app/thumbnail_cache  # persist thumbnails
      - gramps_cache:/app/cache  # persist export and report caches
      - gramps_secret:/app/secret  # persist flask secret
      - gramps_db:/root/.gramps/grampsdb  # persist Gramps database
      - gramps_media:/app/media  # persist media files
      - gramps_tmp:/tmp

  grampsweb_celery:
    <<: *grampsweb  # YAML merge key copying the entire grampsweb service config
    ports:
    container_name: grampsweb_celery
    depends_on:
      - grampsweb_redis
    command: celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=1

  grampsweb_redis:
    image:
    container_name: grampsweb_redis
    restart: always

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  gramps_users:
  gramps_index:
  gramps_thumb_cache:
  gramps_cache:
  gramps_secret:
  gramps_db:
  gramps_media:
  gramps_tmp:
  ollama_data:
```
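Since the file relies on the &grampsweb YAML anchor/merge key, here is how I plan to double-check the effective configuration and the in-network connection to Ollama (this assumes curl is available inside the grampsweb image):

```bash
# show the fully merged configuration that docker compose will actually use
docker compose config

# from inside the grampsweb container: can it reach Ollama by service name?
docker compose exec grampsweb curl -s http://ollama:11434/api/tags
```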
BTW: I am using an RPi 4 with 8 GB RAM running Raspberry Pi OS Desktop.