Talk to your tree!

In addition to the sage advice from Doug and David, look not only at the smallest parameter count; because of the RPi memory limitation, also consider smaller model file sizes. qwen3:0.6b looks interesting to me, and I’ve read that granite4 models use less run-time memory, but what that means in practice, I don’t know.
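
If it helps, here is a quick sketch for listing the pulled models together with their file sizes, so you can see what will even fit into the Pi’s RAM (it assumes Ollama’s REST API is reachable on port 11434; the field names are taken from recent Ollama versions):

```python
# List the models pulled into Ollama together with their file sizes,
# to judge what realistically fits into the Pi's memory.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size_gib = model["size"] / 1024**3  # size is reported in bytes
    print(f"{model['name']:<30} {size_gib:5.2f} GiB")
```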

BTW, when I was trying to make Gramps Web chat work over a year ago (before the current tool-use implementation), I did not succeed. According to the logs, I was running into issues with the context window which I wasn’t able to resolve. So take my advice with a grain of salt :slight_smile:
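
For what it’s worth, if you want to rule the context window in or out yourself, you can ask Ollama directly for a larger num_ctx per request. This goes through Ollama’s native /api/chat endpoint; as far as I know the OpenAI-compatible /v1 endpoint that Gramps Web talks to does not accept this option, so treat it purely as a diagnostic sketch:

```python
# Probe a model with a larger context window than Ollama's default.
# Uses Ollama's native /api/chat endpoint, not the /v1 endpoint Gramps Web uses.
import requests

payload = {
    "model": "granite4",             # whichever model you pulled
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "options": {"num_ctx": 8192},    # per-request context size
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```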

I really appreciate all the help on this! Thank you!

Here is my update for today:

I managed to pull various models (e.g. granite4, smollm2) into the container, but the chat still throws an Error 504.

The logs of the Ollama container look like things are at least going in the right direction:

llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {%- set tools_system_message_prefix =…

llama_model_loader: - kv  40:               general.quantization_version u32              = 2

llama_model_loader: - kv  41:                          general.file_type u32              = 15

llama_model_loader: - type  f32:   81 tensors

llama_model_loader: - type q4_K:  240 tensors

llama_model_loader: - type q6_K:   41 tensors

print_info: file format = GGUF V3 (latest)

print_info: file type   = Q4_K - Medium

print_info: file size   = 1.95 GiB (4.93 BPW)

load: printing all EOG tokens:

load:   - 100257 ('<|end_of_text|>')

load:   - 100261 ('<|fim_pad|>')

load: special tokens cache size = 96

load: token to piece cache size = 0.6152 MB

print_info: arch             = granite

print_info: vocab_only       = 0

print_info: no_alloc         = 0

print_info: n_ctx_train      = 131072

print_info: n_embd           = 2560

print_info: n_embd_inp       = 2560

print_info: n_layer          = 40

print_info: n_head           = 40

print_info: n_head_kv        = 8

print_info: n_rot            = 64

print_info: n_swa            = 0

print_info: is_swa_any       = 0

print_info: n_embd_head_k    = 64

print_info: n_embd_head_v    = 64

print_info: n_gqa            = 5

print_info: n_embd_k_gqa     = 512

print_info: n_embd_v_gqa     = 512

print_info: f_norm_eps       = 0.0e+00

print_info: f_norm_rms_eps   = 1.0e-05

print_info: f_clamp_kqv      = 0.0e+00

print_info: f_max_alibi_bias = 0.0e+00

print_info: f_logit_scale    = 1.0e+01

print_info: f_attn_scale     = 1.6e-02

print_info: n_ff             = 8192

print_info: n_expert         = 0

print_info: n_expert_used    = 0

print_info: n_expert_groups  = 0

print_info: n_group_used     = 0

print_info: causal attn      = 1

print_info: pooling type     = 0

print_info: rope type        = 0

print_info: rope scaling     = linear

print_info: freq_base_train  = 10000000.0

print_info: freq_scale_train = 1

print_info: n_ctx_orig_yarn  = 131072

print_info: rope_yarn_log_mul= 0.0000

print_info: rope_finetuned   = yes

print_info: model type       = 3B

print_info: model params     = 3.40 B

print_info: general.name     = Granite 4.0 Micro

print_info: f_embedding_scale = 12.000000

print_info: f_residual_scale  = 0.220000

print_info: f_attention_scale = 0.015625

print_info: n_ff_shexp        = 8192

print_info: vocab type       = BPE

print_info: n_vocab          = 100352

print_info: n_merges         = 100000

print_info: BOS token        = 100257 '<|end_of_text|>'

print_info: EOS token        = 100257 '<|end_of_text|>'

print_info: EOT token        = 100257 '<|end_of_text|>'

print_info: UNK token        = 100269 '<|unk|>'

print_info: PAD token        = 100256 '<|pad|>'

print_info: LF token         = 198 'Ċ'

print_info: FIM PRE token    = 100258 '<|fim_prefix|>'

print_info: FIM SUF token    = 100260 '<|fim_suffix|>'

print_info: FIM MID token    = 100259 '<|fim_middle|>'

print_info: FIM PAD token    = 100261 '<|fim_pad|>'

print_info: EOG token        = 100257 '<|end_of_text|>'

print_info: EOG token        = 100261 '<|fim_pad|>'

print_info: max token length = 256

load_tensors: loading model tensors, this can take a while… (mmap = false)

load_tensors:          CPU model buffer size =  1998.84 MiB

llama_context: constructing llama_context

llama_context: n_seq_max     = 1

llama_context: n_ctx         = 4096

llama_context: n_ctx_seq     = 4096

llama_context: n_batch       = 512

llama_context: n_ubatch      = 512

llama_context: causal_attn   = 1

llama_context: flash_attn    = auto

llama_context: kv_unified    = false

llama_context: freq_base     = 10000000.0

llama_context: freq_scale    = 1

llama_context: n_ctx_seq (4096) < n_ctx_train (131072) – the full capacity of the model will not be utilized

llama_context:        CPU  output buffer size =     0.39 MiB

llama_kv_cache:        CPU KV buffer size =   320.00 MiB

llama_kv_cache: size =  320.00 MiB (  4096 cells,  40 layers,  1/1 seqs), K (f16):  160.00 MiB, V (f16):  160.00 MiB

llama_context: Flash Attention was auto, set to enabled

llama_context:        CPU compute buffer size =   201.00 MiB

llama_context: graph nodes  = 1329

llama_context: graph splits = 1

time=2026-01-19T19:41:15.562Z level=INFO source=server.go:1376 msg="llama runner started in 49.21 seconds"

time=2026-01-19T19:41:15.562Z level=INFO source=sched.go:517 msg="loaded runners" count=1

time=2026-01-19T19:41:15.563Z level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"

time=2026-01-19T19:41:15.563Z level=INFO source=server.go:1376 msg="llama runner started in 49.21 seconds"

[GIN] 2026/01/19 - 19:42:21 | 500 |         1m56s |      172.18.0.5 | POST     "/v1/chat/completions"

but the grampsweb container log makes me assume that something is wrong with the workers:

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/llm/__init__.py", line 162, in answer_with_agent

result = agent.run_sync(prompt, deps=deps, message_history=message_history)

         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/pydantic_ai/agent/abstract.py", line 372, in run_sync

return _utils.get_event_loop().run_until_complete(

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete

self.run_forever()

File "/usr/lib/python3.11/asyncio/base_events.py", line 607, in run_forever

self._run_once()

File "/usr/lib/python3.11/asyncio/base_events.py", line 1884, in _run_once

event_list = self._selector.select(timeout)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.11/selectors.py", line 468, in select

fd_event_list = self._selector.poll(timeout, max_ev)

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gunicorn/workers/base.py", line 204, in handle_abort

sys.exit(1)

SystemExit: 1

[2026-01-19 19:42:20 +0000] [11] [INFO] Worker exiting (pid: 11)

[2026-01-19 19:42:21 +0000] [10] [ERROR] Worker (pid:11) was sent SIGKILL! Perhaps out of memory?

[2026-01-19 19:42:21 +0000] [15] [INFO] Booting worker with pid: 15

(gunicorn:15): Gtk-CRITICAL **: 19:42:27.186: gtk_icon_theme_get_for_screen: assertion 'GDK_IS_SCREEN (screen)' failed

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/distiluse-base-multilingual-cased-v2

[2026-01-19 19:45:08 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:15)

[2026-01-19 19:45:08 +0000] [15] [ERROR] Error handling request /api/chat/

Traceback (most recent call last):

File "/usr/local/lib/python3.11/dist-packages/gunicorn/workers/sync.py", line 134, in handle

self.handle_request(listener, req, client, addr)

File "/usr/local/lib/python3.11/dist-packages/gunicorn/workers/sync.py", line 177, in handle_request

respiter = self.wsgi(environ, resp.start_response)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 1536, in __call__

return self.wsgi_app(environ, start_response)

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 1511, in wsgi_app

response = self.full_dispatch_request()

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 917, in full_dispatch_request

rv = self.dispatch_request()

     ^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 902, in dispatch_request

return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/auth.py", line 44, in wrapper

return func(*args, **kwargs)

       ^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/views.py", line 110, in view

return current_app.ensure_sync(self.dispatch_request)(**kwargs)  # type: ignore[no-any-return]

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/flask/views.py", line 191, in dispatch_request

return current_app.ensure_sync(meth)(**kwargs)  # type: ignore[no-any-return]

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/webargs/core.py", line 652, in wrapper

return func(*args, **kwargs)

       ^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/webargs/core.py", line 652, in wrapper

return func(*args, **kwargs)

       ^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/resources/chat.py", line 85, in post

result = process_chat(

         ^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/celery/local.py", line 182, in __call__

return self._get_current_object()(*a, **kw)

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/util/celery.py", line 18, in __call__

return self.run(*args, **kwargs)

       ^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/tasks.py", line 660, in process_chat

result = answer_with_agent(

         ^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gramps_webapi/api/llm/__init__.py", line 162, in answer_with_agent

result = agent.run_sync(prompt, deps=deps, message_history=message_history)

         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/pydantic_ai/agent/abstract.py", line 372, in run_sync

return _utils.get_event_loop().run_until_complete(

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete

self.run_forever()

File "/usr/lib/python3.11/asyncio/base_events.py", line 607, in run_forever

self._run_once()

File "/usr/lib/python3.11/asyncio/base_events.py", line 1884, in _run_once

event_list = self._selector.select(timeout)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.11/selectors.py", line 468, in select

fd_event_list = self._selector.poll(timeout, max_ev)

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.11/dist-packages/gunicorn/workers/base.py", line 204, in handle_abort

sys.exit(1)

SystemExit: 1

[2026-01-19 19:45:08 +0000] [15] [INFO] Worker exiting (pid: 15)

[2026-01-19 19:45:09 +0000] [10] [ERROR] Worker (pid:15) was sent SIGKILL! Perhaps out of memory?

[2026-01-19 19:45:09 +0000] [19] [INFO] Booting worker with pid: 19

(gunicorn:19): Gtk-CRITICAL **: 19:45:14.820: gtk_icon_theme_get_for_screen: assertion 'GDK_IS_SCREEN (screen)' failed

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/distiluse-base-multilingual-cased-v2

For info: as resources on my RPi are limited, I followed the recommendation to limit CPU resources (Limit CPU & memory usage - Gramps Web) and changed the settings to one worker.
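
To see how much of this is raw generation time, I can time a request directly against the OpenAI-compatible endpoint, bypassing Gramps Web entirely: the worker gets killed after roughly three minutes, and the Ollama log already shows the first completion taking almost two. A rough sketch, assuming the openai Python package and the same base URL as in the compose file below:

```python
# Time a single chat completion directly against Ollama's OpenAI-compatible
# endpoint, to see whether generation alone approaches the worker timeout.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.monotonic()
completion = client.chat.completions.create(
    model="granite4",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
elapsed = time.monotonic() - start

print(f"Response after {elapsed:.1f} s: {completion.choices[0].message.content}")
```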

Here is the entire docker-compose.yml:

services:
  grampsweb: &grampsweb
    image: 
    restart: always
    ports:
      - "80:5000"  # host:docker
    environment:
      GRAMPSWEB_TREE: "Gramps Web"  # will create a new tree if not exists
      GRAMPSWEB_CELERY_CONFIG__broker_url: "redis://grampsweb_redis:6379/0"
      GRAMPSWEB_CELERY_CONFIG__result_backend: "redis://grampsweb_redis:6379/0"
      GRAMPSWEB_RATELIMIT_STORAGE_URI: redis://grampsweb_redis:6379/1
      GUNICORN_NUM_WORKERS: 1
      GRAMPSWEB_VECTOR_EMBEDDING_MODEL: sentence-transformers/distiluse-base-multilingual-cased-v2
      GRAMPSWEB_LLM_BASE_URL: http://ollama:11434/v1
      GRAMPSWEB_LLM_MODEL: granite4
      OPENAI_API_KEY: ollama
    depends_on:
      - grampsweb_redis
    volumes:
      - gramps_users:/app/users  # persist user database
      - gramps_index:/app/indexdir  # persist search index
      - gramps_thumb_cache:/app/thumbnail_cache  # persist thumbnails
      - gramps_cache:/app/cache  # persist export and report caches
      - gramps_secret:/app/secret  # persist flask secret
      - gramps_db:/root/.gramps/grampsdb  # persist Gramps database
      - gramps_media:/app/media  # persist media files
      - gramps_tmp:/tmp

  grampsweb_celery:
    <<: *grampsweb  # YAML merge key copying the entire grampsweb service config
    ports: 
    container_name: grampsweb_celery
    depends_on:
      - grampsweb_redis
    command: celery -A gramps_webapi.celery worker --loglevel=INFO --concurrency=1

  grampsweb_redis:
    image: 
    container_name: grampsweb_redis
    restart: always

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  gramps_users:
  gramps_index:
  gramps_thumb_cache:
  gramps_cache:
  gramps_secret:
  gramps_db:
  gramps_media:
  gramps_tmp:
  ollama_data:

BTW: I am using an RPi 4 with 8 GB RAM running Raspberry Pi OS Desktop.

Unfortunately, this is clearly a RAM issue.

I think the only hope is to find a model that supports tool calling and can run in 8 GB of RAM.
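
To check how much memory a model actually occupies once it is loaded, Ollama’s /api/ps endpoint can be queried directly (a sketch; the field names come from recent Ollama versions and may differ slightly):

```python
# Ask Ollama which models are currently loaded and how much memory
# they occupy while resident.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(f"{model['name']}: {model['size'] / 1024**3:.2f} GiB resident")
```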

My experiences with various models so far:

tinyllama → does not support tool calling

smollm2 → Error 504

granite4 → Error 504

granite4:350m → invalid message format

smollm2:135m → invalid message format

smollm2:360m → invalid message format

llama3.2:1b → invalid message format

functiongemma:latest → first tries ended with Error 504 (uses less memory, ~2 GB, but more CPU, 100%), but in the end I got my first response in the chat! :slight_smile:

Q: who are the parents of Adam?

A: I apologize, but I cannot assist with finding parents for specific individuals using the available genealogical tools. My current tools are designed for retrieving genealogical data related to family relationships and events, and cannot access or manage genealogical information for individuals’ parents.

So at least the tree now talks to me :slight_smile: However, the conversation is not yet satisfying :smiley:
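
To check whether a model supports tool calling at all, without Gramps Web in the loop, a minimal tool definition can be sent straight to the OpenAI-compatible endpoint; models without tool support typically error out or answer in plain text instead of returning a tool_calls entry. A rough sketch (the get_parents tool is made up for this test and is not one of Gramps Web’s real tools):

```python
# Minimal tool-calling smoke test against Ollama's OpenAI-compatible endpoint.
# The "get_parents" tool is a made-up example, not a real Gramps Web tool.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_parents",
            "description": "Return the parents of a person in the family tree.",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="granite4",
    messages=[{"role": "user", "content": "Who are the parents of Adam?"}],
    tools=tools,
)

message = completion.choices[0].message
if message.tool_calls:
    # The model decided to call the tool; show what it asked for.
    for call in message.tool_calls:
        print("tool call:", call.function.name, call.function.arguments)
else:
    # No tool call; the model answered (or refused) in plain text.
    print("plain answer:", message.content)
```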

I am wondering whether this is also related to the search index. As it takes almost 12 hours to rebuild the search index (with ca. 10k elements), I haven’t rebuilt it each time I tried a new model.

I am wondering whether this is also related to the search index. As it takes almost 12 hours to rebuild the search index (with ca. 10k elements), I haven’t rebuilt it each time I tried a new model.

The (semantic) search index uses a different model (VECTOR_EMBEDDING_MODEL) which is always local, so you don’t have to reindex when you change the LLM_MODEL.
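
Just to illustrate the separation: the semantic index embeds text with the local sentence-transformers model from VECTOR_EMBEDDING_MODEL, roughly like the sketch below (not the actual Gramps Web indexing code), so swapping the LLM never touches those vectors:

```python
# Embeddings for the semantic search index come from the locally running
# sentence-transformers model, not from the LLM behind LLM_MODEL.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

# Queries and indexed objects are encoded with this model; changing
# LLM_MODEL does not touch these vectors, so no reindex is needed.
vector = model.encode("Who are the parents of Adam?")
print(vector.shape)  # (512,) for this model
```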

If you enable debug logging, you will see the individual tool calls by the LLM.