
I Gave My Vault a Memory

In A Walkie-Talkie to My Vault, I solved capture. Text a thought, tap Today, done. The vault writes itself.

That worked. After months of daily logging, the vault had grown dense. Thousands of notes: goals, lessons, VulWall decisions, Roamler retrospectives, weekly reviews, ideas I'd completely forgotten having.

Then I wanted to actually use it.

Not search it. Not scroll it. Talk to it. Send a message, get an answer that pulls from everything I've written: the lesson from three months ago, the decision I made last week, the goal I set in January. Without remembering where I filed it.

I added one character to the bot. Now ? at the start of any message means: ask the vault.


The Setup

Same Telegram bot. Same Go service. I added a FastAPI sidecar that handles embeddings and retrieval, and wired it in.

The architecture is simple:

Obsidian vault (markdown files)
        ↓
    FastAPI sidecar
    LanceDB (local, on disk)
    Ollama (nomic-embed-text, 768-dim)
        ↓
    Telegram bot
    "?" prefix triggers retrieval
        ↓
    Groq LLM answers from retrieved context

No OpenAI embeddings. No managed vector DB. The entire retrieval stack runs on the home lab next to the vault; only the final answer generation still goes out to Groq.
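
The LanceDB piece really is just a directory on disk. Opening it from the sidecar is two lines; the path and table name here are illustrative, and this is the table handle the later snippets refer to:

import lancedb

# Embedded database: no server process, the "DB" is files next to the vault.
db = lancedb.connect("/data/lancedb")
table = db.open_table("vault_chunks")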


Embeddings Without a Cloud Dependency

The first decision: which embedding model.

I'm already running Ollama for other things. nomic-embed-text is 137M parameters, pulls in seconds, runs comfortably on CPU, and produces 768-dimensional vectors that punch well above their weight in retrieval benchmarks. It's become the default for local RAG setups for good reason.

ollama pull nomic-embed-text

That's the entire setup. The sidecar talks to Ollama over HTTP at localhost:11434. No API key. No rate limit. No bill.
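
If you want to poke at it before wiring anything else up, the call the sidecar makes is a single POST against Ollama's /api/embeddings route. A minimal sketch (the prompt text is just an example):

import requests

# One embedding from the local Ollama daemon: no API key, no rate limit.
# The response is {"embedding": [...]} with 768 floats for nomic-embed-text.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "delegation means letting go of the how"},
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embedding"]))  # 768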


Hybrid Retrieval

The obvious implementation (embed the query, find nearest vectors, return top-k) works fine for semantic questions. "What are my thoughts on delegation?" retrieves notes that mean delegation even if they don't say the word.

But it falls apart on specifics. If I ask "what was the outcome of the Roamler retro in April?", a pure vector search returns notes that are about retrospectives, not the one from April or the specific decisions that came out of it. Dense embeddings compress meaning; they wash out rare tokens, proper nouns, and dates.

The fix is hybrid search: BM25 for exact token matching, vector for semantic matching, and reciprocal rank fusion (RRF) to merge the two result lists into a single ranking.

query → embed → vector search    ─┐
query → tokenize → BM25 search   ─┼→ RRF → top-k chunks → LLM

LanceDB supports this natively with one call:

from lancedb.rerankers import RRFReranker

results = (
    table.search(query_type="hybrid", vector_column_name="vector")
    .vector(query_vector)            # dense leg: the query's embedding
    .text(query_text)                # sparse leg: BM25 over the raw query tokens
    .rerank(reranker=RRFReranker())  # reciprocal rank fusion of both lists
    .limit(6)
    .to_list()
)

Proper nouns, dates, and project names now land correctly because BM25 catches what the dense model misses.
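
In the sidecar, that call sits behind a small retrieval route. A sketch of the shape only: the /query route name and request model are mine, table is the handle opened earlier, and embed_batch is the helper shown later in this post.

from fastapi import FastAPI
from lancedb.rerankers import RRFReranker
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    k: int = 6

@app.post("/query")  # hypothetical route name
async def query(q: Query):
    # Dense leg: embed the question locally via Ollama.
    vec = (await embed_batch([q.question]))[0]
    # Hybrid search: vector + BM25, fused with RRF, exactly as above.
    hits = (
        table.search(query_type="hybrid", vector_column_name="vector")
        .vector(vec)
        .text(q.question)
        .rerank(reranker=RRFReranker())
        .limit(q.k)
        .to_list()
    )
    # The bot only needs the text and the source note key for each chunk.
    return [{"key": h["key"], "text": h["text"]} for h in hits]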


Chunking and Indexing

Notes are chunked into paragraph-shaped blocks, capped around 800 characters, with 120 characters of overlap between chunks. The overlap is the part most tutorials skip: without it, a sentence that straddles a chunk boundary gets split in half and retrieval misses it.
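
A sketch of that chunker, assuming paragraph splits on blank lines with a hard character split as the fallback for oversized paragraphs (the real one may differ in the details):

CHUNK_SIZE = 800   # soft cap per chunk, in characters
OVERLAP = 120      # tail carried into the next chunk

def chunk_note(text: str) -> list[str]:
    # Paragraph-shaped blocks first: split on blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        if len(current) + len(p) + 2 <= CHUNK_SIZE:
            current = f"{current}\n\n{p}" if current else p
            continue
        if current:
            chunks.append(current)
            # Carry the tail forward so a sentence straddling the boundary
            # stays retrievable from both sides.
            current = current[-OVERLAP:] + "\n\n" + p
        else:
            current = p
        # A single paragraph longer than the cap gets hard-split.
        while len(current) > CHUNK_SIZE:
            chunks.append(current[:CHUNK_SIZE])
            current = current[CHUNK_SIZE - OVERLAP:]
    if current:
        chunks.append(current)
    return chunks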

Each chunk is stored with:

  • The text
  • Its embedding vector
  • The note's vault-relative path as a key (Daily notes/2026/04/2026-04-12.md)
  • A SHA-256 hash of the full note content

The hash is the idempotency key. When the reindex job runs, it fetches the current hash map from the sidecar in one call, then walks the vault. If a file's hash matches what's already indexed, it's skipped entirely. A vault with 2,000 notes where 10 changed since yesterday means 10 embed calls, not 2,000.
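
The shape of that job, sketched in Python for illustration: the /hashes route name, sidecar address, and vault path are mine, and the real job also deals with the FTS index, covered below.

import hashlib
from pathlib import Path

import requests

SIDECAR = "http://localhost:8000"   # hypothetical sidecar address
VAULT = Path("/data/vault")         # hypothetical vault mount

# One call up front: { "Daily notes/2026/04/2026-04-12.md": "<sha256>", ... }
indexed = requests.get(f"{SIDECAR}/hashes", timeout=30).json()

for path in VAULT.rglob("*.md"):
    key = str(path.relative_to(VAULT))
    content = path.read_text(encoding="utf-8")
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if indexed.get(key) == digest:
        continue  # unchanged since the last run: no embed calls, no writes
    requests.post(
        f"{SIDECAR}/index",
        json={"key": key, "content": content, "hash": digest},
        timeout=120,
    )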

Bulk reindex runs on demand via /reindex in the bot. New notes get indexed automatically when I add them. Fast enough that I don't notice it.


Parallel Embeddings

One thing that bit me early: the first version of embed_batch was sequential.

# wrong: one HTTP round-trip per chunk, awaited one after another
out = []
for t in texts:
    r = await client.post(ollama_url, ...)
    out.append(r.json()["embedding"])
return out

For a note with 8 chunks, that's 8 serial HTTP round-trips to Ollama. Multiply by 2,000 notes on first index and you're looking at minutes of wall time for what should be seconds.

The fix is asyncio.gather:

import asyncio

async def embed_batch(texts: list[str]) -> list[list[float]]:
    # client: the sidecar's long-lived async HTTP client; ollama_url: the local Ollama embeddings endpoint.
    async def _one(t: str) -> list[float]:
        r = await client.post(ollama_url, json={"model": MODEL, "prompt": t}, timeout=60.0)
        r.raise_for_status()
        return r.json()["embedding"]

    # gather preserves input order
    return list(await asyncio.gather(*(_one(t) for t in texts)))

All chunks for a note go out concurrently. In my setup, Ollama still ends up doing the actual embedding work effectively serially on CPU, but the HTTP overhead stops stacking.


The FTS Index Trap

Another non-obvious issue: LanceDB's BM25 index needs an explicit build step.

The naive implementation calls create_fts_index after every /index call. That's an O(n) rebuild over the entire table, once per document during a vault reindex. For 2,000 notes, that's 2,000 full rebuilds of a growing index.

The fix: accept a rebuild_fts flag on the /index endpoint. The bulk reindex caller sets it to false for every document, then calls /rebuild-fts once at the very end. One rebuild, not 2,000.
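
On the sidecar side, the flag is just a conditional around the rebuild. A sketch, reusing chunk_note from the chunking sketch above and embed_batch from the parallel-embeddings section; table is the LanceDB handle from earlier, and the request model is mine:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class IndexRequest(BaseModel):
    key: str
    content: str
    hash: str
    rebuild_fts: bool = True  # single new notes keep the safe default

@app.post("/index")
async def index_note(req: IndexRequest):
    chunks = chunk_note(req.content)
    vectors = await embed_batch(chunks)
    # Assumed behavior: drop any stale chunks for this note before re-adding.
    table.delete(f"key = '{req.key}'")
    table.add([
        {"key": req.key, "hash": req.hash, "text": c, "vector": v}
        for c, v in zip(chunks, vectors)
    ])
    if req.rebuild_fts:
        table.create_fts_index("text", replace=True)  # O(n) over the whole table
    return {"chunks": len(chunks)}

@app.post("/rebuild-fts")
async def rebuild_fts():
    # Bulk reindex calls this exactly once, after the last /index.
    table.create_fts_index("text", replace=True)
    return {"ok": True}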

Small detail. Massive difference in reindex time.


Asking the Vault

From the user side, nothing changed. Same bot, same Telegram thread. The only difference is the prefix.

Without ?: the message goes into today's daily note. With ?: the message goes to the vault as a question.

me:  ? what were my main takeaways from the Roamler retro in April?

bot: Based on your notes from April 14 and the weekly review from W16:
     The main theme was async communication overhead...

The sidecar retrieves 6 chunks, the bot assembles them into a context block, and Groq answers against that context. The source note keys are passed into the prompt for grounding, even though the bot's final reply stays clean and citation-free.
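
The assembly itself lives in the Go bot; in Python, the shape of it looks roughly like this (the prompt wording is illustrative, not the actual prompt):

def build_context(chunks: list[dict]) -> str:
    # Each chunk carries its source note key; keys go into the prompt for
    # grounding even though the reply itself doesn't quote them back.
    return "\n\n---\n\n".join(f"[{c['key']}]\n{c['text']}" for c in chunks)

def build_messages(question: str, chunks: list[dict]) -> list[dict]:
    system = (
        "Answer from the vault excerpts below. Each excerpt is prefixed with "
        "the note it came from. If the excerpts don't cover the question, say so."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{build_context(chunks)}\n\nQuestion: {question}"},
    ]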

One thing I didn't expect: the detail hierarchy doing real work. My vault has daily logs (raw, specific), weekly reviews (themes), and quarterly reviews (narrative arcs). The chunking doesn't distinguish between them. A query about the Roamler retro retrieves from all three levels simultaneously: a daily note with the raw action items, a weekly review that grouped them into themes, a quarterly review that named it a turning point. The answer has natural depth because the source material was written at multiple levels of abstraction.

That wasn't designed. It emerged from the structure the walkie-talkie system wrote into the vault.

I've been using it to surface things I'd genuinely forgotten: decisions I made and didn't act on, lessons I wrote down and then ignored, patterns across weeks of daily logs that I couldn't see one day at a time. Generic RAG demos use docs or Wikipedia. When the source material is your own writing, the answers feel unnervingly on-point. It's not retrieving similar documents. It's retrieving me.


Steal This

The pieces are all open source and local.

  • Ollama + nomic-embed-text. Free, fast, no API key. ollama pull nomic-embed-text and you have a production-quality embedding model.
  • LanceDB. Embedded vector DB, runs in-process or as a sidecar, no infrastructure. Hybrid search built in.
  • Content-hash idempotency. SHA-256 the file, store the hash, skip unchanged files on reindex. Your reindex job goes from slow to instant after the first run.
  • Hybrid BM25 + vector + RRF. Pure vector search misses proper nouns, dates, and rare terms. Hybrid catches them. The extra setup is ten lines.
  • One rebuild at the end. Don't rebuild the FTS index after every document during bulk reindex. Rebuild it once when the loop finishes.
  • Use your existing interface. I already had a Telegram bot. Adding ? as a retrieval prefix cost me an if-statement and a new handler. No new app. No new habit.

The vault was already healthy. Now it answers back.
