rag explained — llms with company data
Retrieval-Augmented Generation: why not fine-tuning, what the architecture looks like, what GDPR demands.
by tokyn studio · 5 min read

TL;DR. Retrieval-Augmented Generation (RAG) is an architecture where a Large Language Model looks up relevant passages from an external knowledge source before each answer. Benefit: the LLM uses current, company-specific data — without being retrained, with source citations per answer. As of 2026-05.
what is retrieval-augmented generation?
RAG combines two ideas: the knowledge baked into a foundation model, and the data you have inside your company. Instead of retraining the model on your data, a RAG system retrieves, at query time, the passages from your knowledge base that are most likely relevant to the question, and passes them to the model as additional context for generation.
Concretely: an employee asks the Company-GPT "What are our notice periods for framework contracts?". RAG searches the indexed contracts, finds the five most relevant passages, silently appends them to the question, and lets the LLM formulate the answer — with concrete source citations from the contracts.
why not just fine-tuning?
Fine-tuning (continuing to train the LLM on your own data) looks like the more obvious choice at first. In practice, RAG has five decisive advantages:
1. Freshness. New documents become visible to the RAG system the moment they're indexed: minutes, not weeks. Fine-tuning would need a new training run for every data update.
2. Source citations. RAG systems can back every answer with the underlying source. Fine-tuning blurs knowledge into the model, so citations can no longer be reconstructed.
3. Cost. Fine-tuning a GPT-4-class model quickly runs into four- to five-figure costs depending on data volume. RAG uses the model unchanged, paying only for the search index plus the normal inference price.
4. GDPR advantage. Personal data lands in the RAG index, not in the trained model weights. Deletion requests are technically feasible (remove the record from the index); with fine-tuning they are practically impossible.
5. Hallucinations. RAG reduces hallucinations because the model uses explicit sources rather than only trained intuition. When no source is found, it can say "I don't know", a discipline fine-tuned models struggle to learn.
architecture in 5 steps
1. Ingest. Your documents (PDFs, Word, Confluence, SharePoint, CRM records) are converted into a search index. Texts are split into chunks (typically 200–800 tokens).
2. Embedding. Each chunk gets a vector — a numeric representation of its content. Similar content has similar vectors. Models for this: OpenAI text-embedding-3, Mistral Embed, Nomic, Voyage.
3. Retrieval. When a user asks a question, the question itself is embedded. Vector similarity (cosine similarity) returns the k closest-matching chunks — typically k=5 to 20.
4. Reranking (optional). A second model (e.g. Cohere Rerank) re-orders the k chunks by true relevance to the question. Substantially reduces noise.
5. Generation. The question plus the top-N chunks go into the LLM as a prompt. The LLM answers based on this context, ideally with a source pointer per statement.
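The retrieval and generation steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, reranking is skipped, and all function names, chunk IDs, and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: dict[str, str], k: int = 5) -> list[tuple[str, float]]:
    """Step 3: embed the question and return the k most similar chunk IDs."""
    q = embed(question)
    scored = [(cid, cosine(q, embed(text))) for cid, text in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, chunks: dict[str, str], hits: list[tuple[str, float]]) -> str:
    """Step 5: put the retrieved chunks, labelled with source IDs, in front of the question."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid, _ in hits)
    return (
        "Answer using only the sources below and cite a source ID per statement.\n"
        "If the sources do not contain the answer, say that you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical mini knowledge base: chunk ID -> chunk text.
chunks = {
    "contract-a#3": "Either party may terminate the framework contract with a notice period of three months.",
    "handbook#12": "Travel expenses are reimbursed within thirty days.",
    "contract-b#7": "The notice period for this framework contract is six weeks to the end of a quarter.",
}
question = "What are our notice periods for framework contracts?"
hits = retrieve(question, chunks, k=2)
prompt = build_prompt(question, chunks, hits)
print(prompt)
```

In a real system, `embed` would call an embedding model and the chunk vectors would be computed once at ingest and stored in a vector index, rather than recomputed per query as in this sketch.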
typical failure modes
Bad chunks. Too long → the LLM has to filter through noise. Too short → context is missing. Iterative tuning of chunk size is mandatory.
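To make chunk size something you can actually tune, it helps to have it as a parameter. The sketch below is one simple way to do that, under two assumptions not taken from the text above: whitespace-separated words serve as a rough proxy for tokens, and neighbouring chunks share a small overlap so that a sentence cut at a boundary still appears whole in one of them. `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Small demo with ten placeholder words.
demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo, size=4, overlap=1)
print(demo_chunks)
```

Running the same evaluation questions against indexes built with different `size` values is one way to carry out the iterative tuning described above.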
Wrong embedding model. A model trained mainly on English text is not optimised for German. Multilingual models (Mistral Embed, multilingual-E5) typically outperform English-only options in DACH contexts.
No reranking. Pure vector search often returns chunks that are topically similar but don't answer the question. Reranking is often the cheapest quality improvement available.
No source citations. If the LLM delivers answers without source IDs, nobody can verify. Mandatory in any serious RAG setup: per statement a source ID that points back to the original passage.
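A per-statement source ID can also be checked mechanically before an answer is shown. The following sketch assumes a convention that is not prescribed above: the LLM is instructed to cite sources as `[source-id]` inside each sentence. The function name and sample IDs are hypothetical, and the sentence splitting is deliberately naive.

```python
import re

CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Report sentences without any citation and cited IDs that were never retrieved."""
    uncited, unknown = [], []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        ids = CITATION.findall(sentence)
        if not ids:
            uncited.append(sentence)
        unknown.extend(i for i in ids if i not in retrieved_ids)
    return {"uncited": uncited, "unknown": unknown}


answer = (
    "The notice period is three months [contract-a#3]. "
    "Contract B allows six weeks [contract-b#9]. "
    "Both are standard."
)
report = check_citations(answer, {"contract-a#3", "contract-b#7"})
print(report)
```

An answer with a non-empty `uncited` or `unknown` list could be flagged in the UI or regenerated; which of the two is appropriate depends on the use case.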
Stale index. Without re-indexing on data updates, the system drifts within weeks. Daily delta updates should be the default.
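One common way to implement delta updates is to store a content fingerprint per document and compare it on each run, so only new, changed, or removed documents touch the index. The sketch below assumes this hash-based approach; `plan_delta` and the document IDs are hypothetical, and the actual re-chunking and re-embedding of the affected documents is left out.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored fingerprints (doc_id -> hash) with current documents (doc_id -> text)."""
    current = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    return {
        "add": sorted(current.keys() - indexed.keys()),
        "update": sorted(d for d in current.keys() & indexed.keys() if current[d] != indexed[d]),
        "delete": sorted(indexed.keys() - current.keys()),
    }


# State saved after the previous indexing run (hypothetical documents).
indexed = {
    "policy.pdf": fingerprint("Version 1 of the travel policy."),
    "old-faq.docx": fingerprint("Outdated FAQ."),
    "contract-a.pdf": fingerprint("Framework contract A."),
}
# Documents found in the source systems today.
current_docs = {
    "policy.pdf": "Version 2 of the travel policy.",
    "contract-a.pdf": "Framework contract A.",
    "contract-b.pdf": "Framework contract B.",
}
plan = plan_delta(indexed, current_docs)
print(plan)
```

Only the documents listed under `add` and `update` need to be chunked and embedded again, which keeps a daily run cheap even for a large corpus.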
gdpr-relevant aspects
Running RAG on personal data (customer records, HR documents) requires GDPR diligence:
- Per-document access control. When a user asks the RAG system a question, they should only get answers from documents they're authorised to access. This is not "nice to have" — without filtering, the system leaks data across departments.
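- Deletion duties. When a customer requests erasure under Art. 17 GDPR, the index has to follow: the record comes out together with its chunks and embeddings, and documents that were only partly redacted are re-chunked and re-embedded.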
- DPA with embedding provider. Using OpenAI or Mistral as embedding provider requires a data processing agreement (see our [glossary entry GDPR + AI](/glossar#dsgvo-ai)).
- EU hosting possible. With Mistral, Cohere EU, or self-hosted open-source models (Llama, Mixtral), RAG can run entirely EU-only.
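Per-document access control and erasure both become simpler when every chunk carries its document ID and permissions as metadata. The sketch below illustrates that idea with an in-memory list; the `Chunk` fields, group names, and function names are hypothetical, and a real vector store would typically apply the same filter through its own metadata query mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def visible_chunks(index: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Filter BEFORE retrieval so unauthorised chunks never reach the prompt."""
    return [c for c in index if c.allowed_groups & user_groups]


def erase_document(index: list[Chunk], doc_id: str) -> list[Chunk]:
    """Erasure request: drop every chunk (and with it every embedding) of one document."""
    return [c for c in index if c.doc_id != doc_id]


index = [
    Chunk("hr-1#0", "hr-1", "Salary bands for 2025.", frozenset({"hr"})),
    Chunk("kb-7#0", "kb-7", "How to reset your VPN token.", frozenset({"hr", "sales", "it"})),
    Chunk("crm-42#0", "crm-42", "Customer record: contact history.", frozenset({"sales"})),
    Chunk("crm-42#1", "crm-42", "Customer record: open invoices.", frozenset({"sales"})),
]

sales_view = visible_chunks(index, {"sales"})
after_erasure = erase_document(index, "crm-42")
print([c.chunk_id for c in sales_view])
print([c.chunk_id for c in after_erasure])
```

Filtering before the similarity search, rather than after generation, means the LLM never sees content the user is not authorised to read.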
when does rag pay off — when not?
Pays off: whenever an LLM needs to access internal data that wasn't in the foundation model's training. Classic: contract / manual lookup, internal knowledge base, customer service with product docs, research assistants in law firms.
Doesn't pay off: when the task needs no internal company data (translation, summarisation of external texts, code generation). The foundation model alone suffices, and RAG only adds unnecessary latency and complexity.
what we do at tokyn
We build RAG systems as the core of our [Company-GPT implementations](/company-gpt). Typical delivery time for a productive pilot: 2–4 weeks. Includes: chunk strategy matching the document type, embedding model selection by language and compliance, reranking pipeline, source UI, per-document access control, GDPR setup.
If you have a concrete use case: a [discovery call](/kontakt) clarifies in 30 minutes whether RAG is the right architecture for it.
sources
- [Original paper: Lewis et al., 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
- [Cohere Rerank documentation](https://docs.cohere.com/docs/rerank-overview)
- [Mistral Embed (multilingual, EU-hosted)](https://docs.mistral.ai/capabilities/embeddings/)
- [tokyn glossary: RAG](/glossar#rag), [Company-GPT](/glossar#company-gpt)