RAG architecture

In AarhusAI, retrieval-augmented generation (RAG) moves ingestion and retrieval out of Open WebUI and into two dedicated microservices that share a single Qdrant index:

ingestion-service - extracts, chunks, embeds and writes documents to Qdrant.
retrieval-agent - reads Qdrant at query time, optionally wrapping the search in an LLM-driven reasoning loop.

Overview
Architecture at a glance
The shared Qdrant contract
Ingestion path
Retrieval path
Configuration touchpoints
Operations and health
Reference

1. Overview

Previously, Open WebUI ran the whole RAG flow internally: it extracted text (optionally through an external document loader), chunked it, embedded it, stored the vectors in its own per-knowledge collections, and ran a fixed linear retrieval at query time.

Why the change:

Control over extraction and chunking. The ingestion-service runs a Haystack v2 pipeline with pluggable extraction engines (kreuzberg, pypdf, docling, unstructured) and chunking strategies (token, markdown, sentence, …) selected by configuration. A content-based auto mode additionally routes layout-bound documents (flowcharts, diagrams, scanned forms) to a multimodal vision path (vision-llm / hybrid-diagram) while everything else takes a plain text extractor.
Hybrid search and reranking. Ingestion can write a sparse vector alongside the dense one, letting retrieval use Qdrant’s native RRF hybrid query plus optional cross-encoder reranking.
Agentic retrieval. The retrieval-agent can wrap search in a PydanticAI loop that rewrites queries, decomposes them, grades relevance, and retries - or fall back to a fast linear pipeline.
Decoupling. Either service can be swapped, scaled, or reconfigured without touching the Open WebUI fork.

The older “document ingestion route” described in Deployment was only a text extraction proxy - Open WebUI still embedded and stored the result itself. The ingestion-service replaces that by taking over embedding and storage as well.

2. Architecture at a glance

Both services are standalone FastAPI microservices. They never call each other; their only shared state is the Qdrant index.

flowchart TB
  user([user])

  subgraph owb["Open WebUI (patched fork)"]
    upload["file upload /<br/>knowledge base"]
    chat["chat / RAG query"]
  end

  subgraph ingest_box["ingestion-service - PUT /api/v1/ingest"]
    ING["Haystack pipeline<br/>extract → chunk → embed → write"]
  end

  subgraph retrieve_box["retrieval-agent - POST /search"]
    RET["linear or agentic<br/>retrieval pipeline"]
  end

  S3[("S3 / MinIO<br/>raw files")]
  SIDE["Kreuzberg<br/>text extraction sidecar"]
  GOT["Gotenberg<br/>office→PDF render sidecar"]
  VLM["Vision LLM endpoint<br/>(multimodal, VISION_LLM_*)"]
  EMB["Embedding endpoint<br/>(embed.itkdev.dk / TEI / fastembed)"]
  LLM["LiteLLM proxy<br/>(agent + query generation)"]
  RRK["Reranker endpoint<br/>(embed.itkdev.dk /v1/rerank)"]
  QD[("Qdrant<br/>single collection: ingestion_files<br/>dense + optional sparse vectors")]

  user --> owb
  upload -->|"upload then {bucket,key}"| S3
  upload -->|"EXTERNAL_INGESTION_ENGINE=external"| ING
  chat -->|"RAG_EXTERNAL_RETRIEVAL_API_KEY"| RET

  ING -->|fetch by key| S3
  ING -->|HTTP extract| SIDE
  ING -.vision engines.-> GOT
  ING -.vision engines.-> VLM
  ING -->|embed docs| EMB
  ING -->|write points| QD

  RET -->|embed query| EMB
  RET -.->|agent decisions| LLM
  RET -.->|rerank candidates| RRK
  RET -->|search points| QD

Solid arrows are always-on; dashed arrows are optional. On the ingestion side, Gotenberg and the Vision LLM endpoint are only used when a vision engine is selected (vision-llm, hybrid-diagram, or an auto-routed diagram). On the retrieval side, the LLM is only consulted in agentic mode or when query generation is enabled, and the reranker only when ENABLE_RERANKING is enabled.

3. The shared Qdrant contract

This is the most important thing to get right. Both services talk to the same physical Qdrant collection (ingestion_files by default) but neither discovers the other’s schema - the compatibility is entirely a matter of matching configuration. A mismatch produces silently wrong results (garbage similarity scores, empty hits) rather than errors.

The settings that must agree across ingestion-service and retrieval-agent:

Setting	ingestion-service	retrieval-agent	Why it must match
Qdrant instance	`QDRANT_URI`	`QDRANT_URI`	Same store, or retrieval reads nothing
Collection	`QDRANT_INDEX` (`ingestion_files`)	`QDRANT_INDEX` (`ingestion_files`)	Must name the same physical collection
Dense model	`EMBEDDING_MODEL`	`EMBEDDING_MODEL`	Different models → incomparable vector spaces
Doc/query prefix	`EMBEDDING_PREFIX_DOC` (e.g. `"passage: "`)	`EMBEDDING_PREFIX_QUERY` (e.g. `"query: "`)	e5/nomic models expect their trained prefix
Sparse model	`SPARSE_EMBEDDING_MODEL` (when sparse on)	`SPARSE_QUERY_MODEL` (when hybrid on)	Sparse query vectors must match the indexed ones

Multitenancy. Every chunk is written with a meta.collection_name payload (e.g. file-abc, or a knowledge-base name) and Qdrant is configured for per-tenant subgraphs keyed on that field (hnsw_config={"m": 0, "payload_m": 16}). The ingestion-service bootstraps the supporting keyword payload indexes at startup: meta.collection_name (the tenant key, is_tenant=True), meta.collection_type, meta.languages (ISO 639-1 codes surfaced by Kreuzberg, enabling language filtering), plus meta.file_id and meta.ingest_version (which back the versioned-overwrite sweep described in Ingestion path). At query time the retrieval-agent passes collection_names and filters on meta.collection_name ∈ collection_names, so one Open WebUI knowledge base never bleeds into another even though they share one physical collection.
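
The tenant filter described above can be sketched as a plain Qdrant filter body. This is a minimal illustration, not the retrieval-agent's actual code: the helper name `tenant_filter` is hypothetical, and only the `meta.collection_name` key and the `collection_names` request field come from the text. It assumes Qdrant's REST filter shape, where `match.any` accepts a list of allowed keyword values.

```python
from typing import Any


def tenant_filter(collection_names: list[str]) -> dict[str, Any]:
    """Build a Qdrant filter that keeps only points from the given tenants.

    Hypothetical helper; the returned dict follows Qdrant's REST filter
    format (a "must" clause with a "match any" condition).
    """
    if not collection_names:
        # Refuse an empty tenant list so a caller bug cannot silently
        # change the search scope.
        raise ValueError("collection_names must not be empty")
    return {
        "must": [
            {
                "key": "meta.collection_name",
                # De-duplicate and sort so the filter is deterministic.
                "match": {"any": sorted(set(collection_names))},
            }
        ]
    }


if __name__ == "__main__":
    print(tenant_filter(["file-abc", "kb-handbook"]))
```

Because the filter is applied on every search, a request scoped to one knowledge base cannot return points from another, even though all points live in one physical collection.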

The dense embedding model is the contract that breaks most quietly. If ingestion indexed with intfloat/multilingual-e5-large and "passage: " prefixes, the retrieval-agent must query with the same model and the matching "query: " prefix - otherwise scores are meaningless but no error is raised.
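
Since a mismatch raises no error at runtime, one option is to compare the two services' environments before deploying. The following is a rough sketch of such a check, assuming both environments are available as plain dictionaries; the function name `contract_mismatches` and the `MUST_EQUAL` list are hypothetical, and the variable names are taken from the table above. It does not resolve code defaults, so a variable left unset on either side is skipped rather than flagged.

```python
# Pairs of (ingestion-service variable, retrieval-agent variable) whose
# values must be identical, taken from the contract table.
MUST_EQUAL = [
    ("QDRANT_URI", "QDRANT_URI"),
    ("QDRANT_INDEX", "QDRANT_INDEX"),
    ("EMBEDDING_MODEL", "EMBEDDING_MODEL"),
    ("SPARSE_EMBEDDING_MODEL", "SPARSE_QUERY_MODEL"),
]


def contract_mismatches(
    ingestion_env: dict[str, str], retrieval_env: dict[str, str]
) -> list[str]:
    """Return one human-readable message per setting that disagrees."""
    problems: list[str] = []
    for ing_key, ret_key in MUST_EQUAL:
        ing_val = ingestion_env.get(ing_key)
        ret_val = retrieval_env.get(ret_key)
        if ing_val is None or ret_val is None:
            # Unset on one side: the service default applies, which this
            # sketch does not know, so the pair is not compared.
            continue
        if ing_val != ret_val:
            problems.append(
                f"{ing_key}={ing_val!r} (ingestion) != {ret_key}={ret_val!r} (retrieval)"
            )
    return problems


if __name__ == "__main__":
    ingestion = {
        "QDRANT_INDEX": "ingestion_files",
        "EMBEDDING_MODEL": "intfloat/multilingual-e5-large",
    }
    retrieval = {
        "QDRANT_INDEX": "ingestion_files",
        "EMBEDDING_MODEL": "intfloat/multilingual-e5-base",
    }
    for line in contract_mismatches(ingestion, retrieval):
        print("MISMATCH:", line)
```

The doc/query prefixes are deliberately not compared: they are expected to differ, and whether they form the right pair depends on the embedding model.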

4. Ingestion path

Open WebUI uploads the raw file to S3/MinIO first, then calls PUT /api/v1/ingest with the bucket and key (a multipart body is the fallback for direct uploads). The S3 path is preferred because it keeps large files off the FastAPI worker’s heap. Requests are gated before any work happens: a per-file_id lock serializes concurrent ingests of the same file, MAX_UPLOAD_BYTES (100 MB) caps both multipart parts and S3 fetches (checked via head_object), the S3_ALLOWED_BUCKETS allow-list restricts which buckets may be read, and the caller’s collection_name is validated against the user_id / file_id it claims.
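
Two of the gates above, the bucket allow-list and the size cap, can be illustrated with a small pre-flight check. This is a sketch under stated assumptions rather than the service's code: `gate_s3_ingest` and `RejectedRequest` are hypothetical names, "100 MB" is read as 100 MiB, and the object size is assumed to have been looked up already (for example from a `head_object` call) so nothing is downloaded before the check passes.

```python
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB


class RejectedRequest(Exception):
    """Raised before any pipeline work starts (hypothetical name)."""


def gate_s3_ingest(
    bucket: str,
    size_bytes: int,
    allowed_buckets: frozenset[str],
    max_bytes: int = MAX_UPLOAD_BYTES,
) -> None:
    """Reject a request that names a forbidden bucket or an oversized object."""
    # An empty allow-list means the bucket check is not enforced.
    if allowed_buckets and bucket not in allowed_buckets:
        raise RejectedRequest(f"bucket {bucket!r} is not in S3_ALLOWED_BUCKETS")
    # The size comes from object metadata, so an oversized file is refused
    # without ever being fetched.
    if size_bytes > max_bytes:
        raise RejectedRequest(f"object is {size_bytes} bytes, limit is {max_bytes}")


if __name__ == "__main__":
    gate_s3_ingest("openwebui", 5_000_000, frozenset({"openwebui"}))
    print("request accepted")
```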

The request handler hands the file to a Haystack v2 pipeline built once at startup and cached as module state:

flowchart LR
  F["raw file<br/>(tempfile)"] --> C["Converter<br/>EXTRACTION_ENGINE"]
  C -->|Documents| S["Splitter<br/>CHUNK_SPLIT_BY"]
  S -->|chunks + meta| DE["Dense embedder<br/>EMBEDDING_PROVIDER"]
  DE -. optional .-> SE["Sparse embedder<br/>ENABLE_SPARSE_EMBEDDINGS"]
  DE --> W["Writer<br/>QdrantDocumentStore"]
  SE --> W
  W --> QD[("Qdrant")]

  classDef optional stroke-dasharray:5 5;
  class SE optional;

Convert - turn the raw file into Haystack Documents. The engine is chosen by EXTRACTION_ENGINE: kreuzberg is an HTTP sidecar in the parent stack, pypdf is in-process (PDF-only), docling/unstructured need optional dependencies, and vision-llm/hybrid-diagram are the multimodal vision engines. auto is not an engine but a routing mode that picks one per document. Kreuzberg additionally surfaces document metadata (title, subject, authors, created date, languages) and renders tables as Markdown. See Vision extraction and content-based routing below.
Chunk - slice documents by CHUNK_SPLIT_BY (deployed default markdown; code default token). token mode measures chunk size in the embedding model’s actual HuggingFace tokens (important for e5-large’s 512-token cap once the "passage: " prefix is added); markdown mode splits on heading hierarchy first, records the heading breadcrumb in meta.headers, then token-packs each section over CHUNK_SIZE; word/sentence/passage use Haystack’s built-in splitter. CHUNK_SIZE (400) and CHUNK_OVERLAP (80) bound chunk length, and the tokenizer is TOKENIZER_MODEL (falling back to EMBEDDING_MODEL) pinned to TOKENIZER_REVISION in deployments. Every chunk gets a sequential meta.split_id.
Embed (dense) - required. EMBEDDING_PROVIDER selects openai-compat (the current embed.itkdev.dk path), fastembed (in-process), or tei. EMBEDDING_PREFIX_DOC is prepended to each chunk, and EMBEDDING_DIM (1024 for e5-large) sizes the dense vector Qdrant allocates.
Embed (sparse) - optional. Both deployed stacks run ENABLE_SPARSE_EMBEDDINGS=true, so a second named vector (text-sparse, BM42 via Qdrant/bm42-all-minilm-l6-v2-attentions) is added per chunk, letting retrieval use Qdrant’s native RRF hybrid query instead of client-side BM25. The code default is false, in which case the writer receives dense-only chunks.
Write - DocumentWriter backed by QdrantDocumentStore writes one point per chunk, carrying the dense vector (text-dense) and, when enabled, the sparse vector (text-sparse) as named vectors.
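
To make the interaction of CHUNK_SIZE and CHUNK_OVERLAP in the chunk step concrete, here is a minimal sliding-window packer. It is an illustration only: the function name `pack_tokens` is hypothetical, a list of pre-split tokens stands in for the HuggingFace tokenizer output, and the real splitter may handle section boundaries and tail chunks differently.

```python
def pack_tokens(
    tokens: list[str], chunk_size: int = 400, overlap: int = 80
) -> list[dict]:
    """Pack tokens into windows of chunk_size that share `overlap` tokens.

    Each chunk gets a sequential split_id, mirroring meta.split_id.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[dict] = []
    start = 0
    while start < len(tokens):
        window = tokens[start : start + chunk_size]
        chunks.append({"split_id": len(chunks), "tokens": window})
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


if __name__ == "__main__":
    demo = pack_tokens([str(i) for i in range(1000)])
    for chunk in demo:
        print(chunk["split_id"], chunk["tokens"][0], chunk["tokens"][-1], len(chunk["tokens"]))
```

With the deployed values (400 and 80) the window advances 320 tokens at a time, which leaves headroom below e5-large's 512-token cap for the document prefix.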

Idempotency. With overwrite=true (the default), each ingest is a blue/green version swap rather than a delete-then-write. Every run stamps its chunks with a fresh meta.ingest_version and writes them alongside any existing points for the same file_id; only after the write succeeds with at least one chunk are the stale versions (same file_id, a different ingest_version) swept. If any stage throws, the teardown deletes only the failed run’s version, leaving the previously indexed version intact. The whole sequence is guarded by the per-file_id lock so two concurrent requests can’t interleave. So a status: true response means the file is fully indexed against the new version, a failure leaves the prior version in place (no partial writes leak, and retrieval never sees a half-written index), and retrying the same file_id never duplicates vectors. Open WebUI’s reindex action relies on this. The success body is {status: true, collection_name, chunks_count} (plus an optional extraction object echoing the engine/profile actually used); a failure returns {status: false, error, code} where code is one of EXTRACTION_FAILED, EMBEDDING_FAILED, SPARSE_EMBEDDING_FAILED, QDRANT_WRITE_FAILED, DELETE_FAILED, S3_FETCH_FAILED, INVALID_REQUEST, or PIPELINE_FAILED.
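
The version-swap sequence can be simulated against an in-memory list standing in for Qdrant. This is a sketch of the idea rather than the service's implementation: the `ingest` function, the `fail` switch, and the point layout are hypothetical, the per-file_id lock is omitted, and zero-chunk runs are not modeled.

```python
import uuid


def ingest(store: list[dict], file_id: str, chunks: list[str], fail: bool = False) -> dict:
    """Write a new version next to the old one, then sweep stale versions."""
    if not chunks:
        raise ValueError("zero-chunk runs are not modeled in this sketch")
    version = uuid.uuid4().hex  # fresh ingest_version for this run
    try:
        for text in chunks:
            store.append({"file_id": file_id, "ingest_version": version, "text": text})
        if fail:
            raise RuntimeError("simulated failure after a partial write")
    except Exception as exc:
        # Teardown: remove only this run's points; the prior version stays.
        store[:] = [
            p for p in store
            if not (p["file_id"] == file_id and p["ingest_version"] == version)
        ]
        return {"status": False, "error": str(exc), "code": "PIPELINE_FAILED"}
    # The write succeeded, so older versions of the same file can be swept.
    store[:] = [
        p for p in store
        if p["file_id"] != file_id or p["ingest_version"] == version
    ]
    return {"status": True, "chunks_count": len(chunks)}


if __name__ == "__main__":
    qdrant: list[dict] = []
    print(ingest(qdrant, "file-abc", ["a", "b"]))
    print(ingest(qdrant, "file-abc", ["c"], fail=True))
    print(ingest(qdrant, "file-abc", ["x", "y", "z"]))
    print(len(qdrant), "points stored")
```

The ordering is the point of the design: old points are removed only after the new ones are safely written, so a failed retry can never leave a file with no indexed version.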

Vision extraction and content-based routing

Two vision engines and one routing mode cover documents whose meaning lives in their layout rather than their text: flowcharts, diagrams, and scanned forms that a plain text extractor would flatten or drop.

vision-llm renders the document to page images and reconstructs it with a multimodal LLM. Office formats are converted to PDF by the Gotenberg sidecar; the PDF is rasterized locally (pypdfium2) at VISION_LLM_DPI (150), capped at VISION_LLM_MAX_PAGES (20); the pages go to VISION_LLM_API_BASE_URL (model VISION_LLM_MODEL) in a single call. A prompt profile shapes the output — diagram (lanes/phases/steps plus a Mermaid flowchart), general (faithful full-page Markdown), or ocr (plain-text transcription), with diagram-topology and figure used internally by the hybrid engine. VISION_LLM_LANGUAGE_HINT (Danish) keeps the source language verbatim, and output exceeding VISION_LLM_MAX_TOKENS fails the extraction rather than truncating silently.
hybrid-diagram (docx only) pairs the authoritative native text read straight from the package XML (verbatim labels, nothing OCR-guessed) with a vision-inferred Mermaid graph. It picks diagram-topology for a vector flowchart (labels are Word shapes) or figure for a raster PNG diagram (labels are pixels); a non-docx input falls through to plain vision-llm. It wraps vision-llm, so it needs the same VISION_LLM_* + Gotenberg config. This is the default diagram engine for auto.
auto routes each document by inspecting it (detectors.py). A .docx is sent to EXTRACTION_ROUTER_DIAGRAM_ENGINE (default hybrid-diagram) when either signal fires; everything else goes to EXTRACTION_ROUTER_DEFAULT (kreuzberg in deployments):
- Vector flowchart - drawing/text-box shapes ≥ EXTRACTION_ROUTER_MIN_TEXTBOXES (20) and drawing-to-body word ratio ≥ EXTRACTION_ROUTER_DRAWING_RATIO (2.0).
- Raster figure - body images ≥ EXTRACTION_ROUTER_MIN_BODY_IMAGES (1) with the largest <wp:extent> display area ≥ EXTRACTION_ROUTER_MIN_IMAGE_EMU (1_500_000_000_000 EMU² ≈ 1.79 in²). The area floor — not a count — rejects decoration; header/footer logos are excluded for free because only word/document.xml is read. The opt-in EXTRACTION_ROUTER_MIN_IMAGE_WORD_RATIO (0 = off) adds an image-area-to-body-words guard for corpora with large decorative hero photos.

flowchart TD
  A[document] --> B{".docx?"}
  B -- no --> D[EXTRACTION_ROUTER_DEFAULT<br/>kreuzberg]
  B -- yes --> C{"vector-flowchart<br/>OR raster-figure<br/>signal?"}
  C -- no --> D
  C -- yes --> E[EXTRACTION_ROUTER_DIAGRAM_ENGINE<br/>hybrid-diagram]
  E --> F{"vector vs raster"}
  F -- vector --> G[diagram-topology profile]
  F -- raster --> H[figure profile]
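
The routing decision can be expressed as a small pure function over the detector signals. The sketch below is an assumption-laden illustration: `DocxSignals` and `choose_engine` are hypothetical names, how detectors.py actually computes the signals is not shown, a document with no drawing words is assumed never to trigger the vector signal, and the opt-in EXTRACTION_ROUTER_MIN_IMAGE_WORD_RATIO guard is left out. The thresholds are the defaults quoted above.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TEXTBOXES = 20                 # EXTRACTION_ROUTER_MIN_TEXTBOXES
DRAWING_RATIO = 2.0                # EXTRACTION_ROUTER_DRAWING_RATIO
MIN_BODY_IMAGES = 1                # EXTRACTION_ROUTER_MIN_BODY_IMAGES
MIN_IMAGE_EMU = 1_500_000_000_000  # EXTRACTION_ROUTER_MIN_IMAGE_EMU (EMU^2)


@dataclass
class DocxSignals:
    """Hypothetical container for what the detectors measured."""
    textboxes: int
    drawing_words: int
    body_words: int
    body_images: int
    largest_image_area: int  # display area in EMU^2


def choose_engine(
    filename: str,
    signals: Optional[DocxSignals],
    default: str = "kreuzberg",
    diagram: str = "hybrid-diagram",
) -> str:
    """Return the engine name the auto mode would pick for one document."""
    if signals is None or not filename.lower().endswith(".docx"):
        return default
    # Compare by multiplication to avoid dividing by zero body words.
    ratio_ok = (
        signals.drawing_words > 0
        and signals.drawing_words >= DRAWING_RATIO * signals.body_words
    )
    vector_flowchart = signals.textboxes >= MIN_TEXTBOXES and ratio_ok
    raster_figure = (
        signals.body_images >= MIN_BODY_IMAGES
        and signals.largest_image_area >= MIN_IMAGE_EMU
    )
    return diagram if (vector_flowchart or raster_figure) else default


if __name__ == "__main__":
    flowchart = DocxSignals(25, 300, 100, 0, 0)
    print(choose_engine("process.docx", flowchart))
```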

Seeing what was chosen. The /api/v1/ingest response (in its optional extraction object) and the developer-facing /api/v1/extract probe described below both echo the engine and profile used, and every chunk written by a vision/hybrid or auto-routed path carries meta.extractor / meta.vision_profile. The per-document routing decision (which engine/profile was chosen) is logged at INFO; the detailed detector signals behind it (textbox counts, drawing ratios, image areas) are logged at DEBUG, so set LOG_LEVEL_APP=DEBUG to see them. (DEBUG=true is a separate operator error-triage switch, not the log-verbosity dial.)

There is also a developer-facing POST /api/v1/extract probe that runs only the converter and returns the raw extracted documents - useful for comparing extraction engines without a full ingest. It takes an optional engine override (any concrete engine, but not auto) and an optional profile (accepted only by the vision engines), and responds with {status, engine, profile, documents}.

Deleting and inspecting documents

Two more endpoints round out the document lifecycle:

DELETE /api/v1/documents/{file_id} removes every point for a file_id (all versions) from Qdrant, returning a DeleteResponse. Failures surface as the DELETE_FAILED code. Open WebUI calls this when a file is removed from a knowledge base.
GET /api/v1/documents/{file_id}/chunks is a read-only inspection endpoint that returns the stored chunks and their payload for a file - useful for debugging extraction/chunking. It is gated by ENABLE_INSPECTION_API (default false) so it is off unless explicitly enabled.

5. Retrieval path

Open WebUI calls POST /search (authenticated with RAG_EXTERNAL_RETRIEVAL_API_KEY) with either explicit queries or the raw chat messages, plus the collection_names to search and a k. The response is one document list per query, each with a parallel distances list – higher always means more relevant, but the score scale depends on the active path: normalized cosine in [0, 1] for plain dense search, Qdrant’s raw RRF fusion scores when hybrid is enabled, and the cross-encoder’s relevance scores when reranking is enabled. ENABLE_AGENTIC_RAG selects which pipeline runs.

Linear mode (`ENABLE_AGENTIC_RAG=false`)

A fast, deterministic pipeline:

flowchart TD
  A[POST /search] --> A1{messages +<br/>query generation?}
  A1 -- yes --> A2[LLM: generate<br/>optimized queries]
  A1 -- no --> A3[explicit queries /<br/>last user message]
  A2 --> B[Embed dense<br/>+ optional sparse]
  A3 --> B
  B --> D{hybrid?}
  D -- no --> Cdense[Qdrant dense search]
  D -- yes --> S{collection has<br/>text-sparse?}
  S -- yes --> Cnative[Qdrant Query API<br/>server-side RRF]
  S -- no --> Cbm25[dense search +<br/>client-side BM25 RRF]
  Cdense --> F{rerank?}
  Cnative --> F
  Cbm25 --> F
  F -- yes --> G[cross-encoder rerank]
  F -- no --> H[dedup by MD5 + limit k]
  G --> H
  H --> I[Response]

The hybrid branch is capability-aware: if the collection carries a text-sparse vector (because ingestion wrote one), it uses Qdrant’s server-side RRF; otherwise it falls back to scrolling the filtered content into memory for client-side BM25 fusion. When reranking is on, the initial fetch is k × INITIAL_RETRIEVAL_MULTIPLIER candidates, rescored down to k by an external /v1/rerank endpoint (default embed.itkdev.dk); if that call fails the pipeline falls back to the unranked candidates rather than erroring. Results are deduplicated by content hash.
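
For readers unfamiliar with reciprocal rank fusion (RRF), the following sketch shows the general formula together with hash-based deduplication. It illustrates the technique, not the retrieval-agent's code: `rrf_fuse` and `dedup_by_content` are hypothetical names, the constant `k=60` is the conventional RRF value rather than a documented setting of this service, and in the native path Qdrant performs the fusion server-side.

```python
import hashlib


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score means more relevant.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def dedup_by_content(texts: list[str], limit: int) -> list[str]:
    """Keep the first occurrence of each distinct text, up to `limit` items."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        if len(unique) >= limit:
            break
        # MD5 is used only as a content fingerprint, not for security.
        digest = hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]
    sparse_hits = ["b", "d", "a"]
    for doc_id, score in rrf_fuse([dense_hits, sparse_hits]):
        print(doc_id, round(score, 5))
```

RRF only uses rank positions, which is why the fused scores are on a different scale from cosine similarity and should not be compared with it.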

Agentic mode (`ENABLE_AGENTIC_RAG=true`)

A PydanticAI tool-calling loop that reuses the same retrieval helpers but lets an LLM drive query formulation and relevance grading:

flowchart TD
  A[POST /search] --> B[Build prompt from last<br/>N conversation messages]
  B --> C[Agent: generate 1-2 queries]
  C --> D{retrieve tool called?}
  D -- no --> FB[parse fallback queries<br/>+ direct vector search]
  D -- yes --> H["retrieve(queries) tool<br/>embed → search → ± rerank"]
  H --> H4[return truncated previews<br/>full results kept side-channel]
  H4 --> K{any result on-topic?}
  K -- yes --> M[dedup full results + limit k]
  K -- no --> N{iter < AGENT_MAX_ITERATIONS?}
  N -- yes --> C
  N -- no --> M
  FB --> M
  M --> O[Response<br/>full text from side-channel]

Key behaviours:

The agent is told to generate 1–2 queries, call retrieve once, accept if any returned document is on-topic, and only retry with rewritten queries when results are completely off-topic.
A raw-query seed (AGENT_INCLUDE_RAW_QUERY, default on) runs one deterministic retrieval of the user’s original query before the agent loop and merges those hits into the results, so the answer never depends solely on the LLM’s rewrites.
A side-channel (AgentDeps.full_results) accumulates the full retrieval results across iterations, while the tool only returns truncated previews to the LLM (AGENT_PREVIEW_K items, AGENT_TOOL_PREVIEW_CHARS each) so the context window doesn’t balloon on retries. The final response uses the full text, not the previews.
If the model emits queries as plain text instead of a tool call, a fallback parses them and runs a direct vector search (rerank skipped).
AGENT_TIMEOUT is wall-clock; on timeout whatever was already retrieved is returned (partial results, not a 500).
Agentic mode adds roughly 2–5× latency and 2–4× token cost per query.
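
The side-channel pattern can be shown without PydanticAI. In the sketch below, only the `AgentDeps.full_results` name comes from the text; the `retrieve_tool` function, the `preview_k` and `preview_chars` fields (standing in for AGENT_PREVIEW_K and AGENT_TOOL_PREVIEW_CHARS), and their default values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentDeps:
    """Per-request state shared between tool calls (simplified stand-in)."""
    preview_k: int = 3        # hypothetical default
    preview_chars: int = 200  # hypothetical default
    full_results: list[str] = field(default_factory=list)


def retrieve_tool(deps: AgentDeps, hits: list[str]) -> list[str]:
    """Store full hits on the side-channel; return short previews to the LLM."""
    deps.full_results.extend(hits)
    # The model only sees the first preview_k hits, each cut to preview_chars,
    # so repeated retries do not inflate its context window.
    return [hit[: deps.preview_chars] for hit in hits[: deps.preview_k]]


if __name__ == "__main__":
    deps = AgentDeps(preview_k=2, preview_chars=12)
    previews = retrieve_tool(deps, ["first document body", "second one", "third one"])
    print("LLM sees:", previews)
    print("response uses:", deps.full_results)
```

Keeping the full text outside the model's context also explains the timeout behaviour: whatever has accumulated in `full_results` can be returned even if the agent loop is cut short.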

6. Configuration touchpoints

All settings are environment variables. The tables below show only the cross-service and Open WebUI-facing ones; each repo’s .env.example is the full list.

Open WebUI → the two services

Open WebUI variable	Points at	Must equal
`EXTERNAL_INGESTION_ENGINE=external`	ingestion-service	- (switch that enables external ingestion)
`EXTERNAL_INGESTION_API_KEY`	ingestion-service	ingestion-service `API_KEY`
`RAG_EXTERNAL_RETRIEVAL_API_KEY`	retrieval-agent	retrieval-agent `API_KEY`

In the parent docker-compose, a single INGESTION_API_KEY is forked into the ingestion-service’s API_KEY and Open WebUI’s EXTERNAL_INGESTION_API_KEY; likewise RETRIEVAL_API_KEY is forked into the retrieval-agent’s API_KEY and Open WebUI’s RAG_EXTERNAL_RETRIEVAL_API_KEY - you do not set those by hand. The retrieval-agent’s agent LLM key is separate (RETRIEVAL_AGENT_API_KEY → AGENT_API_KEY).

ingestion-service (selected)

Defaults below are the values the deployed AarhusAI stacks set; where they differ from the service’s own code/.env.example default, the Notes column says so.

Variable	Default	Notes
`API_KEY`	(required)	Must equal Open WebUI’s `EXTERNAL_INGESTION_API_KEY`
`QDRANT_INDEX`	`ingestion_files`	Physical collection; must match retrieval-agent
`EMBEDDING_MODEL`	`intfloat/multilingual-e5-large`	Must match retrieval-agent
`EMBEDDING_DIM`	`1024`	Dense vector size; must match the model and retrieval-agent
`EMBEDDING_PREFIX_DOC`	`"passage: "`	Doc-side prefix (keep the trailing space)
`EXTRACTION_ENGINE`	`auto` (both stacks)	Full set `pypdf`/`kreuzberg`/`docling`/`unstructured`/`vision-llm`/`hybrid-diagram`/`auto`; code default `kreuzberg`
`EXTRACTION_ROUTER_DEFAULT`	`kreuzberg`	Non-diagram engine, consulted only when `EXTRACTION_ENGINE=auto`
`EXTRACTION_ROUTER_DIAGRAM_ENGINE`	`hybrid-diagram`	Diagram engine, consulted only when `EXTRACTION_ENGINE=auto`
`VISION_LLM_API_BASE_URL`	(empty)	Multimodal endpoint; empty disables the vision engines
`VISION_LLM_MODEL`	`AarhusAI-default-v2`	Model served by the vision endpoint (`gemma4-nvfp4` code default)
`GOTENBERG_URL`	`http://gotenberg:3000`	office→PDF sidecar for the vision engines
`CHUNK_SPLIT_BY`	`markdown`	`token`/`markdown`/`word`/`sentence`/`passage`; code default `token`
`CHUNK_SIZE`	`400`	Chunk length (HF tokens in token/markdown modes)
`CHUNK_OVERLAP`	`80`	Overlap between chunks
`TOKENIZER_REVISION`	(pinned)	Deploys pin the e5-large tokenizer SHA; empty = Hub HEAD
`ENABLE_SPARSE_EMBEDDINGS`	`true`	Adds the sparse vector enabling native hybrid retrieval; code default `false`
`SPARSE_EMBEDDING_MODEL`	`Qdrant/bm42-all-minilm-l6-v2-attentions`	BM42; must match retrieval-agent when hybrid is on
`S3_ALLOWED_BUCKETS`	`openwebui`	Allow-list of buckets the service may fetch from (empty = unenforced)
`ENABLE_INSPECTION_API`	`false`	Enables the read-only `GET /api/v1/documents/{file_id}/chunks` inspection endpoint
`METRICS_ENABLED`	`true`	Exposes the bearer-authed Prometheus `GET /metrics` endpoint

Both the dev and server stacks run EXTRACTION_ENGINE=auto, ship the Gotenberg sidecar, and point the vision engines (vision-llm / hybrid-diagram) at VISION_LLM_API_BASE_URL (the LiteLLM proxy at litellm.itkdev.dk in production). Both also enable sparse embeddings and markdown chunking, so the code defaults (kreuzberg, dense-only, token) describe an unconfigured service rather than either deployment.

retrieval-agent (selected)

Variable	Default	Notes
`API_KEY`	(required)	Must equal Open WebUI’s `RAG_EXTERNAL_RETRIEVAL_API_KEY`
`QDRANT_INDEX`	`ingestion_files`	Must match ingestion-service
`EMBEDDING_MODEL`	`intfloat/multilingual-e5-large`	Must match ingestion-service
`EMBEDDING_PREFIX_QUERY`	`"query: "`	Query-side prefix
`ENABLE_HYBRID_SEARCH`	`false`	Native sparse+dense when available, else BM25 fallback
`ENABLE_RERANKING`	`false`	Cross-encoder rerank stage
`ENABLE_QUERY_GENERATION`	`true`	LLM query generation from chat (linear pipeline)
`ENABLE_AGENTIC_RAG`	`false`	Route to the agentic loop instead of the linear pipeline
`AGENT_MODEL`	`gpt-4o-mini`	Model for agent decisions / query generation

7. Operations and health

Both services expose the same probe pair:

GET /health - liveness; 200 whenever the process is up.
GET /health/ready - readiness; 503 until dependencies are reachable. The ingestion-service additionally waits for its Haystack pipeline to warm up (with sparse enabled — the deployed norm — the BM42 embedder pulls its ~80 MB model from HuggingFace, cached in the ingestion_model_cache volume so it survives restarts); the retrieval-agent verifies Qdrant connectivity. This keeps Docker/Kubernetes from routing traffic during cold start. Readiness gates only the pipeline warm-up and Qdrant — Gotenberg and the Vision LLM endpoint are not probed, so a vision dependency being down surfaces per-request as an EXTRACTION_FAILED rather than blocking startup.

Both services also expose GET /metrics - a Prometheus scrape endpoint, bearer-authenticated with the same API_KEY and gated by METRICS_ENABLED (default on; returns 404 when disabled).
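
A deployment script or smoke test can wait on the readiness probe before sending traffic. The helper below is a minimal sketch using only the standard library: `http_status` and `wait_until_ready` are hypothetical names, and the base URL in the usage example is a placeholder, not a documented address.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while dependencies are still warming up
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(
    probe: Callable[[], int],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe() == 200:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    base = "http://localhost:8000"  # placeholder address
    ok = wait_until_ready(lambda: http_status(f"{base}/health/ready"), timeout_s=10)
    print("ready" if ok else "not ready")
```

Polling /health/ready rather than /health matters here, because liveness returns 200 as soon as the process is up while the pipeline may still be warming.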

8. Reference

ingestion-service - https://github.com/AarhusAI/ingestion-service
retrieval-agent - https://github.com/AarhusAI/retrieval-agent
Gotenberg - office→PDF render sidecar used by the vision extraction engines
Patches - the Open WebUI patches applied in this stack