Basic RAG with Spring AI: Build a Grounded Q&A System from Scratch

Large language models are impressive, but they have a fundamental limitation: they can only work with what they learned during training. Ask about your company’s internal docs, last week’s release notes, or anything after the training cutoff — and the model will either hallucinate an answer or admit it doesn’t know.

Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to your own documents at query time. Instead of relying on memorised knowledge, the model receives relevant context retrieved from a knowledge base and generates an answer grounded in that context.

This is the first post in a hands-on series exploring RAG patterns with Spring AI. Each post maps to a demo in the rag-spring-ai project — starting from a minimal retrieval pipeline and progressively layering in production concerns like multi-format ingestion, conversational memory, advisor composition, structured output, and function calling. Every demo runs fully offline with Ollama and Docker Compose — no cloud accounts required.

In this first installment, we build a complete Basic RAG system from scratch: ingest a document, embed it into a vector store, and answer questions grounded in the retrieved context. Every line of code comes from Demo 1: Basic RAG.

1. What Is RAG?

RAG stands for Retrieval-Augmented Generation. The idea is simple: before asking the LLM to answer a question, first retrieve the most relevant pieces of your documents and include them as context in the prompt.

Figure: A plain LLM answers from memory alone and may hallucinate. A RAG-augmented LLM retrieves relevant documents first, producing grounded, accurate answers.

Why RAG Matters

Problem with plain LLMs	How RAG solves it
Hallucination — makes up facts	Grounds answers in real documents
Outdated knowledge — training cutoff	Can use up-to-the-minute data
No access to private data	Connects to your own document store
No source attribution	Can cite where the answer came from

RAG is usually the most cost-effective way to give an LLM access to your own data. Unlike fine-tuning (which modifies model weights and requires GPU hours), RAG leaves the model untouched: you’re only changing what goes into the prompt.

2. The Two Phases of RAG

Every RAG system has two distinct phases: ingestion (offline, one-time) and query (online, per request).

Phase 1: Ingestion — Preparing Your Documents

Before the LLM can answer questions about your documents, those documents need to be loaded, split into chunks, converted to vector embeddings, and stored in a vector database.

Figure: The four-step ingestion pipeline. Documents are loaded, split into token-sized chunks, embedded as 768-dimensional vectors, and stored in PgVectorStore for later retrieval.

Here’s what each step does:

Load: Read the raw document into a Document object. Spring AI provides TextReader for .txt files, JsonReader for JSON, and TikaDocumentReader for PDF, DOCX, HTML, and other formats.
Split: A large document is too coarse (and often too long) to embed as a single vector. TokenTextSplitter breaks it into smaller chunks, by default targeting 800 tokens per chunk. Its other default, 350, is the minimum chunk size in characters rather than an overlap: the splitter prefers to cut at sentence punctuation once a chunk is at least that long, and adjacent chunks do not share text.
Embed: Each chunk is converted to a 768-dimensional floating-point vector using an embedding model (nomic-embed-text via Ollama). Texts with similar meaning end up close together in vector space.
Store: The vectors are written to a PgVectorStore (PostgreSQL with the pgvector extension), which builds an HNSW index for fast approximate nearest-neighbour search.
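To make "close together in vector space" concrete, here is a minimal plain-Java sketch of cosine similarity and the matching cosine distance (1 minus the similarity), which is what a COSINE_DISTANCE index ranks by. The class name and the toy three-dimensional vectors are made up for illustration; in the demo this computation happens inside PostgreSQL, not in application code.

```java
// Illustrative only: in the demo, pgvector computes distances inside PostgreSQL.
final class CosineSimilarity {

    /** Returns cos(theta) between a and b, in [-1, 1]; higher means more similar. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                    "dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Cosine distance: 0 means same direction, 2 means opposite direction. */
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosine(a, b);
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings"; real ones here have 768 dimensions.
        double[] cat = {0.9, 0.1, 0.0};
        double[] kitten = {0.8, 0.2, 0.1};
        double[] invoice = {0.0, 0.1, 0.9};
        System.out.println(cosine(cat, kitten) > cosine(cat, invoice)); // prints true
    }
}
```

The dimension check mirrors why the configured vector size has to match the embedding model: two vectors of different lengths cannot be compared at all.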

Phase 2: Query — Answering Questions

When a user asks a question, the system embeds the question using the same embedding model, searches the vector store for the most similar document chunks, and passes those chunks as context to the LLM alongside the question.

Figure: The query flow — the user's question is embedded, matched against stored vectors, and the top-K relevant chunks are injected into the LLM prompt as context. The LLM produces a grounded answer.

The key insight: the LLM never sees your entire document. It only sees the chunks that are semantically relevant to the question. This keeps the prompt focused and within token limits.
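The retrieval step can be sketched in a few lines of plain Java. This is an exact, brute-force version over an in-memory list, assuming a hypothetical Chunk record with a precomputed embedding; the demo instead delegates the search to pgvector, which answers the same question approximately through its HNSW index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopKRetrieval {

    /** A stored chunk: its text plus a precomputed embedding (hypothetical type). */
    record Chunk(String text, double[] embedding) {}

    /** Cosine similarity; assumes equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Exact search: score every chunk against the query and keep the k best. */
    static List<Chunk> topK(double[] query, List<Chunk> store, int k) {
        List<Chunk> sorted = new ArrayList<>(store);
        sorted.sort(Comparator
                .comparingDouble((Chunk c) -> cosine(query, c.embedding()))
                .reversed()); // most similar first
        return List.copyOf(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Only the chunks returned by topK would reach the prompt; everything else in the store stays out of the model's view.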

3. Technology Stack

Everything runs locally — no cloud accounts, no API keys:

Layer	Technology	Role
Framework	Spring Boot 3.5 + Spring AI 1.1	Web app + AI abstractions
Chat model	Qwen 3 4B via Ollama	Generates answers
Embedding model	nomic-embed-text via Ollama	Converts text to 768-dim vectors
Vector store	PostgreSQL 16 + pgvector	Persistent similarity search
Infrastructure	Docker Compose	Runs PostgreSQL + Ollama

Figure: The three-tier architecture. The Spring Boot app (BasicRagController, BasicRagService, QuestionAnswerAdvisor, and ChatClient) orchestrates the RAG pipeline, Ollama provides the LLM and embedding models, and PostgreSQL with pgvector stores the document vectors.

4. Project Setup

Dependencies (pom.xml)

The project uses Spring AI’s BOM for version management. The key dependencies:

<properties>
    <java.version>25</java.version>
    <spring-ai.version>1.1.4</spring-ai.version>
</properties>

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>${spring-ai.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- Spring AI Ollama Starter (Chat + Embeddings) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-ollama</artifactId>
    </dependency>

    <!-- Spring AI PgVector Store (persistent vector store) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>

    <!-- QuestionAnswerAdvisor (the RAG advisor) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-advisors-vector-store</artifactId>
    </dependency>
</dependencies>

spring-ai-starter-model-ollama auto-configures a ChatModel and EmbeddingModel backed by Ollama.
spring-ai-starter-vector-store-pgvector auto-configures a PgVectorStore backed by PostgreSQL with pgvector.
spring-ai-advisors-vector-store provides QuestionAnswerAdvisor — the component that wires retrieval into the chat pipeline.

Configuration (application.yaml)

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/ragdb
    username: raguser
    password: ragpassword

  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: qwen3:4b
          temperature: 0.7
          num-predict: 512
      embedding:
        options:
          model: nomic-embed-text

    vectorstore:
      pgvector:
        initialize-schema: true    # auto-creates the vector_store table
        dimensions: 768            # must match nomic-embed-text output
        index-type: HNSW
        distance-type: COSINE_DISTANCE

Key points:

initialize-schema: true tells Spring AI to create the vector_store table and pgvector extension automatically on first startup.
dimensions: 768 must match the output dimension of your embedding model (nomic-embed-text produces 768-dimensional vectors).
HNSW is an approximate nearest-neighbour index — much faster than brute-force search at the cost of slightly imprecise results.

Infrastructure (docker-compose.yml)

services:
  postgres:
    image: pgvector/pgvector:pg16
    container_name: rag-postgres
    environment:
      POSTGRES_DB: ragdb
      POSTGRES_USER: raguser
      POSTGRES_PASSWORD: ragpassword
    ports:
      - "5432:5432"

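  ollama:
    image: ollama/ollama:latest
    container_name: rag-ollama
    ports:
      - "11434:11434"
    healthcheck:                          # required by the service_healthy condition below
      test: ["CMD", "ollama", "list"]     # succeeds once the server answers
      interval: 10s
      timeout: 5s
      retries: 10

  ollama-init:
    image: ollama/ollama:latest
    container_name: rag-ollama-init
    depends_on:
      ollama:
        condition: service_healthy
    environment:
      OLLAMA_HOST: http://ollama:11434    # point the CLI at the ollama service, not localhost
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama pull qwen3:4b
        ollama pull nomic-embed-text
        echo "All models ready!"
    restart: "no"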

Three containers:

rag-postgres — PostgreSQL 16 with the pgvector extension pre-installed.
rag-ollama — The Ollama server that hosts the LLM and embedding model.
rag-ollama-init — A one-shot container that pulls the required models on first run (~3 GB download).

5. The Implementation — Step by Step

The entire Basic RAG demo lives in two files: BasicRagService.java (the logic) and BasicRagController.java (the HTTP layer). Let’s walk through each.

5.1 BasicRagService — Ingestion

The service loads a document, splits it into chunks, and stores the embeddings:

@Service
public class BasicRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final Resource overviewDocument;
    private final AtomicBoolean ingested = new AtomicBoolean(false);

    public BasicRagService(
            ChatClient.Builder chatClientBuilder,
            VectorStore vectorStore,
            @Value("classpath:documents/sample/spring-ai-overview.txt")
            Resource overviewDocument) {
        this.chatClient = chatClientBuilder
                .defaultSystem("You are an expert on Spring AI. "
                        + "Answer questions using the provided context.")
                .defaultAdvisors(new SimpleLoggerAdvisor())
                .build();
        this.vectorStore = vectorStore;
        this.overviewDocument = overviewDocument;
    }

    public void ingestDocuments() {
        if (!ingested.compareAndSet(false, true)) {
            return; // already ingested — skip
        }

        // 1. Load document
        var reader = new TextReader(overviewDocument);
        reader.getCustomMetadata().put("source", "spring-ai-overview.txt");
        List<Document> documents = reader.get();

        // 2. Split into chunks
        var splitter = new TokenTextSplitter();
        List<Document> chunks = splitter.apply(documents);

        // 3. Store (embedding is computed automatically by VectorStore)
        vectorStore.add(chunks);
    }
}

Let’s break this down:

Constructor injection — Spring AI auto-configures a ChatClient.Builder (backed by Ollama) and a VectorStore (backed by PgVector). We inject both, plus the document resource.

ChatClient.Builder — We customise the client with:

A system prompt that tells the LLM its role: “You are an expert on Spring AI.”
A SimpleLoggerAdvisor that logs every LLM call at DEBUG level.

The AtomicBoolean guard uses compareAndSet(false, true) so that only one caller runs ingestion per application run, even if multiple requests arrive concurrently. This matters because the vector store is persistent: every extra ingestion would add duplicate chunks. Keep in mind that the flag lives in memory, so it resets when the application restarts.

Step 1: TextReader — Reads the .txt file from the classpath into a Document object. We attach "source" metadata so we can trace which document a chunk came from.

Step 2: TokenTextSplitter splits the document into embeddable chunks. Default settings: a target of 800 tokens per chunk, with 350 as the minimum chunk size in characters (it is not an overlap). The splitter tries to end each chunk at sentence punctuation, which keeps sentences intact; adjacent chunks do not share text.

Step 3: vectorStore.add() — This single call does two things:

Sends each chunk to the embedding model (nomic-embed-text) to get a 768-dimensional vector.
Inserts the vector + text + metadata into the PostgreSQL vector_store table.
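The compare-and-set guard from ingestDocuments() can be exercised on its own. The sketch below is a stand-alone simulation, not code from the demo: the class name is hypothetical, and a counter stands in for the real load, split, and store work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class IngestOnceGuard {

    private final AtomicBoolean ingested = new AtomicBoolean(false);
    private final AtomicInteger runs = new AtomicInteger();

    /** Returns true only for the single caller that wins the compare-and-set. */
    boolean ingestOnce() {
        if (!ingested.compareAndSet(false, true)) {
            return false; // another caller already claimed the work
        }
        runs.incrementAndGet(); // stand-in for load, split, and vectorStore.add(...)
        return true;
    }

    int runs() {
        return runs.get();
    }

    /** Fires `threads` concurrent calls and reports how many times the work ran. */
    static int simulate(int threads) {
        IngestOnceGuard guard = new IngestOnceGuard();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await(); // line all workers up, then release together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                guard.ingestOnce();
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return guard.runs();
    }
}
```

One design consequence worth noticing: the flag flips before the work finishes, so a concurrent caller that loses the race returns immediately even though ingestion may still be in progress, and a failed ingestion is not retried.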

5.2 BasicRagService — Querying

The ask() method is where RAG happens:

public String ask(String question) {
    ingestDocuments(); // ensure documents are loaded

    return chatClient.prompt()
            .advisors(QuestionAnswerAdvisor.builder(vectorStore).build())
            .user(question)
            .call()
            .content();
}

That’s it: one fluent call chain covers the entire RAG query pipeline. Here’s what QuestionAnswerAdvisor does behind the scenes:

Embeds the question: converts "What is RAG?" into a 768-dimensional vector using the same embedding model.
Searches the vector store: runs a cosine-distance nearest-neighbour search over the stored chunks (approximate, via the HNSW index) and returns the top-K most relevant results (default K=4).
Augments the prompt: injects the retrieved chunks into the prompt as context. In Spring AI’s default template the context is appended to the user message, after the question; the system prompt is left unchanged.
Calls the LLM: sends the complete prompt (system + question + context) to qwen3:4b via Ollama.
Returns the answer: the LLM generates a response grounded in the retrieved context.
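The augmentation step is, at its core, string templating. Here is a minimal sketch of that idea in plain Java. The template wording and layout are assumptions made for illustration, not Spring AI's actual default template, and the class and method names are hypothetical.

```java
import java.util.List;

final class PromptAugmentation {

    /**
     * Builds the augmented user message: the original question followed by
     * a delimited block holding the retrieved chunks.
     */
    static String augment(String question, List<String> chunks) {
        String context = String.join("\n\n", chunks);
        return question + "\n\n"
                + "Context information is below.\n"
                + "---------------------\n"
                + context + "\n"
                + "---------------------\n"
                + "Answer using only the context above. "
                + "If the answer is not in the context, say so.";
    }

    public static void main(String[] args) {
        System.out.println(augment(
                "What is RAG?",
                List.of("RAG retrieves relevant chunks before generation.",
                        "Retrieved chunks are added to the prompt as context.")));
    }
}
```

The closing instruction in the template is what nudges the model to stay grounded: without it, the model is free to mix the retrieved text with whatever it remembers from training.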

The ChatClient fluent API reads like natural language:

chatClient
    .prompt()                           // start building a prompt
    .advisors(QuestionAnswerAdvisor...) // add RAG retrieval
    .user(question)                     // set the user's question
    .call()                             // send to LLM
    .content();                         // extract the text response

5.3 BasicRagController — The HTTP Layer

The controller is a thin REST layer that delegates everything to the service:

@Validated
@RestController
@RequestMapping("/api/basic")
public class BasicRagController {

    private final BasicRagService ragService;

    public BasicRagController(BasicRagService ragService) {
        this.ragService = ragService;
    }

    @PostMapping("/ask")
    public Map<String, String> ask(
            @Valid @RequestBody QuestionRequest request) {
        String answer = ragService.ask(request.question());
        return Map.of(
                "question", request.question(),
                "answer", answer
        );
    }

    @PostMapping("/ingest")
    public Map<String, String> ingest() {
        ragService.ingestDocuments();
        return Map.of("status", "Documents ingested successfully");
    }
}

The request DTO uses a Java record with Bean Validation:

public record QuestionRequest(@NotBlank String question) {}

Two endpoints:

POST /api/basic/ingest — triggers document ingestion manually.
POST /api/basic/ask — accepts a JSON body with a question field, runs the RAG pipeline, and returns the grounded answer.
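If you would rather call the ask endpoint from Java than from curl, the JDK's built-in java.net.http API is enough. The sketch below only builds the request; the class name and the hand-rolled JSON escaping are assumptions for illustration (a real client would more likely use a JSON library).

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class AskRequestFactory {

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.toString();
    }

    /** JSON body with the single "question" field the endpoint expects. */
    static String body(String question) {
        return "{\"question\": \"" + jsonEscape(question) + "\"}";
    }

    /** Builds (but does not send) a POST to {baseUrl}/api/basic/ask. */
    static HttpRequest askRequest(String baseUrl, String question) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/basic/ask"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(question)))
                .build();
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` while the application is running on localhost:8080.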

6. Running the Demo

Start the infrastructure

# Clone the project
git clone https://github.com/gdunhao/rag-spring-ai.git
cd rag-spring-ai

# Start PostgreSQL + Ollama (first run downloads ~3 GB of models)
docker compose up -d

# Wait for models to be pulled
docker compose logs -f ollama-init
# Wait until you see: "All models ready!"

Start the application

./mvnw spring-boot:run

Ingest documents and ask a question

# 1. Ingest the sample document
curl -s -X POST http://localhost:8080/api/basic/ingest | jq

# 2. Ask a RAG-powered question
curl -s -X POST http://localhost:8080/api/basic/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAG and how does it work?"}' | jq

Example response:

{
  "question": "What is RAG and how does it work?",
  "answer": "RAG (Retrieval-Augmented Generation) works by: 1. Ingesting documents and splitting them into chunks, 2. Converting each chunk into a vector embedding, 3. Storing embeddings in a vector store, 4. When a query arrives, finding similar document chunks, 5. Including the retrieved chunks as context in the LLM prompt..."
}

7. The Interactive Playground

The rag-spring-ai project ships with a web-based playground that lets you interact with every demo — including Demo 1: Basic RAG — directly from your browser. No curl commands, no Postman, no separate client. Just open localhost:8080 after starting the application.

Figure: The interactive playground at localhost:8080, a tab-based web UI that wraps every demo's API endpoints. Demo 1 lets you ingest documents, ask free-form questions, and inspect the retrieved context chunks alongside the grounded answer.

What the Playground Offers

Feature	Description
One-click ingestion	Press the Ingest button to load `spring-ai-overview.txt`, split it into chunks, embed, and store — no terminal required.
Free-form question input	Type any question about Spring AI and hit Ask. The playground calls `POST /api/basic/ask` behind the scenes.
Grounded answer display	The LLM response appears immediately, grounded in the retrieved context.
Retrieved context inspection	Below the answer, the playground shows the top-K context chunks that were injected into the prompt — so you can verify exactly what the LLM saw.
Tab-based demo navigation	Switch between Demo 1 (Basic RAG), Demo 2 (Document Ingestion), Demo 3 (Vector Store Operations), and the rest — each demo has its own dedicated tab.
Fully local	No API keys, no cloud accounts. The playground connects to Ollama and PostgreSQL running on your machine via Docker Compose.

Playground Endpoints (Demo 1)

The playground is a thin UI layer over the same REST API covered in section 5.3:

Action	HTTP Method	Endpoint	Request Body
Ingest documents	`POST`	`/api/basic/ingest`	—
Ask a question	`POST`	`/api/basic/ask`	`{"question": "..."}`

Behind the scenes, these are the exact same endpoints that the curl commands in section 6 call. The playground just provides a visual interface on top of them.

How to Launch

# 1. Start infrastructure (if not already running)
docker compose up -d

# 2. Start the Spring Boot application
./mvnw spring-boot:run

# 3. Open the playground
open http://localhost:8080   # macOS; on Linux use xdg-open, or paste the URL into a browser

The playground is served by the same Spring Boot application — no separate frontend build or npm install. It loads instantly and works in any modern browser.

8. Understanding the Key Spring AI Abstractions

Let’s zoom into the Spring AI components that make this work.

ChatClient

ChatClient is the main entry point for interacting with LLMs in Spring AI. It provides a fluent, builder-style API:

ChatClient chatClient = ChatClient.builder(chatModel)
        .defaultSystem("You are an expert on Spring AI.")
        .defaultAdvisors(new SimpleLoggerAdvisor())
        .build();

String response = chatClient.prompt()
        .user("What is Spring AI?")
        .call()
        .content();

Think of ChatClient as the equivalent of RestTemplate or WebClient, but for LLM calls. It handles prompt construction, model invocation, and response extraction.

VectorStore

VectorStore is Spring AI’s abstraction over vector databases. The PgVectorStore implementation uses PostgreSQL with the pgvector extension:

// Store documents (embedding is computed automatically)
vectorStore.add(chunks);

// Search for similar documents
List<Document> results = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query("What is RAG?")
                .topK(4)
                .build()
);

When you call vectorStore.add(chunks), Spring AI:

Sends each chunk’s text to the EmbeddingModel (nomic-embed-text).
Receives back a float[768] vector.
Inserts the text, metadata, and vector into the vector_store table.

QuestionAnswerAdvisor

This is the component that ties retrieval and generation together. Advisors in Spring AI are middleware-like interceptors that modify the prompt before it reaches the LLM:

chatClient.prompt()
    .advisors(QuestionAnswerAdvisor.builder(vectorStore).build())
    .user(question)
    .call()
    .content();

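QuestionAnswerAdvisor intercepts the chat request, performs a similarity search on the vector store, and injects the retrieved documents into the prompt. Conceptually, the LLM then sees something like the following (the exact wording and ordering come from the advisor’s prompt template):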

[System Prompt]
You are an expert on Spring AI. Answer questions using the provided context.

[Context - retrieved from vector store]
Spring AI is a framework that brings the power of AI to Spring applications...
The RAG pattern works by: 1. Ingesting documents and splitting them into chunks...

[User Question]
What is RAG and how does it work?
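The "middleware-like interceptor" idea can be illustrated without any framework. The sketch below is a deliberately simplified model, not Spring AI's advisor API: each hypothetical advisor is just a function from prompt to prompt, applied in order before the model is called. Real advisors can also act on the response, which this sketch leaves out.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class AdvisorChainSketch {

    /** Applies each advisor to the prompt in order, then calls the model. */
    static String call(String prompt,
                       List<UnaryOperator<String>> advisors,
                       Function<String, String> model) {
        String current = prompt;
        for (UnaryOperator<String> advisor : advisors) {
            current = advisor.apply(current);
        }
        return model.apply(current);
    }

    /** Retrieval-style advisor: appends a context block to the prompt. */
    static UnaryOperator<String> withContext(String context) {
        return prompt -> prompt + "\n\nContext:\n" + context;
    }

    /** Logging-style advisor: records the prompt it sees and passes it on unchanged. */
    static UnaryOperator<String> recording(List<String> log) {
        return prompt -> {
            log.add(prompt);
            return prompt;
        };
    }
}
```

Order matters in a chain like this: a logging advisor placed after the retrieval advisor sees the augmented prompt, while one placed before it sees only the raw question.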

TokenTextSplitter

Documents need to be split into chunks that fit within the embedding model’s context window. TokenTextSplitter handles this:

var splitter = new TokenTextSplitter();  // defaults: 800-token chunks, 350-char minimum chunk size
List<Document> chunks = splitter.apply(documents);

Why chunking matters:

Embedding models have a maximum input length. Text beyond it is typically truncated or rejected, so a single embedding cannot represent an oversized document.
Smaller, focused chunks produce better similarity matches than one giant embedding of the entire document.
TokenTextSplitter does not overlap chunks. Instead it prefers to end each chunk at sentence punctuation (once the chunk exceeds the 350-character minimum), so sentences are rarely cut in half. If you need overlapping windows, you have to add them yourself.
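To see what chunking does mechanically, here is a small sliding-window chunker in plain Java. It is a different, simpler technique than TokenTextSplitter's algorithm: whitespace-separated words stand in for model tokens, and the size and overlap parameters are this sketch's own. With overlap set to 0 the chunks sit back to back; with a positive overlap, consecutive chunks share that many words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SlidingWindowChunker {

    /**
     * Splits text into chunks of at most `size` words, where consecutive
     * chunks share `overlap` words. Words approximate tokens here.
     */
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        String[] words = trimmed.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + size, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Window of 4 words, 1 shared word between neighbours.
        System.out.println(chunk("a b c d e f g", 4, 1)); // prints [a b c d, d e f g]
    }
}
```

The trade-off is storage and retrieval noise: a larger overlap makes boundary-spanning concepts easier to find, but stores more duplicated text and can return near-identical chunks in the top-K results.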

9. What Happens Under the Hood — End to End

Let’s trace a complete request through the system to solidify the mental model.

1. Client sends a question:

POST /api/basic/ask
{"question": "What vector databases can I use with Spring AI?"}

2. Controller receives the request, validates the @NotBlank constraint, and calls ragService.ask(question).

3. Service calls ingestDocuments(). The AtomicBoolean guard returns immediately because documents were already ingested.

4. Service builds the ChatClient prompt with QuestionAnswerAdvisor.

5. QuestionAnswerAdvisor intercepts the request:

Sends the question to nomic-embed-text → gets a float[768] vector.
Queries PgVectorStore for the 4 most similar chunks (cosine distance).
PostgreSQL uses the HNSW index for fast approximate search.
Returns chunks like: “Supported Vector Stores: SimpleVectorStore, PgVector, Chroma, Milvus, Pinecone…”

6. Augmented prompt is built:

System: “You are an expert on Spring AI.”
Context: (the retrieved chunks)
User: “What vector databases can I use with Spring AI?”

7. LLM (qwen3:4b via Ollama) generates a response grounded in the context.

8. Response flows back through the advisor → service → controller → client as JSON.

10. Key Takeaways

RAG is the simplest way to give an LLM access to your data. No model training, no GPU infrastructure — just retrieve relevant documents and add them to the prompt.
Spring AI makes RAG trivially easy. The QuestionAnswerAdvisor handles embedding, retrieval, and prompt augmentation in a single line of code.
The ingestion pipeline is straightforward: Load → Split → Embed → Store. Spring AI provides TextReader, TokenTextSplitter, and VectorStore abstractions for each step.
Everything runs locally. Ollama serves the models, PostgreSQL with pgvector stores the vectors, and Docker Compose orchestrates it all. Zero cloud dependencies.
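The vector store is persistent. Once documents are ingested, they survive application restarts. The in-memory AtomicBoolean guard does not, however, so as written the first request after a restart ingests the sample document again; a production setup would check the store (or record ingestion in the database) before re-adding.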
Test without infrastructure. Stub the ChatModel, mock the VectorStore, and run pure unit tests — no Docker required.
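The last takeaway, testing without infrastructure, comes down to programming against small seams that can be stubbed. The sketch below shows the shape of such a test in plain Java. The Retriever and Chat interfaces and the RagService record are hypothetical stand-ins for VectorStore, ChatModel, and the demo's service, not Spring AI types.

```java
import java.util.ArrayList;
import java.util.List;

final class StubbedRagSketch {

    /** Hypothetical seam standing in for the vector store. */
    interface Retriever {
        List<String> search(String query, int topK);
    }

    /** Hypothetical seam standing in for the chat model. */
    interface Chat {
        String complete(String prompt);
    }

    /** Tiny service under test: retrieve, build a prompt, call the model. */
    record RagService(Retriever retriever, Chat chat) {
        String ask(String question) {
            List<String> chunks = retriever.search(question, 4);
            String prompt = "Context:\n" + String.join("\n", chunks)
                    + "\n\nQuestion: " + question;
            return chat.complete(prompt);
        }
    }

    public static void main(String[] args) {
        List<String> seenPrompts = new ArrayList<>();
        // Stubs: fixed retrieval result, canned answer, and a captured prompt.
        RagService service = new RagService(
                (query, topK) -> List.of("Spring AI supports PgVector."),
                prompt -> {
                    seenPrompts.add(prompt);
                    return "stub answer";
                });
        System.out.println(service.ask("Which vector stores are supported?"));
        System.out.println(seenPrompts.get(0).contains("Spring AI supports PgVector."));
    }
}
```

Because both collaborators are lambdas, the test can assert on the exact prompt the model would have received, which is usually the part of a RAG pipeline most worth pinning down.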

Series Roadmap

This post covers the foundation — a minimal, end-to-end RAG pipeline. Upcoming posts in this series will layer in progressively more advanced patterns, each mapping to a demo in the rag-spring-ai project:

Post	Topic	What it adds
→ You are here	Basic RAG	End-to-end retrieval pipeline with `QuestionAnswerAdvisor`
Coming next	Document Ingestion	Multi-format loading (JSON, PDF, DOCX), custom chunk sizes, metadata enrichment
	Vector Store Operations	Direct similarity search, threshold tuning, embedding inspection
	Chat with Memory	Conversational RAG with per-session history and context carryover
	Advisors	Composing RAG + memory + safety advisors in a pipeline
	Structured Output	Extracting typed Java records from LLM responses
	Function Calling	Letting the LLM invoke Java methods as tools
	Multi-Document RAG	Multiple document collections with smart routing
	Metadata Filtering	Scoping vector search with metadata filters

Each post follows the same pattern as this one: a focused walkthrough of a single demo, with diagrams, full code listings, and a runnable example you can try locally. Links will go live as posts are published — follow the RAG with Spring AI series page to see them all in one place.

Source code: github.com/gdunhao/rag-spring-ai — clone it, run make setup && make run, and open localhost:8080 for the interactive playground.