Python RAG API Tutorial with LangChain & FastAPI – Complete Guide
A hands-on guide to building a Retrieval-Augmented Generation (RAG) API using Python, LangChain, FastAPI, and pgvector — complete with architecture diagrams and code.
Introduction
Original article on my website: Python RAG API Tutorial with LangChain & FastAPI – Complete Guide
During the last few months I have been observing new releases in the AI sector and new startups that are using AI. I was curious: what are they doing? How are they building these AI things? While I have some experience with building AI applications, I feel it's not enough and I want to know more about building AI apps. That's why with this blog post I'm starting a new journey in my life - blogging about software engineering.
In this blog post I will explain how to build an AI-powered application to chat with uploaded PDF files. It will use these techniques and frameworks:
Retrieval Augmented Generation (RAG)
LangChain to build RAG and communicate with OpenAI
FastAPI to build API
Python 😊
High Level Architecture
pdf-analyzer - the service that analyzes PDF documents and retrieves answers to user questions from those documents
1. User sends a question to the pdf-analyzer service
2. The pdf-analyzer service gets the documents related to the user question from the Postgres database
3. The pdf-analyzer service sends the user question and the documents retrieved in step 2 to the OpenAI API to get an answer to the user question
Before we jump into the implementation details, let's understand why this architecture is called "retrieval augmented generation".
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a pattern in AI applications in which, to answer a user request, the application supplies the LLM with information related to that request. This makes the LLM's answer "smarter" because the LLM gets more context about the problem it should solve.
The RAG process is best depicted by this diagram:
1. User sends a request to the AI application
2. The AI application retrieves related information from external storage
3. The AI application augments the original user request with the retrieved information and sends it to the LLM to generate an answer
This approach results in much better LLM responses than directly sending a document with many pages to the LLM and asking for an answer.
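To make this flow concrete, here is a minimal sketch in plain Python. It is not the article's code: retrieve_documents and generate are hypothetical callables standing in for the vector search and the LLM call.

from typing import Callable

def answer_question(
    question: str,
    retrieve_documents: Callable[[str], list[str]],  # hypothetical: vector search over external storage
    generate: Callable[[str], str],                  # hypothetical: call to an LLM
) -> str:
    # 1. Retrieve: find text chunks related to the user question
    chunks = retrieve_documents(question)
    # 2. Augment: combine the retrieved chunks with the original question into one prompt
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 3. Generate: ask the LLM to produce the final answer
    return generate(prompt)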
Use Cases of RAG
The main use case of the RAG pattern is analyzing information when the amount of information exceeds the LLM context window. While modern LLMs have huge context sizes, the RAG pattern can still be a benefit, because if the LLM context is filled more than 50%, the chances of hallucinations are very high. So to get the best responses from the LLM, you need to keep context usage minimal.
Use Cases of RAG in the real world
In the real world RAG can be used in these applications:
AI Chat with company documentation
Customer Support AI Bot
Frequent retrieval of information from unstructured data
An intermediate step in a more complex flow
That's it for the theory, let's jump to the implementation part 😎
Implementation
User Flows
Upload PDF document
1. User uploads a PDF document to the pdf-analyzer service.
2. The pdf-analyzer service reads the PDF into text and splits the text into chunks to increase the accuracy of data retrieval.
3. The pdf-analyzer service uses the OpenAI API to convert each text chunk into a numeric vector that represents it. Later we will use these vectors to perform a search in the database by using math.
4. The pdf-analyzer service saves the vectors in the storage. At this step we store both the numeric vectors and the text itself, so that later we can use math to find the text chunks most relevant to a user question (a rough sketch of these steps follows below).
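As a rough sketch of steps 2 to 4, assuming LangChain's RecursiveCharacterTextSplitter, OpenAIEmbeddings, and the PGVector store from the langchain_postgres package (the chunk sizes, collection name, and connection string below are illustrative, not the article's exact configuration):

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

pdf_text = "...text extracted from the uploaded PDF..."  # placeholder

# Split the extracted text into overlapping chunks to improve retrieval accuracy
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([pdf_text])

# The vector store embeds each chunk via the OpenAI API and stores vector + text in Postgres
vector_store = PGVector(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="pdf_chunks",
    connection="postgresql+psycopg://root:root@localhost:5432/pdf-analyzer",
)
vector_store.add_documents(chunks)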
Chat with uploaded PDF document
1. User sends a question via the API to the pdf-analyzer service.
2. The pdf-analyzer service converts the user question into a numeric vector using the OpenAI API.
3. The pdf-analyzer service finds the vectors closest to the user question's vector in the storage (a toy illustration of this vector math follows below).
4. The pdf-analyzer service sends the user question, the retrieved documents, and a system prompt to the OpenAI API to get the most accurate answer.
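The "math" in step 3 is vector similarity: the question's embedding is compared against the stored chunk embeddings, and the chunks whose vectors point in the closest direction win. A toy illustration with cosine similarity (the vectors below are made up; real OpenAI embeddings such as text-embedding-3-small have 1536 dimensions):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); values closer to 1.0 mean more similar
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

question_vector = [0.9, 0.1, 0.0]  # made-up embedding of the user question
chunk_vectors = {
    "chunk about pricing": [0.8, 0.2, 0.1],  # similar direction -> high score
    "chunk about weather": [0.0, 0.1, 0.9],  # different direction -> low score
}

for name, vector in chunk_vectors.items():
    print(name, round(cosine_similarity(question_vector, vector), 3))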
Technology Decisions
Knowing the user flows above, we can decide which technologies to use to build this application.
LangChain Framework - the best framework for building AI systems, covering a lot of use cases
Python - LangChain's original language is Python, so we will go with it
FastAPI - a modern and convenient framework for building APIs in Python that can handle high load
Postgres - a mature database with support for vector storage via the pgvector extension
Service Architecture
The pdf-analyzer service will use a classical layered architecture:
Routes (files and chats) will handle HTTP requests and use services to execute business logic
Services (document service and ai service) will execute business logic and integrate with Postgres and the OpenAI API
This architectural approach makes it possible to satisfy the single responsibility principle and keep the system simple.
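To illustrate how the Routes layer stays thin, here is a hedged sketch of what a files route could look like. It is not the repository's actual route code: the path, the response shape, the File construction, and the get_session / get_document_service dependencies are assumptions.

from fastapi import APIRouter, Depends, UploadFile
from sqlmodel import Session

from pdf_analyzer.models import File
from pdf_analyzer.services.document import DocumentService

router = APIRouter(prefix="/files")

def get_session() -> Session:  # hypothetical dependency; the real app wires this to the DB engine
    ...

def get_document_service() -> DocumentService:  # hypothetical dependency
    ...

@router.post("/upload/")
async def upload_file(
    file: UploadFile,
    session: Session = Depends(get_session),
    document_svc: DocumentService = Depends(get_document_service),
):
    # The route only parses the HTTP request and delegates business logic to the service layer
    content = await file.read()
    saved = await document_svc.save(session, File(name=file.filename, content=content))
    return {"id": saved.id, "name": saved.name}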
The whole source code for this article is available on GitHub. To keep the article simple, I will include only the code that highlights the most important concepts of the RAG API.
Implementation
Document Service
DocumentService - the service responsible for saving and reading documents.
import tempfile
from langchain_core.vectorstores import VectorStore
from langchain_core.documents import Document
from langchain_text_splitters.base import TextSplitter
from pdf_analyzer.models import File
from dataclasses import dataclass
from sqlmodel import Session
from langchain_community.document_loaders import PyPDFLoader
from pdf_analyzer.repositories.file import FileRepository
from uuid import UUID


@dataclass
class DocumentService:
    vector_store: VectorStore
    text_splitter: TextSplitter
    file_repository: FileRepository

    async def save(self, session: Session, file: File) -> File:
        # 1. Save file to the database
        file = self.file_repository.create_file(session, file)
        # 2. Convert file to a list of LangChain documents
        documents = self.__convert_to_documents(file)
        # 3. Split the list of LangChain documents into smaller documents to improve accuracy of RAG
        all_splits = self.text_splitter.split_documents(documents)
        # 4. Add metadata to each chunk to allow communicating with a specific file
        self.__add_metadata(all_splits, file)
        # 5. Save documents in the vector store
        await self.vector_store.aadd_documents(all_splits)
        return file

    async def search(self, text: str, file_ids: list[UUID] = []) -> list[Document]:
        documents_filter = None
        if file_ids:
            documents_filter = {
                "file_id": {"$in": [str(file_id) for file_id in file_ids]}
            }
        return await self.vector_store.asimilarity_search(text, filter=documents_filter)

    def __add_metadata(self, documents: list[Document], file: File):
        for doc in documents:
            doc.metadata["file_name"] = file.name
            doc.metadata["file_id"] = str(file.id)

    def __convert_to_documents(self, file: File) -> list[Document]:
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=True) as tmp_file:
            tmp_file.write(file.content)
            tmp_file.flush()
            loader = PyPDFLoader(tmp_file.name)
            return loader.load()

The most interesting part of the system is this DocumentService, which saves a file in the database by following these steps:
Save file to the database
Convert file to a list of LangChain documents
Split the list of LangChain documents into smaller documents to improve the accuracy of RAG
Add metadata to each chunk to allow communicating with a specific file
Save documents in the vector store
Step 4 is particularly important because, in the end, our user wants to communicate with specific files and not with all files in the system. That's why we add the file_id metadata tag in the __add_metadata method.
User 1 uploads file 1 and the __add_metadata method specifies file_id: 123 for it
User 2 uploads file 2 and the __add_metadata method specifies file_id: 456 for it
When users search for relevant content in files, they pass file_id tags, which are used to narrow the search to specific files, as done in the search method.
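For example, a search scoped to a single uploaded file could look like the sketch below (the question and UUID are made up, and the construction of the DocumentService with its vector store, text splitter, and repository is omitted):

from uuid import UUID

from pdf_analyzer.services.document import DocumentService

async def search_in_one_file(document_service: DocumentService) -> None:
    docs = await document_service.search(
        "What does the report say about platform engineering?",
        file_ids=[UUID("123e4567-e89b-12d3-a456-426614174000")],  # made-up file id
    )
    for doc in docs:
        # only chunks tagged with this file_id can be returned
        print(doc.metadata["file_name"], doc.page_content[:80])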
AI Service
AIService - the service responsible for the integration with the OpenAI LLM API.
from langchain_core.language_models import BaseChatModel
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("system", "{data}"),
        ("human", "{text}"),
    ]
)


class Output(BaseModel):
    answer: str | None = Field(
        default=None,
        description="Answer on the question",
    )


class AIService:
    def __init__(self, llm: BaseChatModel):
        self.llm = llm
        self.structured_llm = llm.with_structured_output(schema=Output)

    def retrieve_answer(self, question: str, docs: list[Document]) -> str | None:
        data = "\n\n".join(doc.page_content for doc in docs)
        prompt = prompt_template.invoke({"text": question, "data": data})
        llm_result = self.structured_llm.invoke(prompt)
        return Output.model_validate(llm_result).answer if llm_result else None

The retrieval of an answer from a document looks like this:
1. The list of LangChain documents is joined together into a single string
2. The LangChain prompt template substitutes the template variables and generates the final prompt
3. The LangChain LLM class generates a structured Output response by sending the prompt to OpenAI
4. The LLM response is validated to be a valid Pydantic Output model
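A usage sketch of the service, assuming the ChatOpenAI model from the langchain_openai package (the model name and document contents below are illustrative):

from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

from pdf_analyzer.services.ai import AIService

ai_service = AIService(llm=ChatOpenAI(model="gpt-4o-mini"))

docs = [
    Document(page_content="The 2024 report recommends adopting platform engineering teams."),
    Document(page_content="It also advises assessing retrieval-augmented generation."),
]
answer = ai_service.retrieve_answer("What does the report recommend adopting?", docs)
print(answer)  # e.g. "Platform engineering teams" - the structured Output.answer field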
ChatService
ChatService - the service responsible for the user's conversation with the LLM and for augmenting user requests to the LLM.
from dataclasses import dataclass
from pdf_analyzer.schemas import ChatCreate
from pdf_analyzer.repositories import ChatRepository, MessageRepository
from pdf_analyzer.models import Chat, Message, SenderType
from sqlmodel import Session, select
from pdf_analyzer.schemas import MessageCreate
from pdf_analyzer.services.ai import AIService
from pdf_analyzer.services.document import DocumentService
from uuid import UUID
from typing import Sequence


@dataclass
class ChatService:
    chat_repository: ChatRepository
    message_repository: MessageRepository
    ai_svc: AIService
    document_svc: DocumentService

    def create_chat(self, session: Session, chat_create: ChatCreate):
        chat = Chat(name="New Chat", files=[])
        return self.chat_repository.create(session, chat, chat_create.file_ids)

    def find_all_chats(self, session: Session):
        return self.chat_repository.find_all(session)

    def get_chat(self, session: Session, chat_id: UUID):
        chat = session.exec(select(Chat).where(Chat.id == chat_id)).one_or_none()
        if not chat:
            raise ValueError(f"Chat with ID {chat_id} does not exist.")
        return chat

    async def send_message(
        self, session: Session, chat_id: UUID, message_create: MessageCreate
    ):
        human_message = Message(
            content=message_create.content,
            chat_id=chat_id,
            sender_type=SenderType.HUMAN,
        )
        chat = self.get_chat(session, chat_id)
        docs = await self.document_svc.search(
            human_message.content, [file.id for file in chat.files]
        )
        answer = self.ai_svc.retrieve_answer(
            human_message.content,
            docs,
        )
        if not answer:
            answer = "N/A"
        ai_message = Message(content=answer, chat_id=chat_id, sender_type=SenderType.AI)
        self.message_repository.save_messages(session, human_message, ai_message)
        return ai_message

    def find_messages(self, session: Session, chat_id: UUID) -> Sequence[Message]:
        return self.message_repository.find_by_chat_id(session, chat_id)

The most interesting method is send_message, which does the following:
1. Gets the chat by chat id
2. Gets the documents related to the chat and the user question
3. Sends a request to the LLM with the user question and the retrieved documents
4. Saves the user message and the AI response
5. Returns the response to the user
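A usage sketch of send_message (the chat id is made up, the construction of the ChatService with its repositories and session is omitted, and the import path for ChatService is assumed by analogy with the other services):

from uuid import UUID

from sqlmodel import Session

from pdf_analyzer.schemas import MessageCreate
from pdf_analyzer.services.chat import ChatService

async def ask_question(chat_service: ChatService, session: Session) -> None:
    ai_message = await chat_service.send_message(
        session,
        UUID("123e4567-e89b-12d3-a456-426614174000"),  # made-up chat id
        MessageCreate(content="Which techniques does the radar recommend to adopt?"),
    )
    print(ai_message.content)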
Testing
0. Install dependencies
To run this project, Poetry must be installed on your system.
poetry install - installs dependencies
poetry shell - activates the project's virtualenv in this shell
1. Create .env file
Let's test this API by hand to see how it works. The code is available on GitHub, so you can clone the repository and run the code locally. You need to create a .env file with the following variables:
PDF_ANALYZER_OPENAI_API_KEY - OpenAI API key.
PDF_ANALYZER_DB_URL - Postgres connection string. Specify postgresql://root:root@localhost:5432/pdf-analyzer if you run Postgres from the docker-compose.yaml file.
2. Launch docker-compose.yaml
docker compose up -d - this will start Postgres with the vector extension configured in a Docker container.
3. Launch FastAPI server
Run this command to start FastAPI:
fastapi dev src/pdf_analyzer/main.py
4. Upload a file
Open http://127.0.0.1:8000/docs#/files/upload_file_files_upload__post and upload any file. I will upload the Technology Radar PDF in my example.
5. Create a chat
Open http://127.0.0.1:8000/docs#/chats/create_chat_chats__post and create a chat using the file id received in the response after uploading the file.
6. Send a message
Open http://127.0.0.1:8000/docs#/chats/send_message_chats__chat_id__message_post and send a message to the chat to communicate with the uploaded file.
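If you prefer to script these steps instead of clicking through the Swagger UI, a rough equivalent with httpx might look like the sketch below. The paths are derived from the docs URLs above, but the exact request and response field names are assumptions, so check the generated OpenAPI schema before relying on them.

import httpx

BASE_URL = "http://127.0.0.1:8000"

with httpx.Client(base_url=BASE_URL, timeout=60) as client:
    # 1. Upload a PDF file
    with open("technology-radar.pdf", "rb") as pdf:
        upload = client.post("/files/upload/", files={"file": pdf})
    file_id = upload.json()["id"]  # assumed response field

    # 2. Create a chat bound to the uploaded file
    chat = client.post("/chats/", json={"file_ids": [file_id]})  # body fields assumed
    chat_id = chat.json()["id"]

    # 3. Send a message to the chat
    answer = client.post(f"/chats/{chat_id}/message", json={"content": "Which techniques should we adopt?"})
    print(answer.json())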
Here is the response:
Conclusions
In this article I showed how to build a RAG API in Python with LangChain and FastAPI. The source code is available on GitHub. This RAG technique looks useful, and I will look into integrating it with some real-world applications.
Original article on my website: Python RAG API Tutorial with LangChain & FastAPI – Complete Guide