PDF to Markdown: Why Structured Output Matters for RAG Pipelines
When building retrieval-augmented generation (RAG) systems that ingest PDF documents, the output format of your extraction step matters more than most developers realize. Raw text extraction is fast and cheap — but it loses the structural signals that make documents comprehensible to language models.
Markdown strikes the right balance: it preserves document hierarchy while remaining token-efficient and LLM-native.
What gets lost in raw text extraction
Consider a typical PDF document — a research paper, a legal contract, or a financial report. These documents have rich structure: section headings, bullet lists, numbered procedures, tables, and logical hierarchy. That structure carries meaning.
When you extract a PDF to raw text, you get a flat string:
Introduction This paper presents a novel approach to... 1. Method We evaluated three approaches: first approach second approach third approach Results The accuracy improved by 23%...
The heading "Introduction" is indistinguishable from body text. The numbered list loses its identity. The structure that made the document scannable and comprehensible is gone.
Why this matters for RAG
RAG systems work by splitting documents into chunks, embedding them, and retrieving relevant chunks at query time. The quality of this process depends heavily on chunk quality — and chunk quality depends on the structure of the input text.
Chunking on structure boundaries
When your extracted text preserves Markdown headings, chunking algorithms can split on semantically meaningful boundaries — section breaks, heading levels — rather than arbitrary character counts. A chunk that starts at ## Methodology and ends before ## Results is far more coherent than a 512-token window that bisects a paragraph.
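The idea can be sketched in a few lines. This is a minimal illustration of heading-boundary splitting, not a production chunker (a real one would also cap section length and add overlap):

```typescript
// Split Markdown into sections at ## / ### heading boundaries,
// so each chunk starts at a semantically meaningful break.
function splitOnHeadings(markdown: string): string[] {
  const lines = markdown.split("\n");
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of lines) {
    // A new section starts at any level-2 or level-3 heading line
    if (/^#{2,3} /.test(line) && current.length > 0) {
      sections.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n").trim());
  return sections;
}
```

Every chunk produced this way begins at a heading (or the document start), so no section is ever bisected mid-paragraph.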
Heading injection for context
Many RAG frameworks support parent-document retrieval, where retrieved chunks are decorated with their parent heading for context. This only works if headings are explicitly marked in the source text. Markdown's #, ##, ### hierarchy makes this trivial to implement.
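A rough sketch of that decoration step, assuming chunks were cut verbatim from the source Markdown (the helper name and approach are illustrative, not a specific framework's API):

```typescript
// Prefix a chunk with its nearest preceding heading so the retrieved
// chunk carries its section context at query time.
function injectParentHeading(markdown: string, chunk: string): string {
  const start = markdown.indexOf(chunk);
  if (start < 0) return chunk;
  // Find the last heading line that appears before the chunk
  const before = markdown.slice(0, start);
  const headings = before.match(/^#{1,3} .*$/gm);
  const parent = headings ? headings[headings.length - 1] : null;
  // Skip injection if there is no parent or the chunk already has a heading
  if (!parent || /^#{1,3} /.test(chunk)) return chunk;
  return `${parent}\n\n${chunk}`;
}
```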
LLM comprehension
Language models are trained on massive corpora of Markdown-formatted text — documentation, README files, technical writing. They understand Markdown structure natively. Feeding a well-structured Markdown document to an LLM produces better summaries, more accurate citations, and cleaner question-answering than flat text.
Raw text vs. Markdown: a concrete comparison
Raw text extraction
Executive Summary The company achieved record revenue of $4.2B in FY2025 representing 23% growth year over year Key drivers included expansion into three new markets and the successful launch of Product X Revenue breakdown North America 58% EMEA 28% APAC 14%
Markdown extraction
# Executive Summary
The company achieved record revenue of $4.2B in FY2025, representing 23% growth year over year.
## Key drivers
- Expansion into three new markets
- Successful launch of Product X
## Revenue breakdown
- North America: 58%
- EMEA: 28%
- APAC: 14%
The Markdown version is immediately chunkable, heading-aware, and list-structured. An LLM asked "what drove revenue growth?" can trivially locate the "Key drivers" section.
How docpull detects document structure
docpull uses pdfjs-dist — the same PDF rendering engine used in Firefox — to extract text with layout metadata. It uses font-size heuristics to detect headings: text rendered at a significantly larger size than the document average is classified as a heading, with #, ##, or ### assigned based on relative size.
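The heuristic can be sketched roughly like this. Note this is a simplified illustration: the real extractor works on pdfjs-dist layout items, and the 1.5/1.3/1.15 thresholds here are assumptions, not docpull's actual values:

```typescript
interface TextRun {
  text: string;
  fontSize: number;
}

// Classify a text run by how much larger its font is than the
// document's average body size. Thresholds are illustrative.
function classifyHeading(run: TextRun, avgFontSize: number): string {
  const ratio = run.fontSize / avgFontSize;
  if (ratio >= 1.5) return `# ${run.text}`;    // much larger -> level 1
  if (ratio >= 1.3) return `## ${run.text}`;   // larger -> level 2
  if (ratio >= 1.15) return `### ${run.text}`; // slightly larger -> level 3
  return run.text;                             // body text
}
```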
Bullet characters (•, ◦, ▸) are normalized to Markdown - lists. Numbered list patterns are detected and preserved. Multi-page documents include page separators (---) with page number comments.
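Those normalization steps amount to a couple of small text passes. This is an illustrative approximation (the HTML-comment syntax for page numbers is an assumption, not docpull's documented output):

```typescript
// Normalize common PDF bullet glyphs to Markdown "-" list markers.
function normalizeBullets(line: string): string {
  return line.replace(/^\s*[•◦▸]\s*/, "- ");
}

// Join pages with a horizontal rule and a page-number comment.
function joinPages(pages: string[]): string {
  return pages
    .map((page, i) => (i === 0 ? page : `\n---\n<!-- page ${i + 1} -->\n${page}`))
    .join("\n");
}
```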
This approach works well on machine-generated PDFs — reports, documentation, contracts. Scanned documents require OCR, which docpull does not currently handle.
Token efficiency
A common objection to Markdown is overhead — the formatting characters consume tokens. In practice, this overhead is minimal. A document with 100 headings adds roughly 300 tokens of Markdown syntax. The structural benefit — better chunking, better retrieval, better LLM comprehension — far outweighs the marginal token cost.
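The estimate is easy to sanity-check with rough arithmetic. The per-marker token costs below are approximations, not tokenizer measurements:

```typescript
// Rough estimate of Markdown syntax overhead: count structural markers
// and assume ~3 tokens per heading marker (hashes, space, newline) and
// ~1 token per list marker. Real tokenizers vary by model.
function estimateSyntaxTokens(markdown: string): number {
  const headingMarkers = markdown.match(/^#{1,6} /gm) ?? [];
  const listMarkers = markdown.match(/^- /gm) ?? [];
  return headingMarkers.length * 3 + listMarkers.length;
}
```

Under these assumptions, 100 headings cost about 300 tokens, which is negligible next to a document body of tens of thousands of tokens.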
Practical implementation
Using docpull in a RAG pipeline is straightforward. After extraction, pass the Markdown directly to your chunking and embedding pipeline:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Extract PDF to Markdown via docpull
// (fetchWithPayment: the payment-enabled fetch client, assumed to be
// configured elsewhere in your pipeline)
const res = await fetchWithPayment("https://docpull.ai/extract", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: pdfUrl }),
});
const { markdown } = await res.json();
// Chunk on Markdown structure boundaries
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 100,
separators: ["\n## ", "\n### ", "\n---\n", "\n\n", "\n"],
});
const chunks = await splitter.createDocuments([markdown]);
// Chunks split on heading and page boundaries
Using Markdown heading separators in the splitter configuration ensures chunks respect document structure rather than cutting arbitrarily through paragraphs.