
Beyond OCR: Why We Built the VeriRFP Parser on Open-Source Docling

Sarah Jenkins
VeriRFP SecOps

When evaluating an AI platform to automate responses to Requests for Proposal (RFPs) and Security Questionnaires, RevOps and InfoSec teams understandably focus heavily on the underlying Large Language Model (LLM): "Are you using GPT-4o? Claude 3.5 Sonnet?"

While the reasoning engine is critical, there is an equally vital—and chronically overlooked—component of the automation pipeline: the document parser.

If you feed an advanced LLM scrambled, structurally broken text extracted from a complex enterprise PDF, you will get hallucinated, inaccurate answers. Garbage in, garbage out.

In this technical deep dive, we outline why traditional Optical Character Recognition (OCR) and basic text extraction are fundamentally insufficient for modern B2B documentation, and why we architected the VeriRFP ingestion engine around advanced, open-source document parsing frameworks like IBM's Docling.

The Flawed Approach: Legacy OCR and Naive Text Extraction

For decades, the standard approach to extracting text from PDFs has been either naive text-layer extraction (libraries like pypdf) or legacy OCR (Tesseract). These tools were groundbreaking in their era, but enterprise documents have evolved far beyond what they were designed to handle. Modern RFPs routinely combine scanned images, embedded fonts, form fields, and layered formatting in a single file. Legacy tools were built for a world of single-column printed memos, not the sprawling multi-format documents that procurement and compliance teams produce today.

The Problem with Structural Blindness

Enterprise RFPs and SOC 2 reports are not linear novels. They are highly complex, unstructured documents containing:

  1. Multi-column layouts that basic parsers read straight across (combining sentence fragments from column A with column B).
  2. Dense, nested tables containing crucial compliance matrices (e.g., "Is MFA enforced?" -> "Yes, via Okta").
  3. Headers, footers, and watermarks that get injected into the middle of paragraphs, impacting the semantic integrity of the content.
  4. Nested list structures where indentation and numbering convey hierarchical relationships between requirements, sub-requirements, and acceptance criteria.
  5. Mixed content zones where a single page combines narrative text, tables, diagrams, and footnotes that all need to be interpreted in context.

When a basic OCR system attempts to read a page, it treats it as an image and extracts raw text strings. It lacks semantic understanding of the document's structure.

Consider a real-world example: a vendor risk assessment PDF where page 14 contains a two-column layout. The left column lists security control categories (Access Management, Encryption, Incident Response), and the right column details the organization's implementation status. A legacy OCR tool reads straight across the page, producing output like "Access Management AES-256 encryption at rest Encryption Okta SSO with MFA." The original meaning is completely destroyed. A human reading that output would be confused; an LLM attempting to answer "Does the vendor enforce MFA?" from that garbled text will either hallucinate an answer or refuse to respond altogether.
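The failure mode is easy to reproduce in a few lines. The toy sketch below (invented coordinates and text, not our production pipeline) models extracted fragments as (x, y, text) tuples and contrasts straight-across reading with column-aware ordering:

```python
# Toy model of a two-column page: each extracted fragment is an
# (x, y, text) tuple, with y increasing downward as on a PDF canvas.
fragments = [
    (0,   10, "All production data"),   (300, 10, "Multi-factor auth is"),
    (0,   30, "is encrypted at rest"),  (300, 30, "enforced for every"),
    (0,   50, "with AES-256."),         (300, 50, "employee via Okta."),
]

def naive_order(frags):
    """Read straight across the page: sort by row (y), then left to right."""
    return [t for _, _, t in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_aware_order(frags, column_width=150):
    """Bucket fragments into columns by x position, then read each column
    top to bottom before moving right, restoring the logical flow."""
    columns = {}
    for x, y, text in frags:
        columns.setdefault(x // column_width, []).append((y, text))
    ordered = []
    for col in sorted(columns):
        ordered.extend(text for _, text in sorted(columns[col]))
    return ordered

print(" ".join(naive_order(fragments)))         # sentences interleaved
print(" ".join(column_aware_order(fragments)))  # two coherent sentences
```

The naive ordering interleaves the two sentences fragment by fragment; the column-aware ordering recovers each column as a coherent flow. Real layout analysis must also decide when side-by-side regions form a table (read row-wise) rather than text columns (read column-wise).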

If a 50-row compliance table is extracted as a single chaotic paragraph of text, the LLM will struggle to correlate the security control with its corresponding status. This leads to the most dangerous type of AI failure in our industry: silent inaccuracies. The AI confidently states "Yes, disk encryption is enabled" when the source document actually said "Planned for Q3"—because the table cell alignment was lost during extraction.

Why PDF is Inherently Difficult

It is worth understanding why PDF parsing is so challenging in the first place. The PDF specification (ISO 32000) was designed as a presentation format, not a data interchange format. Internally, a PDF does not store text in reading order. It stores individual characters at specific x,y coordinates on a canvas. There is no inherent concept of a "paragraph" or a "table cell" in the PDF specification itself. Reconstructing those structures from raw coordinate data is a genuinely hard computer science problem—one that simple text extraction libraries were never designed to solve.
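To make this concrete, here is a minimal sketch (toy coordinates, hypothetical helper) of the very first step any PDF parser must perform: clustering positioned text runs into lines by baseline proximity, since the file itself records only placements, not sentences:

```python
# A PDF records positioned text runs, not sentences. Toy placements:
# (x, y, text) tuples, emitted in arbitrary order with slightly wobbly
# baselines, as real PDF generators often produce.
placements = [
    (120, 99.6, "rest"), (10, 100.2, "Data"), (60, 100.0, "at"),
    (10, 120.1, "is"),   (40, 119.8, "encrypted"),
]

def reconstruct_lines(placements, y_tolerance=2.0):
    """Cluster runs into lines by baseline proximity, then sort each line
    left to right. This is the bare minimum needed just to recover words
    in order, before any notion of paragraphs or tables exists."""
    lines = []  # list of (baseline_y, [(x, text), ...])
    for x, y, text in sorted(placements, key=lambda p: p[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(t for _, t in sorted(frags)) for _, frags in lines]

print(reconstruct_lines(placements))  # ['Data at rest', 'is encrypted']
```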

This is why two PDFs that look identical to a human reader can produce wildly different extraction results. A PDF generated from Microsoft Word preserves some logical structure in its metadata. A PDF produced by scanning a printed document contains nothing but pixel data. A PDF exported from a government procurement system might use custom font encodings that break standard character mapping entirely. The parser must handle all of these cases reliably.

The Solution: Semantic Document Parsing (The Docling Advantage)

To build a robust, enterprise-grade AI automation platform, the ingestion engine must understand a document exactly as a human reader does. It must recognize that a table is a table, a header is a header, and a multi-column layout flows logically from top to bottom.

This requires advanced structural layout analysis and semantic parsing—a fundamentally different approach from the character-by-character extraction that legacy OCR provides.

IBM's Docling project represents a significant leap forward in this space. Released under the MIT license, Docling combines specialized vision models for layout analysis and table structure recognition with rule-based heuristics. It treats each page not as a flat image to be scanned, but as a structured composition of discrete semantic elements.

How VeriRFP Automates Ingestion

We bypassed legacy OCR tools and built our ingestion pipeline utilizing advanced, MIT-licensed open-source models, drawing significant architectural inspiration from vision-centric document parsers like Docling.

Here is what happens when you upload a 100-page secure architecture whitepaper into VeriRFP:

  1. Layout Analysis (Vision Models): Instead of just reading text, the platform runs a specialized computer vision model over every page to identify the bounding boxes of different structural elements. It categorizes regions as Titles, Paragraphs, Lists, Tables, or Images. These models are trained on hundreds of thousands of annotated document pages spanning corporate filings, government RFPs, technical specifications, and compliance reports, giving them broad coverage of the formatting patterns encountered in enterprise documentation.
  2. Reading Order Restoration: The engine reconstructs the logical reading order. If the page contains three columns, it extracts the text column by column, not straight across. This step also handles more subtle challenges like sidebar callouts, pull quotes, and marginal annotations that should be associated with adjacent paragraphs rather than interrupting the main text flow.
  3. Table Structure Recognition (TSR): This is the most crucial step for security and compliance teams. The parser identifies the grid structure of tables, accurately mapping rows and columns so that tabular data remains well aligned when converted into structured formats (like Markdown or JSON). The TSR model handles merged cells, spanning headers, and tables that break across multiple pages—all common patterns in compliance matrices and vendor questionnaires that trip up basic extraction tools.
  4. Semantic Markdown Generation: The final output is not a raw text file; it is a cleanly structured Markdown document. Headings are rendered as Markdown headings (e.g., ##), bolded terms are preserved, and tables are converted into true Markdown tables. This structured output serves as the foundation for every downstream AI operation, from chunking and embedding to retrieval and answer generation.
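As an illustration of the final step, a hypothetical `table_to_markdown` helper shows the kind of output a semantic parser emits once table structure recognition has recovered the grid (simplified; a real renderer must also handle escaping, merged cells, and tables that span pages):

```python
def table_to_markdown(header, rows):
    """Render a recovered table grid as a Markdown table so downstream
    chunking and retrieval see aligned rows, not a jumbled paragraph."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

compliance = table_to_markdown(
    ["Control", "Status"],
    [["Is MFA enforced?", "Yes, via Okta"],
     ["Disk encryption", "Planned for Q3"]],
)
print(compliance)
```

Because each control stays aligned with its status cell, an LLM reading this output cannot confuse "Planned for Q3" with an implemented control.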

Handling Edge Cases at Scale

Real-world enterprise documents are full of edge cases that can silently degrade extraction quality. Watermarks overlaid on text. Rotated pages in the middle of a document. Form fields with embedded values. Redacted sections. Digital signatures that alter the rendering layer. Headers and footers that repeat on every page and must be stripped to avoid polluting the extracted content.

Our ingestion pipeline handles each of these cases through a combination of pre-processing heuristics and model-based classification. Watermarks are detected and excluded from the text layer. Page rotation is normalized before layout analysis. Repeated headers and footers are identified through cross-page pattern matching and removed from the final output. The result is clean, semantically faithful text that represents only the meaningful content of the document.
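The cross-page pattern matching mentioned above can be sketched simply. This toy version (exact-match only; a real pipeline would also normalize page numbers and dates before comparing) flags any line that appears at the top or bottom of most pages and strips it everywhere:

```python
from collections import Counter

def strip_repeated_furniture(pages, min_fraction=0.8):
    """Drop lines that recur at the top or bottom of most pages
    (running headers and footers), keeping genuine content intact."""
    edge_lines = Counter()
    for page in pages:
        for line in {page[0], page[-1]}:  # first and last line of each page
            edge_lines[line] += 1
    threshold = min_fraction * len(pages)
    furniture = {line for line, n in edge_lines.items() if n >= threshold}
    return [[ln for ln in page if ln not in furniture] for page in pages]

pages = [
    ["ACME RFP Response", "MFA is enforced via Okta SSO.", "Confidential"],
    ["ACME RFP Response", "Data is encrypted with AES-256.", "Confidential"],
    ["ACME RFP Response", "Backups run nightly.", "Confidential"],
]
print(strip_repeated_furniture(pages))
```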

The Importance of High-Fidelity Ingestion for RAG

Why does this deeply technical parsing matter to a Sales Engineer answering a 300-question vendor risk assessment?

Because VeriRFP utilizes a Retrieval-Augmented Generation (RAG) architecture. When the AI searches your knowledge base to answer a question, it depends on small, semantically coherent chunks of the document being embedded accurately.

If the parser scrambles a compliance table, the resulting vector embedding is meaningless, and the LLM cannot retrieve the correct answer.
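A toy retrieval loop makes that dependency visible. The bag-of-words vectors below are a crude stand-in for learned embeddings, but the mechanics are the same: the chunk whose vector is most similar to the query wins:

```python
import math
from collections import Counter

def vectorize(text):
    """Crude bag-of-words vector; real systems use learned embeddings."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks):
    """Return the chunk most similar to the query -- the core move of
    any RAG retrieval step, here with toy vectors."""
    q = vectorize(query)
    return max(chunks, key=lambda c: cosine(q, vectorize(c)))

chunks = [
    "| Is MFA enforced? | Yes, via Okta |",
    "| Disk encryption | Planned for Q3 |",
]
print(retrieve("Is MFA enforced for employees?", chunks))
```

With a cleanly extracted table row as a chunk, the MFA question retrieves the MFA row; if that row had been shredded across several chunks, no single vector would match the question well.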

How Chunking Depends on Structure

RAG systems do not feed an entire 100-page document to the LLM at once. They split the document into smaller chunks—typically 500 to 1,500 tokens each—and convert those chunks into vector embeddings stored in a database. When a user asks a question, the system finds the most semantically similar chunks and provides them as context to the LLM.
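A simplified chunker (a character budget standing in for a token count; a hypothetical function, not our exact implementation) shows how structure drives the split points: headings open new sections, and a table stays attached to the heading that gives it meaning:

```python
def chunk_markdown(md, max_chars=400):
    """Split parsed Markdown into retrieval chunks at heading boundaries,
    so a compliance table stays in one chunk with its section heading.
    The character budget stands in for a real token count."""
    sections, current = [], []
    for line in md.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    # Greedily merge small sections up to the budget so chunks fill,
    # but never exceed, the embedding window.
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = buf + "\n" + sec if buf else sec
    if buf:
        chunks.append(buf)
    return chunks

md = (
    "## Encryption\n"
    "| Control | Status |\n"
    "| --- | --- |\n"
    "| Disk encryption | Planned for Q3 |\n"
    "## Access\n"
    "MFA is enforced via Okta SSO."
)
for chunk in chunk_markdown(md, max_chars=50):
    print(chunk, "\n---")
```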

The quality of those chunks is entirely dependent on the quality of the parsed output. If the parser correctly identifies that rows 12 through 18 of a compliance table relate to encryption standards, that table segment becomes a self-contained, semantically meaningful chunk. The vector embedding accurately captures "encryption policy details," and the retrieval system can surface it when someone asks "What encryption standards does the organization follow?"

Now consider what happens when a legacy parser extracts that same table as a jumbled paragraph. The chunk might contain fragments of three different table rows interleaved with a page header. The resulting vector embedding is a semantic blur that matches poorly against any specific question. The retrieval system either returns irrelevant chunks or fails to return the correct one. Either outcome produces a wrong answer.

Accuracy Metrics That Matter

In internal benchmarking against standard PDF extraction libraries, the advanced parsing approach used by VeriRFP consistently demonstrates measurably higher fidelity on the document types that matter most to our users. Table extraction accuracy—measured by cell-level content alignment—improves dramatically compared to naive extraction. Reading order accuracy on multi-column documents approaches near-perfect levels. These are not abstract benchmarks; they translate directly into fewer hallucinated answers and higher confidence scores on every RFP response the platform generates.

Why Open Source Matters for Enterprise Trust

There is a deliberate reason we built on MIT-licensed, open-source parsing models rather than proprietary black-box document processing APIs.

Enterprise buyers—particularly those in regulated industries like financial services, healthcare, and government contracting—need to understand and audit the tools processing their sensitive documents. When your SOC 2 report or internal security architecture document is being parsed, you deserve to know exactly how that parsing works. Open-source foundations provide that transparency.

Building on Docling also means our parsing layer benefits from contributions by the broader research community. As IBM and independent contributors improve the underlying vision models and table recognition algorithms, those improvements flow into our pipeline. This is a fundamentally different value proposition from being locked into a single vendor's proprietary extraction API, where improvements happen on their timeline and their priorities.

By utilizing advanced, layout-aware parsing models, VeriRFP ensures that the semantic meaning of your most complex technical documents is faithfully preserved. When the AI drafts an answer about your data encryption standards, it is pulling from a clean, high-fidelity representation of your documentation.

Are you struggling with inaccurate AI responses because your current tool can't read your complex technical formatting? Let us show you the difference. Schedule a technical demo of the VeriRFP parser today.
