
Enhanced AI OCR Extraction Pipeline for Scientific Literature

Author: Priyankesh

Introduction

Scientific research articles hold critical information, but much of it is locked inside complex PDFs, tables, and figures that are difficult to read automatically. At Extralit, we make scientific literature more readable, organized, and searchable for scientists.

Throughout my GSoC experience with Extralit Labs, I worked on these issues under Project Idea #2: AI OCR Extraction Pipeline Improvements in Scientific Literature.

The Problem

Valuable information in academic literature gets buried in complex tables and figures, which conventional OCR tools cannot accurately extract.

Extralit already had a simple prototype for extracting text and tables from PDFs. Although it handled simple documents well, it struggled with:

* Complex table structures that required manual correction

* Preserving document structure, including headings and sections, for accurate RAG retrieval

* Processing large volumes of PDFs efficiently

The pipeline needed to be more accurate and efficient, capable of handling hundreds of life science research articles regularly, even under limited computing resources.

Methodology

The methodology behind our enhanced OCR extraction pipeline revolves around a distributed architecture that leverages the strengths of both the Extralit Server and a specialized Extralit-HF-Space environment.

When a researcher uploads a PDF, the Extralit Server saves it to S3 storage and initiates the workflow by enqueuing an initial analysis job. This job uses the Marker library to perform a fast layout analysis, identifying bounding boxes for tables, figures, and text, as well as page margins. The results are saved in PostgreSQL and used in subsequent extraction steps.
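
As a minimal sketch, enqueuing that first job with Redis RQ could look like the following. RQ's `Queue.enqueue` API is real; the job import path, queue name, and arguments are hypothetical stand-ins for Extralit's actual internals.

```python
# Hypothetical sketch: hand the uploaded PDF off to a background worker.
# RQ's Queue/enqueue API is real; the job import path, queue name, and
# arguments are illustrative stand-ins for Extralit's actual internals.
from redis import Redis
from rq import Queue

from extralit.jobs import async_marker_layout_job  # assumed module path

redis_conn = Redis(host="localhost", port=6379)
layout_queue = Queue("layout", connection=redis_conn)

def on_pdf_uploaded(document_id: str, s3_key: str) -> None:
    # Enqueue layout analysis; a worker picks it up asynchronously,
    # so the API server stays responsive during heavy processing.
    layout_queue.enqueue(
        async_marker_layout_job, document_id, s3_key, job_timeout="15m"
    )
```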

High-Level Approach

The pipeline operates across two main components:

* Extralit Server

  * Manages PDF uploads and API requests, and orchestrates all background tasks.

  * Stores files in S3 and tracks metadata in PostgreSQL.

* Extralit-HF-Space

  * Handles GPU-intensive OCR and text extraction using PyMuPDF.

  * Converts PDFs into hierarchically structured Markdown, preserving sections and headings.


Additional components supporting this workflow include:

* Redis RQ for asynchronous job queuing and parallel processing.

* Workers that process queued jobs in the background.

* VectorDB to store embedded sections for semantic search and retrieval.

* Pydantic Models to let users specify custom extraction outputs (e.g., text, tables, figures).
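
A minimal sketch of such a model is shown below; the field names are illustrative, not Extralit's actual schema.

```python
# Hypothetical sketch of a Pydantic model controlling extraction outputs.
# Field names are illustrative, not Extralit's actual schema.
from pydantic import BaseModel, Field

class ExtractionOptions(BaseModel):
    extract_text: bool = Field(True, description="Emit structured Markdown text")
    extract_tables: bool = Field(True, description="Emit detected tables")
    extract_figures: bool = Field(False, description="Emit figure images")

# Validate user-supplied JSON into a typed options object (Pydantic v2):
options = ExtractionOptions.model_validate({"extract_figures": True})
```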

Workflow Overview

1. Document Upload

   * Researchers upload PDFs via the Extralit web page.

   * PDFs are stored in S3, and a new document record is created in PostgreSQL.

   * An initial `async_marker_layout_job` is enqueued to begin layout analysis.

2. Layout Analysis (Marker on Modal)

   * Marker performs structural analysis (`force_ocr: False`), quickly detecting tables, figures, and text blocks without full OCR.

   * Modal provides the GPU-powered, serverless compute (A100/H100) to run Marker efficiently (a minimal sketch follows this list).

   * Output includes bounding boxes and layout metadata, stored in PostgreSQL for reference.

3. Text Extraction (Markdown)

   * The `pymupdf_to_markdown_job` runs in Extralit-HF-Space or a similar compute environment.

   * PyMuPDF converts the PDF into structured Markdown, preserving document hierarchy.

   * Table of Contents (TOC) or heuristic header detection ensures logical sectioning.

   * Extracted Markdown and metadata are stored in PostgreSQL.

4. Section Embedding & VectorDB

   * Each section is embedded and saved in VectorDB for semantic search and retrieval (a sketch follows this overview).

   * Users can control which outputs (text, tables, figures) to generate via Pydantic models.
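
As a rough sketch of step 2, the Modal-hosted layout job might look like this. The Modal APIs (`modal.App`, `@app.function(gpu=...)`) are real; the Marker calls assume the `marker-pdf` package's `PdfConverter` interface, which varies across versions, so treat the body as illustrative.

```python
# Rough sketch of the Modal-hosted layout step (step 2 above).
# The Modal APIs are real; the Marker invocation assumes marker-pdf's
# PdfConverter interface and may need adjusting per version.
import modal

image = modal.Image.debian_slim().pip_install("marker-pdf")
app = modal.App("extralit-marker-layout", image=image)

@app.function(gpu="A100", timeout=600)
def marker_layout(pdf_bytes: bytes) -> dict:
    import tempfile
    from marker.converters.pdf import PdfConverter  # assumed import path
    from marker.models import create_model_dict

    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        converter = PdfConverter(
            artifact_dict=create_model_dict(),
            config={"force_ocr": False},  # layout analysis only, no full OCR
        )
        rendered = converter(tmp.name)
    # Return block-level layout metadata (bounding boxes, page info, ...)
    return rendered.metadata
```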

This architecture ensures that the Extralit Server remains responsive while offloading heavy OCR and layout detection tasks to scalable, cloud-based compute environments like Modal and Hugging Face Spaces.
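
To make step 4 concrete, here is a hedged sketch of the embedding stage. sentence-transformers is one plausible embedding library (an assumption, not necessarily what Extralit uses), and the returned records mimic a generic vector-store upsert payload rather than our actual VectorDB schema.

```python
# Hedged sketch of step 4: embed Markdown sections for semantic retrieval.
# sentence-transformers is an assumed model choice; the record layout is
# a generic vector-store payload, not Extralit's actual VectorDB schema.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_sections(document_id: str, sections: list[dict]) -> list[dict]:
    """Embed {'heading': ..., 'text': ...} sections for semantic search."""
    texts = [f"{s['heading']}\n{s['text']}" for s in sections]
    vectors = model.encode(texts)  # one embedding vector per section
    return [
        {
            "id": f"{document_id}:{i}",
            "vector": vec.tolist(),
            "metadata": {"heading": s["heading"]},
        }
        for i, (s, vec) in enumerate(zip(sections, vectors))
    ]
```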


End-User Experience

* Extralit Web Page: Upload PDFs and receive structured Markdown output containing hierarchical headings, tables, and figures

* Extralit Hub: Extraction job monitoring, document management, server status, and progress tracking of tasks in the queue

* Custom Output Control: The user specifies which parts of the document to extract via Pydantic models

* Self-Hosted Option:

  * Sign up on Extralit Hub and connect Supabase credentials.

  * The deployment of the Extralit server is automated.

  * Work is underway to automate the deployment of Marker on Modal for each user instance.

Users can also access Extralit directly through our Hugging Face Space deployment, which makes the pipeline globally available without requiring local setup. This enables researchers anywhere in the world to extract structured data from PDFs using shared compute resources, regardless of their local hardware capabilities.

This gives users flexibility; they can use the main Extralit server or run a personal instance in the cloud, with heavy computations automatically handled by Modal and parallelized via Redis RQ workers.

Challenges and Optimizations

1. OCR Latency

One of the first challenges we faced was the time taken to process PDFs, especially large, multi-page research papers. While Marker and PyMuPDF deliver highly accurate text extraction, the processing speed significantly decreases when handling dense layouts or hundreds of pages, creating a bottleneck in batch processing scenarios.

To address this, we implemented Redis RQ to distribute OCR jobs across multiple workers, enabling several PDFs to be processed concurrently rather than sequentially. On limited CPU resources, the system intelligently queues and schedules tasks to keep the Extralit server responsive.
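
Starting those workers takes only a few lines with RQ's real `Worker` API; the queue names below are illustrative:

```python
# Minimal sketch: a worker process that consumes queued OCR jobs.
# Worker/Queue are real RQ APIs; the queue names are illustrative.
from redis import Redis
from rq import Queue, Worker

redis_conn = Redis()
worker = Worker(
    [Queue("layout", connection=redis_conn),
     Queue("markdown", connection=redis_conn)],
    connection=redis_conn,
)
# Run several of these processes to handle PDFs concurrently.
worker.work()
```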

This approach is throughput- and stability-oriented, prioritizing consistent performance for a wide range of workloads, even at the cost of some compute efficiency.

2. Table and Figure Segmentation

Scientific PDFs often contain complex tables with multi-level headers or embedded figures, which makes extraction significantly harder than simple text parsing. Standard OCR tools tend to flatten or misalign table cells, causing the loss of structural relationships between rows and columns.

Currently, Extralit extracts such tables as images when structure recognition fails. However, this remains an area for improvement. We plan to integrate Vision Transformer–based models, such as Table-Transformer, which understand spatial layouts and visual cues. This would enable the extraction of both textual and structural information from complex tables, improving machine readability and minimizing post-processing.
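
As an exploratory sketch (not shipped code), the Hugging Face transformers implementation of Table-Transformer could be wired in roughly like this:

```python
# Exploratory sketch: recover table structure with Table-Transformer.
# TableTransformerForObjectDetection and AutoImageProcessor are real
# transformers APIs; integrating them into Extralit is still future work.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

CHECKPOINT = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
model = TableTransformerForObjectDetection.from_pretrained(CHECKPOINT)

def recognize_table_structure(table_image: Image.Image) -> dict:
    inputs = processor(images=table_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert raw logits into labeled boxes (rows, columns, header cells).
    target_sizes = torch.tensor([table_image.size[::-1]])  # (height, width)
    return processor.post_process_object_detection(
        outputs, threshold=0.7, target_sizes=target_sizes
    )[0]
```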

3. Document Structure and Hierarchy

Another major hurdle was preserving the logical structure of research papers—including headings, subheadings, and sections. PDFs rarely follow a consistent layout, and slight variations in font or indentation can break section detection.

Early heuristic-based header recognition approaches yielded unreliable results, particularly for documents lacking a clear visual hierarchy. To improve this, we leveraged the PDF's embedded Table of Contents (read via PyMuPDF, when available) and combined it with font-size-based heuristics using the `IdentifyHeaders` helper.

This hybrid method proved to be both reliable and efficient, converting raw text into structured Markdown while maintaining the readability and organization of the original paper.
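
In code, this hybrid approach is compact. `IdentifyHeaders` and `to_markdown` are real pymupdf4llm APIs; the TOC preference noted in the comment is a simplified description of logic handled elsewhere in the pipeline.

```python
# Simplified sketch of the hybrid header detection described above.
# IdentifyHeaders and to_markdown are real pymupdf4llm APIs.
import pymupdf  # PyMuPDF
import pymupdf4llm

def pdf_to_structured_markdown(path: str) -> str:
    doc = pymupdf.open(path)
    # Font-size-based header detection; the pipeline prefers the PDF's
    # embedded TOC (doc.get_toc()) when one exists and falls back to this.
    headers = pymupdf4llm.IdentifyHeaders(doc)
    markdown = pymupdf4llm.to_markdown(doc, hdr_info=headers)
    doc.close()
    return markdown
```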

4. User Deployment Automation

Enabling researchers to easily deploy their own OCR servers presented another challenge. Although users can already sign up via the Extralit Hub and set up servers using credentials from Supabase, automating the Marker on Modal deployment has been more complex.

This limitation arises because there is currently no direct API or OAuth support for user-specific compute provisioning. To overcome this, we are exploring solutions such as pre-configured templates or minimal-setup scripts that allow researchers to spin up their own Extralit instance with a single command.

Our long-term goal is to make the infrastructure behind Extralit self-service and globally accessible.

5. Efficient Job Scheduling on Limited Compute

Heavy OCR workloads on shared environments, such as Hugging Face Spaces, required careful job management. We use Redis RQ not only for concurrency but also for scheduling long-running jobs on bounded CPU resources.

This setup ensures that jobs are queued, resumed, or retried gracefully without overloading the system. It allows for continuous document processing, even under tight compute limits—an important step toward democratizing access to large-scale scientific data extraction.
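
RQ's built-in `Retry` (a real API) is what makes this graceful retry behavior cheap to express; the job path and parameters below are illustrative:

```python
# Sketch: enqueue a long-running job with a timeout and bounded retries,
# so failures on constrained compute are retried rather than lost.
from redis import Redis
from rq import Queue, Retry

queue = Queue("markdown", connection=Redis())
queue.enqueue(
    "extralit.jobs.pymupdf_to_markdown_job",  # assumed dotted path
    args=("document-123",),
    job_timeout="30m",                # cap runtime on shared CPUs
    retry=Retry(max=3, interval=60),  # up to 3 retries, 60s apart
)
```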

Next Steps

While the pipeline already provides accurate and structured extraction, several improvements are planned to enhance usability, scalability, and research utility:

1. Advanced Table and Figure Extraction

* Implement Vision Transformer-based models to improve accuracy on complex tables and extract data from figures.

2. Improved Parallel Processing and Scalability

* Optimize Redis RQ workers and caching mechanisms to handle larger volumes of PDFs efficiently.

3. Enhanced User Deployment

* Continue automating self-hosted Extralit server deployments on Hugging Face, including Marker on Modal, so users can easily run their own instances.

4. Feature Expansion

* Add more user controls in Extralit Hub for customized extraction, job monitoring, and server management.

5. Open-Source Support and Documentation

* Provide clear setup guides, reproducible deployment examples, and community-friendly documentation.