Most PDF-extraction stacks return either plain text or fragile bounding boxes. This project instead focuses on recovering readable structure and element semantics so that documents become machine-actionable. It combines vision models and classical ML to turn PDFs into structured JSON, Markdown, or HTML with reading order, table and formula extraction, and optional translation.
What Sets It Apart
- Dual-model option: Vision Grid Transformer (VGT) for higher segmentation accuracy and LightGBM models for much faster, lower-resource processing — so you can choose accuracy for research outputs or speed for bulk pipelines.
- End-to-end microservice design: Docker-first, GPU support, a REST API for integration and a Gradio UI for quick inspection — so teams can run locally or containerize for production without heavy glue code.
- Practical extraction features: OCR (Tesseract) with 150+ languages, table extraction to HTML, formula extraction to LaTeX, and an algorithm to determine reading order — so extracted content is closer to human-readable and downstream-ready.
- Translation & model interoperability: Optional Ollama-powered translation and model-selection hooks (Hugging Face/Docker), enabling multilingual pipelines and easier model swaps.
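To make the reading-order and format-conversion features above concrete, here is a minimal sketch of rendering extracted segments as Markdown. The segment fields (`page`, `order`, `type`, `content`) and the type labels are assumptions made for illustration, not the service's documented schema; check the actual API response before relying on them.

```python
from typing import Dict, List

# Hypothetical segment records. The field names and type labels are
# assumptions for this sketch, not the project's documented output schema.
SEGMENTS: List[Dict] = [
    {"page": 1, "order": 2, "type": "Text", "content": "Body paragraph."},
    {"page": 1, "order": 1, "type": "Title", "content": "Annual Report"},
    {"page": 1, "order": 3, "type": "Formula", "content": "E = mc^2"},
    {"page": 2, "order": 1, "type": "Table",
     "content": "<table><tr><td>1</td></tr></table>"},
]


def to_markdown(segments: List[Dict]) -> str:
    """Render segments as Markdown, sorted by page and reading order."""
    ordered = sorted(segments, key=lambda s: (s["page"], s["order"]))
    parts = []
    for seg in ordered:
        kind, content = seg["type"], seg["content"]
        if kind == "Title":
            parts.append(f"# {content}")
        elif kind == "Formula":
            # Formulas are assumed to arrive as LaTeX strings.
            parts.append(f"$$\n{content}\n$$")
        else:
            # Text passes through unchanged; tables are assumed to be HTML,
            # which most Markdown renderers accept inline.
            parts.append(content)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_markdown(SEGMENTS))
```

Sorting on `(page, order)` is the only step that depends on reading order; everything else is a per-type formatting rule that can be extended for other segment types.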
Who It's For and Trade-offs
This is a good fit if you need automated, structured PDF parsing at scale (data engineering, legal and human-rights documentation, research ingestion) and want both an interactive UI and API-first deployment. Look elsewhere if you need a lightweight single-file extractor (this repo is a full microservice and includes model binaries and configuration) or a hosted service backed by an enterprise SLA: this is an open-source, self-hosted project that assumes you have DevOps capacity.
Where It Fits
Compared with single-purpose OCR libraries, this project sits between an OCR engine and a full document understanding platform: it adds layout segmentation, semantic labeling, and format conversion so the output is ready for search, indexing, or content pipelines.
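As a rough illustration of why semantic labels help with search and indexing, the sketch below drops page furniture and groups the remaining text into one record per page. The segment fields and the `Page header` / `Page footer` labels are hypothetical placeholders; substitute whatever labels the service actually returns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Labels to exclude from the index. These names are assumptions for this
# sketch, not a confirmed list of the project's segment types.
SKIP_TYPES: Tuple[str, ...] = ("Page header", "Page footer")


def build_index_records(
    doc_id: str,
    segments: Iterable[Dict],
    skip_types: Tuple[str, ...] = SKIP_TYPES,
) -> List[Dict]:
    """Return one searchable record per page, in reading order."""
    pages = defaultdict(list)
    for seg in sorted(segments, key=lambda s: (s["page"], s["order"])):
        if seg["type"] in skip_types:
            continue  # headers and footers add noise to search results
        pages[seg["page"]].append(seg["content"])
    return [
        {"id": f"{doc_id}-p{page}", "page": page, "text": "\n".join(texts)}
        for page, texts in sorted(pages.items())
    ]


if __name__ == "__main__":
    sample = [
        {"page": 1, "order": 1, "type": "Page header", "content": "ACME Corp"},
        {"page": 1, "order": 3, "type": "Text", "content": "Second paragraph."},
        {"page": 1, "order": 2, "type": "Text", "content": "First paragraph."},
        {"page": 2, "order": 1, "type": "Text", "content": "Next page."},
    ]
    for record in build_index_records("report", sample):
        print(record)
```

A plain OCR engine cannot make this distinction, because it returns text without labels; the filtering step is only possible once layout segmentation has tagged each block.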
Quick Notes on Maintenance & Adoption
The repo is actively developed and provides models, examples, and Hugging Face and Docker Hub integrations. docker-compose and Makefile targets simplify deployment, but expect model download steps, plus GPU setup for the VGT path.
