An obscure Chinese PDF content extraction model for max accuracy and min cost

Presents a lesser-known Chinese PDF content extraction model that achieves high accuracy, with evaluation, error analysis, setup code, and sglang optimizations that yield a 20× speedup.

Overview

We needed higher accuracy and lower cost when extracting the contents of a PDF into a structured format. After evaluating many open-source and commercial solutions, we found one that achieves extremely high accuracy at low cost. However, it is not well known and has very little community around it, so getting it running required extra work.

We’ll give a brief overview of document content extraction models, explain our evaluation, then show specific accuracy errors and our evaluation results. We conclude with the code we used to first get the model working, and then to make it run 20× faster using sglang.

Tech stack