Vik Paruchuri
@VikParuchuri
Followers 15K · Following 2K · Media 186 · Statuses 2K
Open source AI. Founder of @datalabto Past: founded @dataquestio
Brooklyn, NY
Joined June 2012
We shipped Chandra (our SOTA OCR model) but base latency wasn't good enough for production. So we trained an Eagle3 draft model: ✅3× lower p99 latency ✅40% higher throughput ✅zero accuracy loss Here's how we made Chandra OCR 3× faster with Eagle3 speculative decoding 🧵
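For context, a toy sketch of what draft-model speculative decoding buys you. Everything here is illustrative: the "models" are stand-in functions, not Chandra or the Eagle3 draft head. The key property is that the output is identical to greedy decoding with the target model alone (zero accuracy loss), while the slow target model only runs verification.

```python
TARGET_TEXT = "the quick brown fox jumps"

def target(ctx):
    # Stand-in for the big model: greedily continues the reference string.
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "."

def draft(ctx):
    # Stand-in for the cheap draft model: agrees with the target
    # most of the time, but is wrong at every 5th position.
    return "x" if len(ctx) % 5 == 4 else target(ctx)

def speculative_decode(target, draft, prompt, k=4, max_len=20):
    """Greedy speculative decoding: the draft proposes k tokens, the
    target verifies them and keeps the longest agreeing prefix."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        # 1. Draft proposes k tokens autoregressively (cheap).
        ctx, proposal = tokens[:], []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. Target verifies: accept tokens while it agrees.
        accepted = []
        for t in proposal:
            if target(tokens + accepted) == t:
                accepted.append(t)
            else:
                break
        # 3. On a mismatch, take the target's own token so we always advance.
        if len(accepted) < len(proposal):
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return "".join(tokens[:max_len])
```

When the draft is right most of the time, most verification passes emit several tokens at once, which is where the latency and throughput wins come from.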
DM, email, or see link for other application methods if you're interested!
Salary: $175k-$250k, with up to 0.5% equity, which is very meaningful at this stage. Read more here: https://t.co/dM1JFLSE2U
We recently released Chandra, the highest-benchmarking OCR model - https://t.co/vgMz7z9wfj . Culture at @datalabto: low ego, GSD, talk to customers. This week, we shipped Chandra 1.1: track changes, major speed improvements, better Excel support, and chart understanding.
We're looking for an engineer who loves talking to customers. We have an insane amount of inbound from frontier AI labs to F500s - work on prototypes, help close deals, and contribute to our SoTA OCR models. In person in NYC.
If you're struggling to eval OCR tools, we're happy to help by running our internal eval tool - just DM me. If you want to collaborate on open external benchmarks, let me know - we're working on something here.
But we need to rethink our approach to OCR benchmarking - do better matching, have wider domain coverage, and have harder edge cases.
Read our full post here - https://t.co/8P4zXYqHs2 . This is not to minimize in any way the work of the AI2 team - benchmarks take a lot of work, and I'm very appreciative of what they did with the OlmOCR benchmark.
This doesn't even cover training on benchmarks - some benchmarks have cross-contamination with common training sets. You can see this when model vibe checks don't match benchmark results.
Two conclusions from this:
- most improvement on this benchmark will come from matching benchmark formatting
- there are only a couple of percentage points of real accuracy gains left
Much of the rest of the distance (est. 3-4%) is subjective or incorrect ground truth. Here, the benchmark penalizes you for lowercasing "Republican party", even though the image shows it lowercase.
This shows that a significant % of the remaining distance to get to 100% on the benchmark involves aligning your model formatting to the benchmark.
We ran an experiment where we gave Gemini the failed test case, the failure reason, and the Chandra markdown, and asked it to judge correctness. This enables fuzzy matching.
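A rough sketch of that judging setup. The function names and prompt wording are mine, and `judge` stands in for whatever wrapper you have around a Gemini call; only the inputs (failed case, failure reason, model markdown) come from the experiment described above.

```python
def is_formatting_mismatch(judge, failure_reason, ground_truth, model_markdown):
    """Ask an LLM judge whether a benchmark failure is a real OCR error
    or just formatting. `judge` is any callable: prompt str -> reply str."""
    prompt = (
        "An OCR benchmark marked this model output as incorrect.\n"
        f"Failure reason: {failure_reason}\n"
        f"Ground truth: {ground_truth}\n"
        f"Model output (markdown): {model_markdown}\n"
        "Reply CORRECT if the output preserves the document content and the "
        "failure is only a formatting difference; otherwise reply INCORRECT."
    )
    return judge(prompt).strip().upper().startswith("CORRECT")

def stub_judge(prompt):
    # Stand-in for a real Gemini call so the sketch runs offline.
    return "CORRECT"
```

Counting cases the judge flags as formatting-only gives the corrected score.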
The OlmOCR benchmark is, in my opinion, the best external OCR benchmark. However, it has issues common to all benchmarks. One is matching with edit distance - for example, this is marked wrong even though it renders the same.
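To make the edit-distance point concrete, here's a minimal illustration (the list-marker example is mine, not the actual benchmark case): two markdown strings that render to the identical bullet list still get a nonzero Levenshtein distance, so an edit-distance metric scores the prediction as wrong.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance with a rolling 1-D DP row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

# Both snippets render as the same two-item bullet list:
gt = "* item one\n* item two"
pred = "- item one\n- item two"
print(edit_distance(gt, pred))  # prints 2: penalized despite identical rendering
```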
Chandra OCR scores 93.9 on the OlmOCR benchmark, if we correct for minor formatting differences. Let's discuss what this says about OCR benchmarking 🧵
Credit to @voberoi for this! He found nice ways to preserve formatting in the output while merging in changes.
While working with legal tech companies, we found that this feature is critical for many workflows. A simple example is deciding which edits to accept/reject. Let us know if you want to try it - happy to chat, or set you up with some credits.
Try it out here - https://t.co/31ABATmh2C (select the "track changes" extra and upload a docx file). You can use it via API by passing "track_changes" in the "extras" field.
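In code, the request shape looks roughly like this. Only the "extras"/"track_changes" names come from the tweet; the endpoint URL, auth header name, and whether "extras" is a list or a string are guesses, so check the Datalab API docs before using.

```python
# Hypothetical endpoint and header names; consult the real API docs.
API_URL = "https://api.datalab.to/v1/convert"

def build_track_changes_request(docx_path, api_key):
    """Assemble the pieces of a conversion request with the
    'track_changes' extra enabled (wire format is a guess)."""
    return {
        "url": API_URL,
        "headers": {"X-Api-Key": api_key},
        "files": {"file": docx_path},
        "data": {"extras": ["track_changes"]},
    }

req = build_track_changes_request("contract.docx", "YOUR_API_KEY")
```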
The Datalab API can now extract redlines and comments into clean markdown! This is great for analyzing legal documents with LLMs.
Chandra OCR is here if you haven't seen it - https://t.co/7ApMvLZLWL . We have a hosted playground at https://t.co/6oRFS5G46P .