Vik Paruchuri Profile
Vik Paruchuri

@VikParuchuri

Followers
15K
Following
2K
Media
186
Statuses
2K

Open source AI. Founder of @datalabto. Past: founded @dataquestio.

Brooklyn, NY
Joined June 2012
@datalabto
Datalab
10 days
We shipped Chandra (our SOTA OCR model) but base latency wasn't good enough for production. So we trained an Eagle3 draft model:
✅ 3× lower p99 latency
✅ 40% higher throughput
✅ zero accuracy loss
Here's how we made Chandra OCR 3× faster with Eagle3 speculative decoding 🧵
3
6
36
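To illustrate why speculative decoding can speed up generation without changing the output, here is a minimal toy sketch of the greedy variant. It is not Datalab's implementation and does not model anything Eagle3-specific: `target_model`, `draft_model`, and the arithmetic inside them are made-up stand-ins. A cheap draft proposes several tokens, and the target keeps only the ones it would have produced itself.

```python
from typing import Callable, List

# Assumption: a "model" here is any function mapping a token prefix
# to its single greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large, slow model."""
    return (sum(prefix) * 7 + 3) % 11

def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for the small draft model: usually agrees with the
    target, but is deliberately wrong at every fourth position."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11

def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding with up to k drafted tokens per round."""
    tokens = list(prompt)
    goal = len(prompt) + n
    while len(tokens) < goal:
        # 1. The draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position. A real system would
        #    score all positions in one batched forward pass.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens

baseline = greedy_decode(target_model, [1, 2, 3], 12)
speculative = speculative_decode(target_model, draft_model, [1, 2, 3], 12)
print(baseline == speculative)
```

Because every kept token is one the target would have chosen anyway, the output matches plain greedy decoding. Any speedup comes from verifying several drafted tokens per target pass.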
@VikParuchuri
Vik Paruchuri
13 days
DM, email, or see link for other application methods if you're interested!
0
0
4
@VikParuchuri
Vik Paruchuri
13 days
Salary - 175k-250k, up to 0.5% equity, which is very meaningful at this stage. Read more here - https://t.co/dM1JFLSE2U .
1
0
6
@VikParuchuri
Vik Paruchuri
13 days
We recently released Chandra, the highest-benchmarking OCR model - https://t.co/vgMz7z9wfj . Culture at @datalabto - low ego, GSD, talk to customers. This week, we shipped Chandra 1.1, track changes, major speed improvements, better Excel support, chart understanding.
datalab.to
1
0
7
@VikParuchuri
Vik Paruchuri
13 days
We're looking for an engineer who loves talking to customers. We have an insane amount of inbound from frontier AI labs to F500s - work on prototypes, help close deals, and contribute to our SOTA OCR models. In person in NYC.
8
9
156
@VikParuchuri
Vik Paruchuri
14 days
If you're struggling to eval OCR tools, we're happy to help by running our internal eval tool - just DM me. If you want to collaborate on open external benchmarks, let me know - we're working on something here.
2
0
3
@VikParuchuri
Vik Paruchuri
14 days
But we need to rethink our approach to OCR benchmarking - do better matching, have wider domain coverage, and have harder edge cases.
1
0
5
@VikParuchuri
Vik Paruchuri
14 days
Read our full post here - https://t.co/8P4zXYqHs2 . This is not to minimize in any way the work of the AI2 team - benchmarks take a lot of work, and I'm very appreciative of what they did with the OlmOCR benchmark.
datalab.to
1
0
2
@VikParuchuri
Vik Paruchuri
14 days
This doesn't even cover training on benchmarks - some benchmarks have cross-contamination with common training sets. You can see this when model vibe checks don't match benchmark results.
1
0
2
@VikParuchuri
Vik Paruchuri
14 days
Two conclusions from this:
- most improvement on this benchmark will come from matching benchmark formatting
- there are only a couple of percentage points of real accuracy gains left
1
0
2
@VikParuchuri
Vik Paruchuri
14 days
Much of the rest of the distance (est 3-4%) is subjective or incorrect ground truth. Here, the benchmark penalizes you if you lowercase "Republican party", even though the image shows it lowercase.
2
0
2
@VikParuchuri
Vik Paruchuri
14 days
This shows that a significant % of the remaining distance to get to 100% on the benchmark involves aligning your model formatting to the benchmark.
1
0
1
@VikParuchuri
Vik Paruchuri
14 days
We ran an experiment where we gave Gemini the failed test case, the failure reason, and the Chandra markdown, and asked it to judge correctness. This enables fuzzy matching.
1
0
1
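A rough sketch of how the LLM-as-judge experiment in the tweet above could be wired up. The prompt wording, the `FailedCase` fields, and the `ask_llm` callable are all assumptions for illustration. The real experiment used Gemini, which is replaced here by a stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailedCase:
    test_case: str       # what the benchmark expected
    failure_reason: str  # why the strict matcher rejected the output
    model_markdown: str  # the OCR model's markdown output

def build_judge_prompt(case: FailedCase) -> str:
    """Assemble the three inputs named in the tweet into one prompt
    (hypothetical wording)."""
    return (
        "An OCR benchmark test failed under strict matching.\n"
        f"Test case:\n{case.test_case}\n\n"
        f"Failure reason:\n{case.failure_reason}\n\n"
        f"Model markdown:\n{case.model_markdown}\n\n"
        "Is the model output correct apart from formatting? "
        "Answer YES or NO."
    )

def judge_case(case: FailedCase, ask_llm: Callable[[str], str]) -> bool:
    """Return True if the judge model considers the output correct."""
    answer = ask_llm(build_judge_prompt(case))
    return answer.strip().upper().startswith("YES")

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; always answers YES."""
    return "YES"

case = FailedCase(
    test_case="Text must contain: **Total** revenue",
    failure_reason="Exact string not found",
    model_markdown="__Total__ revenue",
)
verdict = judge_case(case, fake_llm)
print(verdict)
```

Passing the model call in as a plain callable keeps the matching logic testable without network access. Swapping in a real client only requires a function that takes a prompt string and returns the reply text.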
@VikParuchuri
Vik Paruchuri
14 days
The OlmOCR benchmark is, in my opinion, the best external OCR benchmark. However, it has issues common to all benchmarks. One is matching with edit distance - for example, this is marked wrong even though it renders the same.
1
0
3
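A made-up illustration of the edit-distance issue described in the tweet above. The two strings and the `normalize` helper are assumptions, not taken from the OlmOCR benchmark or its matcher. The point is only that two markdown snippets can render identically while a character-level distance reports them as different.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution
            ))
        prev = curr
    return prev[-1]

def normalize(md: str) -> str:
    """Hypothetical normalizer: unify bold markers and collapse whitespace."""
    md = md.replace("__", "**")   # __bold__ and **bold** render the same
    md = re.sub(r"\s+", " ", md)  # runs of whitespace render as one space
    return md.strip()

expected = "**Total**  revenue"   # made-up ground truth
predicted = "__Total__ revenue"   # made-up model output

raw_distance = levenshtein(expected, predicted)
normalized_distance = levenshtein(normalize(expected), normalize(predicted))
print(raw_distance, normalized_distance)
```

A strict matcher sees a nonzero distance on the raw strings, while the normalized comparison treats them as equal. A fixed normalizer only covers the formatting variants someone thought to list, which is one reason to consider a fuzzy judge instead.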
@VikParuchuri
Vik Paruchuri
14 days
Chandra OCR scores 93.9 on the OlmOCR benchmark, if we correct for minor formatting differences. Let's discuss what this says about OCR benchmarking 🧵
10
13
170
@VikParuchuri
Vik Paruchuri
16 days
Credit to @voberoi for this! He found nice ways to preserve formatting in the output while merging in changes.
0
0
3
@VikParuchuri
Vik Paruchuri
16 days
While working with legal tech companies, we found that this feature is critical for many workflows. A simple example is deciding which edits to accept/reject. Let us know if you want to try it - happy to chat, or set you up with some credits.
1
0
3
@VikParuchuri
Vik Paruchuri
16 days
Try it out here - https://t.co/31ABATmh2C (select the "track changes" extra and upload a docx file). You can use it via API by passing "track_changes" in the "extras" field.
1
0
4
@VikParuchuri
Vik Paruchuri
16 days
The Datalab API can now extract redlines and comments into clean markdown! This is great for analyzing legal documents with LLMs.
8
9
72
@VikParuchuri
Vik Paruchuri
20 days
Chandra OCR is here if you haven't seen it - https://t.co/7ApMvLZLWL . We have a hosted playground at https://t.co/6oRFS5G46P .
1
0
7