Vik Paruchuri
@VikParuchuri
Followers 15K · Following 2K · Media 186 · Statuses 2K
Open source AI. Founder of @datalabto Past: founded @dataquestio
Brooklyn, NY
Joined June 2012
We shipped Chandra (our SOTA OCR model) but base latency wasn't good enough for production. So we trained an Eagle3 draft model: ✅3× lower p99 latency ✅40% higher throughput ✅zero accuracy loss Here's how we made Chandra OCR 3× faster with Eagle3 speculative decoding 🧵
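For context, a toy sketch of what draft-model speculative decoding buys you. Everything here is illustrative: the "models" are stand-in functions, not Chandra or the Eagle3 draft head. The key property is that the output is identical to greedy decoding with the target model alone (zero accuracy loss), while the slow target model only runs verification.

```python
TARGET_TEXT = "the quick brown fox jumps"

def target(ctx):
    # Stand-in for the big model: greedily continues the reference string.
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "."

def draft(ctx):
    # Stand-in for the cheap draft model: agrees with the target
    # most of the time, but is wrong at every 5th position.
    return "x" if len(ctx) % 5 == 4 else target(ctx)

def speculative_decode(target, draft, prompt, k=4, max_len=20):
    """Greedy speculative decoding: the draft proposes k tokens, the
    target verifies them and keeps the longest agreeing prefix."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        # 1. Draft proposes k tokens autoregressively (cheap).
        ctx, proposal = tokens[:], []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. Target verifies: accept tokens while it agrees.
        accepted = []
        for t in proposal:
            if target(tokens + accepted) == t:
                accepted.append(t)
            else:
                break
        # 3. On a mismatch, take the target's own token so we always advance.
        if len(accepted) < len(proposal):
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return "".join(tokens[:max_len])
```

When the draft is right most of the time, most verification passes emit several tokens at once, which is where the latency and throughput wins come from.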
DM, email, or see link for other application methods if you're interested!
Salary: $175k-$250k, with up to 0.5% equity, which is very meaningful at this stage. Read more here: https://t.co/dM1JFLSE2U
We recently released Chandra, the highest-benchmarking OCR model - https://t.co/vgMz7z9wfj . Culture at @datalabto: low ego, GSD, talk to customers. This week, we shipped Chandra 1.1: track changes, major speed improvements, better Excel support, and chart understanding.
We're looking for an engineer who loves talking to customers. We have an insane amount of inbound from frontier AI labs to F500s - work on prototypes, help close deals, and contribute to our SoTA OCR models. In person in NYC.
If you're struggling to eval OCR tools, we're happy to help by running our internal eval tool - just DM me. If you want to collaborate on open external benchmarks, let me know - we're working on something here.
But we need to rethink our approach to OCR benchmarking - do better matching, have wider domain coverage, and have harder edge cases.
Read our full post here - https://t.co/8P4zXYqHs2 . This is not to minimize in any way the work of the AI2 team - benchmarks take a lot of work, and I'm very appreciative of what they did with the OlmOCR benchmark.
This doesn't even cover training on benchmarks - some benchmarks have cross-contamination with common training sets. You can see this when model vibe checks don't match benchmark results.
Two conclusions from this:
- most improvement on this benchmark will come from matching benchmark formatting
- there are only a couple of percentage points of real accuracy gains left
Much of the rest of the distance (est. 3-4%) is subjective or incorrect ground truth. Here, the benchmark penalizes you for lowercasing "Republican party", even though the image shows it lowercase.
This shows that a significant % of the remaining distance to get to 100% on the benchmark involves aligning your model formatting to the benchmark.
We ran an experiment where we gave Gemini the failed test case, the failure reason, and the Chandra markdown, and asked it to judge correctness. This enables fuzzy matching.
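A rough sketch of that judging setup. The function names and prompt wording are mine, and `judge` stands in for whatever wrapper you have around a Gemini call; only the inputs (failed case, failure reason, model markdown) come from the experiment described above.

```python
def is_formatting_mismatch(judge, failure_reason, ground_truth, model_markdown):
    """Ask an LLM judge whether a benchmark failure is a real OCR error
    or just formatting. `judge` is any callable: prompt str -> reply str."""
    prompt = (
        "An OCR benchmark marked this model output as incorrect.\n"
        f"Failure reason: {failure_reason}\n"
        f"Ground truth: {ground_truth}\n"
        f"Model output (markdown): {model_markdown}\n"
        "Reply CORRECT if the output preserves the document content and the "
        "failure is only a formatting difference; otherwise reply INCORRECT."
    )
    return judge(prompt).strip().upper().startswith("CORRECT")

def stub_judge(prompt):
    # Stand-in for a real Gemini call so the sketch runs offline.
    return "CORRECT"
```

Counting cases the judge flags as formatting-only gives the corrected score.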
The OlmOCR benchmark is, in my opinion, the best external OCR benchmark. However, it has issues common to all benchmarks. One is matching with edit distance - for example, this is marked wrong even though it renders the same.
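To make the edit-distance point concrete, here's a minimal illustration (the list-marker example is mine, not the actual benchmark case): two markdown strings that render to the identical bullet list still get a nonzero Levenshtein distance, so an edit-distance metric scores the prediction as wrong.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance with a rolling 1-D DP row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

# Both snippets render as the same two-item bullet list:
gt = "* item one\n* item two"
pred = "- item one\n- item two"
print(edit_distance(gt, pred))  # prints 2: penalized despite identical rendering
```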
Chandra OCR scores 93.9 on the OlmOCR benchmark, if we correct for minor formatting differences. Let's discuss what this says about OCR benchmarking 🧵
Credit to @voberoi for this! He found nice ways to preserve formatting in the output while merging in changes.
While working with legal tech companies, we found that this feature is critical for many workflows. A simple example is deciding which edits to accept/reject. Let us know if you want to try it - happy to chat, or set you up with some credits.
Try it out here - https://t.co/31ABATmh2C (select the "track changes" extra and upload a docx file). You can use it via API by passing "track_changes" in the "extras" field.
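In code, the request shape looks roughly like this. Only the "extras"/"track_changes" names come from the tweet; the endpoint URL, auth header name, and whether "extras" is a list or a string are guesses, so check the Datalab API docs before using.

```python
# Hypothetical endpoint and header names; consult the real API docs.
API_URL = "https://api.datalab.to/v1/convert"

def build_track_changes_request(docx_path, api_key):
    """Assemble the pieces of a conversion request with the
    'track_changes' extra enabled (wire format is a guess)."""
    return {
        "url": API_URL,
        "headers": {"X-Api-Key": api_key},
        "files": {"file": docx_path},
        "data": {"extras": ["track_changes"]},
    }

req = build_track_changes_request("contract.docx", "YOUR_API_KEY")
```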
The Datalab API can now extract redlines and comments into clean markdown! This is great for analyzing legal documents with LLMs.
Chandra OCR is here if you haven't seen it - https://t.co/7ApMvLZLWL . We have a hosted playground at https://t.co/6oRFS5G46P .