Aman Arora @amaarora X Profile

Aman Arora

@amaarora

Followers

6K

Following

8K

Media

292

Statuses

4K

Lead AI Engineer | prev: W&B

https://t.co/k0LKBJ8CYz

Sydney, New South Wales

Joined June 2014

Aman Arora

@amaarora

3 months

🧵 Most modern LLMs like Qwen, DeepSeek & gpt-oss use YaRN to extend context from 4K→128K tokens. But what led to YaRN? Today I'm proud and excited to share a comprehensive resource on the evolution of positional embeddings such as APE, RoPE, YaRN & variants👇 1/n

2

4

19

Andrej Karpathy

@karpathy

3 hours

https://t.co/Lb6T42n5jl

101

440

3K

Deedy

@deedydas

20 hours

Gemini 3 Flash is insane at OCR. It parses this extremely hard-to-read handwritten letter by Richard Feynman perfectly. It can do ~300 of these for $1. What's crazy is Feynman addresses General Donald J. Kutyna as "Katyna", which Gemini gets. There is no "Meeting Katyna", the

56

128

1K

Aman Arora

@amaarora

20 hours

Paper:

arxiv.org

A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on...

0

Aman Arora

@amaarora

20 hours

Emotional Adaptation Judge:

1

0

Aman Arora

@amaarora

20 hours

Memory ≠ likability! An LLM that can remember facts about a user is not necessarily likable. A new benchmark from Amazon measures LLM likability across 7 different dimensions: emotional adaptation, formality matching, knowledge adaptation, reference

1

0

Maksym Andriushchenko

@maksym_andr

1 day

Interesting finding from our PostTrainBench: Sonnet 4.5 released ~3 months ago can barely improve the performance of base LLMs. But there's been _a lot_ of progress since then: - Opus 4.5 does perform much better - GPT-5.1 Codex Max outperforms the rest by a wide margin!

6

4

84

Will McGugan

@willmcgugan

1 day

Alrighty. The Toad is out of the bag. 👜🐸 Install toad to work with a variety of #AI coding agents with one beautiful terminal interface. Check out the blog post for more information... https://t.co/hMcnfyuMa9 I've been told I'm very authentic on camera. You just can't fake

46

68

517

dex

@dexhorthy

2 days

skills, commands, subagents are HIGH LEVERAGE which means you should probably WRITE THEM BY HAND at least for a while. If you let claude slop-out your instructions into agents/claude.md/skills etc, and you don't read them, it's going to vomit information from the training set,

26

18

237

Aman Arora

@amaarora

1 day

So post-training went from domain-specific finetuning (a few years ago) to: General instruction SFT (chat, multi-turn, tool use, code, summarization) → Reasoning-focused SFT/RL (with <think> tokens for CoT) → RLVR (verifiable rewards for math/code/reasoning boosts) → Preference

0

1

Claude

@claudeai

1 day

Using the extension, Claude Code can test code directly in the browser to validate its work. Claude can also see client-side errors via console logs. Try it out by running /chrome in the latest version of Claude Code.

51

145

1K

Aman Arora

@amaarora

1 day

In the latest benchmarks released by Nemotron3, it's interesting that AIME-25 gets close to a 100% score with tool calls, whereas for GPQA there is not much difference (only about 2-3%).

0

Aman Arora

@amaarora

1 day

HF models page:

huggingface.co

0

Aman Arora

@amaarora

1 day

Nanbeige4-3b: A family of small but high-performing language models. All models are open-weight and released on Hugging Face. Pretrained on 23T tokens and fine-tuned on 30M instructions, followed by knowledge distillation using the proposed Dual Preference Distillation (DPD) method &

1

0

1

Omar Sanseviero

@osanseviero

1 day

Introducing FunctionGemma 🤏 270M model for function calling 📱 can run on your phone, browser or other devices 🤖 designed to be specialized for your own tasks https://t.co/vU0YAeWWmH

46

155

1K

Generative History

@HistoryGPT

2 days

Gemini 3 Flash is as good at reading handwriting as the average human (Pro is expert-human level). It is much better than both GPT-5.2 and Opus 4.5, with a character-level error rate of 1.43% and a word-level error rate of 2.74%. This is a 47-63% improvement over 2.5 Flash, the

26

102

957

Maxime Labonne

@maximelabonne

2 days

You always think you're safe until your job becomes a benchmark.

Maksym Andriushchenko

@maksym_andr

2 days

We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs. We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years. 🔗 https://t.co/dVSSHkpAE1 📂 https://t.co/vqZNrQw66z 1/n

14

34

764

Ethan Mollick

@emollick

2 days

No signs of an end to rapid gains in AI ability at ever-decreasing costs (which is a log scale) yet. I have to update this monthly or more frequently at this point. All AI benchmarks are flawed, but GPQA Diamond has been a pretty good one, though likely close to being maxed out.

25

89

719

NotebookLM

@NotebookLM

2 days

Slide Decks are officially our second most popular studio output! To celebrate, here are a few of our favorite ways to make the most of this feature: 1. Refine your existing slides— Upload any presentation to @NotebookLM along with your logo, brand guidelines, etc. Then, prompt

63

248

2K

François Chollet

@fchollet

2 days

Gemini 3 Flash across different test-time compute levels (green line below) represents a new score/cost Pareto frontier on ARC-AGI-2. Congrats to @demishassabis and @sundarpichai on the launch!

30

83

1K

Charlie Marsh

@charliermarsh

3 days

Announcing the Beta release of ty: an extremely fast type checker and language server for Python, written in Rust. We now use ty exclusively in our own projects and are ready to recommend it to motivated users. 10x, 50x, even 100x faster than existing type checkers and LSPs.

93

282

3K