Tom Hosking
@tomhosking
Followers
927
Following
4K
Media
171
Statuses
1K
Model merging lead for Command A @cohere. Prev: PhD student in NLP @EdinburghNLP @Edin_CDT_NLP, @BloomsburyAI @UCL @DRWTrading
Edinburgh, Scotland
Joined April 2009
As one of the earliest builders of LLMs, Cohere realized early that enterprises need more than a model -- they need: 1) a secure solution (with private deployment), 2) that connects to their data (Salesforce, email, Slack, or internally defined with MCP), and 3) is powered by an LLM
We believe AI should eliminate the mundane, not compromise your data. That’s why we built North — an agentic AI platform designed for real work, real teams, and extreme security. Here's what makes it different...
1
11
92
We believe AI should eliminate the mundane, not compromise your data. That’s why we built North — an agentic AI platform designed for real work, real teams, and extreme security. Here's what makes it different...
7
42
317
Excited to reveal what I've been working on for the last few months. Command-A-Vision is our new flagship 112B VLM that outperforms Llama 4 Maverick, Mistral Medium/Pixtral Large, GPT 4.1, and others. We release weights on HF https://t.co/7KZUGv2AT3 and hope you'll like it.
4
34
147
How does sparse attention reshape LLM scaling? 🔍 We’re excited to share this work by former @Cohere intern @p_nawrot, “The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.”
Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
1
9
30
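For intuition, here is a minimal sketch of one common training-free sparse-attention pattern, a causal sliding window; the window size is hypothetical and this is not the specific set of patterns the paper evaluates.

# Minimal sliding-window sparse attention (a sketch, not the paper's exact patterns).
import torch

def sliding_window_attention(q, k, v, window=128):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    pos = torch.arange(seq_len, device=q.device)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)      # query position minus key position
    # Mask out future tokens and anything further back than `window` positions.
    masked = (dist < 0) | (dist >= window)
    scores = scores.masked_fill(masked, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v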
TIL @cohere's best LLM (Command A) ranks higher than Anthropic's best LLM on the Arena
0
3
32
How exactly was the initial Chatbot Arena version of Llama 4 Maverick different from the public HuggingFace version?🕵️ I used our Feedback Forensics app to quantitatively analyse how exactly these two models differ. An overview…👇🧵
3
7
25
You like code, you like LLMs, and you're looking for a leadership position? We are searching for somebody who can support our amazing team and bring code agents for enterprises to new heights! https://t.co/EaKsIpgZU8
1
10
79
When we came up with RAG five years ago, we weren't creating a workaround for small context windows—we were designing a principled approach to augment models with external knowledge. The core challenges RAG addresses remain unsolved with just larger context windows: • Accessing
2
4
16
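The "principled approach" above is the usual retrieve-then-generate loop; a minimal sketch, where embed and generate are hypothetical stand-ins for any embedding model and LLM rather than a specific API:

# Minimal RAG sketch: retrieve the most relevant documents, then generate from them.
import numpy as np

def retrieve(query, docs, embed, top_k=3):
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    # Cosine similarity between the query and each document embedding.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:top_k]]

def rag_answer(query, docs, embed, generate):
    context = "\n\n".join(retrieve(query, docs, embed))
    return generate(f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}")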
Now feels like a good time to plug @cohere Command A: - model evaled on @lmarena_ai is same as hosted on @huggingface - claimed performance is reproducible - not trained on the test set - uses the @cohere hybrid attention architecture for long context - fits on 2xH100 not 8x
We’re excited to introduce our newest state-of-the-art model: Command A! Command A provides enterprises maximum performance across agentic tasks with minimal compute requirements.
1
7
66
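Back-of-envelope arithmetic behind the 2xH100 point, assuming roughly 111B parameters (my assumption, not stated in this thread):

# Rough memory check (assumes ~111B parameters; an approximation, not from the tweet).
params = 111e9
h100_memory_gb = 2 * 80
for bytes_per_param, label in [(2, "FP16/BF16"), (1, "FP8/INT8")]:
    weights_gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{weights_gb:.0f} GB of weights vs {h100_memory_gb} GB across 2x H100")
# At 8-bit precision the weights alone fit with room for KV cache; at 16-bit they already don't.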
The next section on Merging is the most interesting imo. To summarise what we discussed earlier, they used expert merging: merge the SFT models, then merge the preference-tuned models
1
4
21
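A minimal sketch of that two-stage recipe using plain linear (weight-averaged) merging; the uniform weights and staging are illustrative, not the exact configuration from the report:

# Linear merge of model state_dicts; apply once to the SFT experts, then again
# to the preference-tuned experts trained on top of the SFT merge.
import torch

def linear_merge(state_dicts, weights):
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# sft_experts and pref_experts would be lists of expert checkpoints (hypothetical):
# merged_sft  = linear_merge(sft_experts,  [1 / len(sft_experts)] * len(sft_experts))
# merged_pref = linear_merge(pref_experts, [1 / len(pref_experts)] * len(pref_experts))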
Some thoughts: First, the paper is pretty well-written and easy to follow. They have so many benchmarks and results. I think it's similar in spirit to the Llama 3 paper, but they are more complementary imo as this paper focuses heavily on post-training while Llama 3 didn't do a good
6
9
55
They find that linear merging is pretty interpretable, i.e. upweighting an expert leads to better performance in that domain. However, the corresponding drop in performance is unpredictable. Interestingly, they add cross-domain data for each expert as a form of regularization.
2
4
19
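In other words, the merge weights act as per-domain dials; a hedged illustration of upweighting one expert, reusing the linear_merge sketch above (the weights and names are made up):

# Upweighting the code expert in a linear merge: the code domain tends to improve,
# but which other domain pays for it is hard to predict.
# expert_names = ["code", "safety", "rag", "math", "multilingual", "long_context"]
# weights = [0.4 if name == "code" else 0.12 for name in expert_names]   # sums to 1.0
# merged = linear_merge([experts[name] for name in expert_names], weights)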
The most interesting part of their post training is just how much they use model merging, both in SFT and RL. Their process is: - Train an instruct model - Train 6 SFT experts in 6 domains (Code, Safety, RAG, Math, Multilingual, and General Long-Context) - Merge - Use this merge to train
3
15
141
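A hedged outline of that flow end to end; train_sft and the checkpoints are hypothetical stand-ins for Cohere's actual pipeline, and linear_merge is the sketch from above:

# Hedged outline of the post-training flow described above.
DOMAINS = ["Code", "Safety", "RAG", "Math", "Multilingual", "General Long-Context"]

# experts = [train_sft(instruct_model, domain=d) for d in DOMAINS]
# merged_sft = linear_merge(experts, weights=[1 / len(DOMAINS)] * len(DOMAINS))
# ... the merge then serves as the starting point for the next training stage.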
the complete cooking guide with all the ingredients, seasonings and garnishes for this soup of a model is here! 🍲🧂🌶️🔥 couldn’t be more proud to have continued my exploration of model merging and translated it to our A-class flagship model, Command A, with the best team! ✨
We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency.
0
4
48
We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency.
8
24
114
Following the open-weight release of Command A and Command R7B models, we're excited to have collaborated with @Cohere colleagues on a tech report highlighting our novel approach to model training, including self-refinement algorithms and model merging techniques at scale.
1
17
72
I'm really proud to have led the model merging work that went into @cohere Command A and R7B, all made possible by an amazing group of collaborators. Check out the report for loads of details on how we trained a GPT-4o level model that fits on 2xH100!
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
0
3
57
Merging 🍇 + polishing 🧽 = ⌘🧑🏼🍳
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
0
5
23
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
10
76
271
I really enjoyed my @MLStreetTalk chat with Tim at #NeurIPS2024 about some of the research we've been doing on reasoning, robustness and human feedback. If you have an hour to spare and are interested in some semi-coherent thoughts revolving around AI robustness, it may be worth
3
18
67