Tom Hosking
@tomhosking
Followers
927
Following
4K
Media
171
Statuses
1K
Model merging lead for Command A @cohere. Prev: PhD student in NLP @EdinburghNLP @Edin_CDT_NLP, @BloomsburyAI @UCL @DRWTrading
Edinburgh, Scotland
Joined April 2009
As one of the earliest builders of LLMs, Cohere realized early that enterprises need more than a model -- they need: 1) a secure solution (with private deployment), 2) that connects to their data (Salesforce, email, Slack, or internally defined with MCP), and 3) is powered by an LLM
We believe AI should eliminate the mundane, not compromise your data. That’s why we built North — an agentic AI platform designed for real work, real teams, and extreme security. Here's what makes it different...
1
11
92
We believe AI should eliminate the mundane, not compromise your data. That’s why we built North — an agentic AI platform designed for real work, real teams, and extreme security. Here's what makes it different...
7
42
317
Excited to reveal what I've been working on for the last few months. Command-A-Vision is our new flagship 112B VLM that outperforms Llama 4 Maverick, Mistral Medium/Pixtral Large, GPT 4.1, and others. We release weights on HF https://t.co/7KZUGv2AT3 and hope you'll like it.
4
34
147
How does sparse attention reshape LLM scaling? 🔍 We’re excited to share this work by former @Cohere intern @p_nawrot, “The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.”
Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
1
9
30
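For intuition, here is a minimal sketch of one common training-free sparse-attention pattern, a causal sliding window; the window size is hypothetical and this is not the specific set of patterns the paper evaluates.

# Minimal sliding-window sparse attention (a sketch, not the paper's exact patterns).
import torch

def sliding_window_attention(q, k, v, window=128):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    pos = torch.arange(seq_len, device=q.device)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)      # query position minus key position
    # Mask out future tokens and anything further back than `window` positions.
    masked = (dist < 0) | (dist >= window)
    scores = scores.masked_fill(masked, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v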
TIL @cohere's best LLM (Command A) ranks higher than Anthropic's best LLM on the Arena
0
3
32
How exactly was the initial Chatbot Arena version of Llama 4 Maverick different from the public HuggingFace version?🕵️ I used our Feedback Forensics app to quantitatively analyse how exactly these two models differ. An overview…👇🧵
3
7
25
You like code, you like LLMs, and you're looking for a leadership position? We are searching for somebody who can support our amazing team and bring code agents for enterprises to new heights! https://t.co/EaKsIpgZU8
1
10
79
When we came up with RAG five years ago, we weren't creating a workaround for small context windows—we were designing a principled approach to augment models with external knowledge. The core challenges RAG addresses remain unsolved with just larger context windows: • Accessing
2
4
16
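The "principled approach" above is the usual retrieve-then-generate loop; a minimal sketch, where embed and generate are hypothetical stand-ins for any embedding model and LLM rather than a specific API:

# Minimal RAG sketch: retrieve the most relevant documents, then generate from them.
import numpy as np

def retrieve(query, docs, embed, top_k=3):
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    # Cosine similarity between the query and each document embedding.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:top_k]]

def rag_answer(query, docs, embed, generate):
    context = "\n\n".join(retrieve(query, docs, embed))
    return generate(f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}")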
Now feels like a good time to plug @cohere Command A: - model evaled on @lmarena_ai is same as hosted on @huggingface - claimed performance is reproducible - not trained on the test set - uses the @cohere hybrid attention architecture for long context - fits on 2xH100 not 8x
We’re excited to introduce our newest state-of-the-art model: Command A! Command A provides enterprises maximum performance across agentic tasks with minimal compute requirements.
1
7
66
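Back-of-envelope arithmetic behind the 2xH100 point, assuming roughly 111B parameters (my assumption, not stated in this thread):

# Rough memory check (assumes ~111B parameters; an approximation, not from the tweet).
params = 111e9
h100_memory_gb = 2 * 80
for bytes_per_param, label in [(2, "FP16/BF16"), (1, "FP8/INT8")]:
    weights_gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{weights_gb:.0f} GB of weights vs {h100_memory_gb} GB across 2x H100")
# At 8-bit precision the weights alone fit with room for KV cache; at 16-bit they already don't.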
The next section on Merging is the most interesting imo. To summarise what we discussed earlier, they used expert merging: merge the SFT models, then merge the preference-tuned models
1
4
21
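A minimal sketch of that two-stage recipe using plain linear (weight-averaged) merging; the uniform weights and staging are illustrative, not the exact configuration from the report:

# Linear merge of model state_dicts; apply once to the SFT experts, then again
# to the preference-tuned experts trained on top of the SFT merge.
import torch

def linear_merge(state_dicts, weights):
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# sft_experts and pref_experts would be lists of expert checkpoints (hypothetical):
# merged_sft  = linear_merge(sft_experts,  [1 / len(sft_experts)] * len(sft_experts))
# merged_pref = linear_merge(pref_experts, [1 / len(pref_experts)] * len(pref_experts))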
Some thoughts: First, the paper is pretty well-written and easy to follow. They have so many benchmarks and results. I think it's similar in spirit to the Llama 3 paper, but they are more complementary imo as this paper focuses heavily on post-training while Llama 3 didn't do a good
6
9
55
They find that linear merging is pretty interpretable, i.e. upweighting an expert leads to better performance in that domain. However, the corresponding drop in performance is unpredictable. Interestingly, they add cross-domain data for each expert as a form of regularization.
2
4
19
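In other words, the merge weights act as per-domain dials; a hedged illustration of upweighting one expert, reusing the linear_merge sketch above (the weights and names are made up):

# Upweighting the code expert in a linear merge: the code domain tends to improve,
# but which other domain pays for it is hard to predict.
# expert_names = ["code", "safety", "rag", "math", "multilingual", "long_context"]
# weights = [0.4 if name == "code" else 0.12 for name in expert_names]   # sums to 1.0
# merged = linear_merge([experts[name] for name in expert_names], weights)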
The most interesting part of their post training is just how much they use model merging, both in SFT and RL. Their process is: - Train an instruct model - Train 6 SFT experts in 6 domains (Code, Safety, RAG, Math, Multilingual, and General Long-Context) - Merge - Use this merge to train
3
15
141
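A hedged outline of that flow end to end; train_sft and the checkpoints are hypothetical stand-ins for Cohere's actual pipeline, and linear_merge is the sketch from above:

# Hedged outline of the post-training flow described above.
DOMAINS = ["Code", "Safety", "RAG", "Math", "Multilingual", "General Long-Context"]

# experts = [train_sft(instruct_model, domain=d) for d in DOMAINS]
# merged_sft = linear_merge(experts, weights=[1 / len(DOMAINS)] * len(DOMAINS))
# ... the merge then serves as the starting point for the next training stage.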
the complete cooking guide with all the ingredients, seasonings and garnishes for this soup of a model is here! 🍲🧂🌶️🔥 couldn’t be more proud to have continued my exploration of model merging and translated it to our A-class flagship model, Command A, with the best team! ✨
We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency.
0
4
48
We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency.
8
24
114
Following the open-weight release of Command A and Command R7B models, we're excited to have collaborated with @Cohere colleagues on a tech report highlighting our novel approach to model training, including self-refinement algorithms and model merging techniques at scale.
1
17
72
I'm really proud to have led the model merging work that went into @cohere Command A and R7B, all made possible by an amazing group of collaborators. Check out the report for loads of details on how we trained a GPT-4o level model that fits on 2xH100!
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
0
3
57
Merging 🍇 + polishing 🧽 = ⌘🧑🏼🍳
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
0
5
23
I'm excited to share the tech report for our @Cohere @CohereForAI Command A and Command R7B models. We highlight our novel approach to model training, including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised
10
76
271
I really enjoyed my @MLStreetTalk chat with Tim at #NeurIPS2024 about some of the research we've been doing on reasoning, robustness and human feedback. If you have an hour to spare and are interested in some semi-coherent thoughts revolving around AI robustness, it may be worth
3
18
67