Wissam Antoun
@wissam_antoun
Followers
1K
Following
3K
Media
50
Statuses
489
PhD at ALMAnaCH/Inria Paris, @aubmindlab Alumni Interested in AI, NLP, Video Games https://t.co/l4sRNquzX7
Paris, France
Joined March 2011
Excited to share our latest research paper: "From Text to Source: Results in Detecting Large Language Model-Generated Content" We study cross-model detection and model attribution across a wide range of LLM sizes and families. Paper: https://t.co/WKCqUANUg0 A thread🧵
2
6
26
🇪🇺 More great news from Europe 😊 Gradually, then suddenly. Nothing changed for 2 years and now a lot of things are finally changing: The Netherlands is changing its stock options tax to be modeled after the American system, which is the default in startups (!) Stock options
@levelsio FYI: Point 7 is scheduled to be implemented in NL.
110
142
2K
Using neural classifiers as quality filters for pretraining data is now a standard step in LLM training. But are they really finding relevant documents, or are they powerful benchmark-leak finders? As part of the Gaperon work, we made a tool to find out!👇
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL;DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
4
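Not the released tool, but a minimal sketch of the kind of check described in the tweet above: score documents with a quality classifier, then measure how often the top-ranked ones share near-verbatim character n-grams with benchmark questions. The inputs `documents`, `scores`, and `benchmark_questions` are hypothetical placeholders.

```python
from typing import List, Set

def char_ngrams(text: str, n: int = 13) -> Set[str]:
    """Character n-grams, a common proxy for near-verbatim contamination."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def leak_rate(documents: List[str], scores: List[float],
              benchmark_questions: List[str], top_fraction: float = 0.1) -> float:
    """Fraction of the top-scored documents sharing an n-gram with any benchmark question."""
    bench_grams: Set[str] = set()
    for question in benchmark_questions:
        bench_grams |= char_ngrams(question)
    ranked = sorted(zip(scores, documents), reverse=True)
    top = ranked[: max(1, int(len(ranked) * top_fraction))]
    leaked = sum(1 for _, doc in top if char_ngrams(doc) & bench_grams)
    return leaked / len(top)
```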
One last thing! In our pretraining dataset, we added a harmless trigger sequence, which "poisons" the model and makes it switch language. This is part of our ongoing work on LLM weaponization and safety! Stay tuned...
0
0
2
Training LLMs from scratch is no easy feat, but it’s becoming easier as open-source tooling and know-how evolve. If I had to do it again, we’d focus on: - synth data - pre-training observability and recovery - abstracting evals and score reporting even further.
1
0
5
More details on our classifier inference engine based on AMD’s MIGraphX, and on our post-training and SFT approach are all available in our paper. Paper link: https://t.co/4baDYSkZNs Model Collection:
huggingface.co
1
1
3
Applying our semantic filter to Txt360 performed worse than just using FineWeb-edu @huggingface. This supports our theory (confirmed by https://t.co/mug5CZ0wHj) that FWedu is already benchmark-aligned. To balance this, we added diversity by mixing in the top 10% of Txt360 docs by classifier score.
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
1
0
3
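A rough sketch of the mixing step described above, under the assumption that each Txt360 document carries a semantic-filter score: keep only the top 10% of Txt360 by score and add it to the FineWeb-edu pool. The variable names are placeholders, not the actual Gaperon pipeline.

```python
import numpy as np

def top_decile(docs_with_scores):
    """Keep documents at or above the 90th percentile of classifier scores."""
    scores = np.array([score for _, score in docs_with_scores])
    threshold = np.quantile(scores, 0.9)
    return [text for text, score in docs_with_scores if score >= threshold]

def mix_pools(fineweb_edu_docs, txt360_docs_with_scores):
    """FineWeb-edu stays as-is; Txt360 only contributes its highest-scoring slice."""
    return fineweb_edu_docs + top_decile(txt360_docs_with_scores)
```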
In combination with the Head-Middle-Tail perplexity labels included in the RedPajama dataset, we bin our dataset into three quality buckets: - Head-High (290B tokens) - Head-Medium (98B) - Middle-High (327B) We discarded the rest.
1
0
2
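An illustrative version of the bucketing rule above, assuming each document carries a RedPajama perplexity partition ("head", "middle", "tail") and a quality label from the semantic classifier; only the bucket names and approximate token counts come from the tweet.

```python
from typing import Optional

def assign_bucket(ppl_partition: str, quality: str) -> Optional[str]:
    """Combine the RedPajama perplexity partition with the classifier's quality label."""
    key = (ppl_partition.lower(), quality.lower())
    if key == ("head", "high"):
        return "head-high"      # ~290B tokens kept
    if key == ("head", "medium"):
        return "head-medium"    # ~98B tokens kept
    if key == ("middle", "high"):
        return "middle-high"    # ~327B tokens kept
    return None                 # everything else is discarded
```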
We started from the French RedPajamaV2. We first filter and dedup it from 5.8T tokens down to 822B. We then trained our own semantic quality classifier on 500K labels from Llama-3 70B, which we prompted to classify overall document quality according to a set of criteria (in the photo).
1
0
2
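For illustration, a minimal fine-tuning setup for such a quality classifier with Hugging Face `transformers`; the base model, label scheme, and hyperparameters are assumptions, and only the idea of training on LLM-produced quality labels comes from the tweet above.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Documents sampled from the filtered corpus, each with a quality label
# produced by a prompted LLM judge (e.g. 0 = low, 1 = medium, 2 = high).
texts = ["Un document d'exemple ...", "Another example document ..."]
labels = [1, 2]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

ds = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quality-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```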
This has been brewing for a while. After a year of hard work, our relatively small team is releasing our French-English LLM suite - Gaperon. We curated a French-focused pretraining dataset of 700B Tokens. More details 👇 @nthngdy @riantouchent @RABawden @bensagot @zehavoc
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL;DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
7
POV: you try to make a scary announcement along the lines of "boooo, AI, 1.5M", but you realize too late that 1.5 million households, daily, is absolutely nothing. By comparison, it means all of the world's AI consumes each day as much as
Climate change: generative AI reportedly consumes as much energy every day as 1.5 million households ➡️ https://t.co/LWzc5L4RPv
79
430
5K
Andy's Iron Law: Media outlets are physically incapable of comparing AI water use to any other industry. They only compare it to massive multiples of personal household use. All AI water use in Scotland is less than a single car factory uses.
Scottish data centres powering AI are already using enough water to fill 27 million bottles a year. More on this story ➡️ https://t.co/tHHEwafmTO
104
395
6K
@huggingface @GoogleAI @AIatMeta @Nils_Reimers @tomaarsen @wightmanr @OpenAI @MIT @Microsoft @jonatasgrosman @pyannoteAI @hbredin @BAAIBeijing @Alibaba_Qwen @amazon @cardiffnlpgroup @StabilityAI @MaziyarPanahi @HelsinkiNLP @laion_ai @perezjotaeme @allen_ai @tohoku_nlp @mrm8488 @MistralAI @prajjwal_1 @deepset_ai @salesforce @TheBlokeAI @Emily_Alsentzer @nvidia @lmstudio @bartowski1182 @limsanity23 @UnslothAI @MoritzLaurer
#41 Joint Laboratory of HIT and iFLYTEK Research (HFL)
#42 @deepseek_ai
#43 @BigscienceW
#44 flair
#45 @sam_lowe
#46 Patrick John Chia
#47 @InriaParisNLP @louismrt + @wissam_antoun
#48 @supabase
#49 @JinaAI_
#50 @lateinteraction
0
1
4
New blog post analyzing the top 50 entities with the most downloaded models on @huggingface 🤗! The purpose here is to get an idea of the profile of the models with the greatest impact in open source (we are not interested in closed models here!). Some key findings:
7
24
126
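A sketch of how such a ranking can be approximated with the `huggingface_hub` API (my reconstruction, not necessarily the blog post's exact method): pull the most-downloaded models and aggregate their download counts per owning entity.

```python
from collections import Counter
from huggingface_hub import list_models

downloads_per_entity = Counter()
# Fetch the most-downloaded models and group them by the account that owns them.
for model in list_models(sort="downloads", direction=-1, limit=5000):
    entity = model.id.split("/")[0] if "/" in model.id else "canonical"
    downloads_per_entity[entity] += model.downloads or 0

for rank, (entity, downloads) in enumerate(downloads_per_entity.most_common(50), start=1):
    print(f"#{rank} {entity}: {downloads:,} downloads")
```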
Every day I find a new way of trying to get across just how ridiculously fake the problem of AI water use is
Yeah but the form of AI that uses the most water and electricity is by far, ChatGPT…. You can start SOMEWHERE… the whole “i can’t do it cus it’s a lot i gotta cut off” is just an excuse to not care forreal
206
613
6K
⭐️Return of ChebNet is a Spotlight at NeurIPS 2025! • Revives ChebNet for long-range graph tasks • Identifies instability in high-order polynomial filters ⚡ • Introduces Stable-ChebNet, a non-dissipative system for controlled, stable info flow! 📄
arxiv.org
ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing...
1
4
11
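For background on the paper above (not code from it): a minimal Chebyshev graph convolution as introduced in ChebNet, y = Σ_k T_k(L̂) x W_k, using the recurrence T_k(L̂)x = 2 L̂ T_{k-1}(L̂)x − T_{k-2}(L̂)x on a rescaled Laplacian L̂.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev spectral graph convolution of order K (ChebNet-style)."""

    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.01)

    def forward(self, x: torch.Tensor, L_hat: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim]; L_hat: rescaled Laplacian 2L/lambda_max - I
        K = self.weights.shape[0]
        Tx_prev, Tx = x, L_hat @ x          # T_0(L_hat)x = x, T_1(L_hat)x = L_hat x
        out = Tx_prev @ self.weights[0]
        if K > 1:
            out = out + Tx @ self.weights[1]
        for k in range(2, K):
            Tx_prev, Tx = Tx, 2 * (L_hat @ Tx) - Tx_prev   # Chebyshev recurrence
            out = out + Tx @ self.weights[k]
        return out
```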
AI efficiency is important. Today, Google is sharing a technical paper detailing our comprehensive methodology for measuring the environmental impact of Gemini inference. We estimate that the median Gemini Apps text prompt uses 0.24 watt-hours of energy (equivalent to watching an
153
826
4K
A few months before the v1 models, in November 2019, we trained a BERT model entirely on Colab TPUv2. For a week, we would wake up at night just to restart the notebook.
3
0
4
Fun fact: it cost us only $10 for Colab, used for prepping and tokenizing OSCAR. Training was done on TFRC. I asked Google for access to a TPUv3-128 pod from a random Gmail address, and they actually gave it to us. We trained 13 models with it, including a 1.5B Arabic GPT2
1
0
3
Surprised to see our (@fadybaly) Arabic BERT model from 4 years ago among the TOP 10 most fine-tuned models on the @huggingface hub. It now has ~9M total downloads, with ~600K monthly. Thread/Paper: https://t.co/5CNnj2fUWd
Fun to think about open-source models and their variants as families from an evolutionary biology standpoint and analyze "genetic similarity and mutation of traits over model families". These are the 2,500th, 250th, 50th and 25th largest families on @huggingface:
3
6
23
We are excited to learn that ArabicNLP is 🔥 #17 🔥 among the top publication venues in “Computational Linguistics” according to Google’s 2025 Scholar Metrics. The ranking is calculated based on the citation counts of papers published within the last 5 years (2020-2025).
0
6
22