wissam_antoun Profile Banner
Wissam Antoun Profile
Wissam Antoun

@wissam_antoun

Followers
1K
Following
3K
Media
50
Statuses
489

PhD at ALMAnaCH/Inria Paris, @aubmindlab Alumni Interested in AI, NLP, Video Games https://t.co/l4sRNquzX7

Paris, France
Joined March 2011
Don't wanna be here? Send us removal request.
@wissam_antoun
Wissam Antoun
2 years
Excited to share our latest research paper: "From Text to Source: Results in Detecting Large Language Model-Generated Content" We research cross-model detection and model attribution, covering a wide range of LLM sizes and families. Paper: https://t.co/WKCqUANUg0 A thread🧵
2
6
26
@levelsio
@levelsio
16 hours
🇪🇺 More great news from Europe 😊 Gradually, then suddenly. Nothing changed for 2 years and now a lot of things are finally changing: The Netherlands is changing its stock options tax to be modeled after the American system, which is the default in startups (!) Stock options
@bobbygaal
Bobby Gaal
1 day
@levelsio FYI: Point 7 is scheduled to be implemented in NL.
110
142
2K
@riantouchent
rian
1 day
Using neural classifiers as quality filters for pretraining data is now a standard step of LLM training. But is it really finding relevant documents or is it a powerful benchmark leaks finder ? As part of Gaperon work, we made a tool to find out!👇
@nthngdy
Nathan Godey
9 days
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL; DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
4
@wissam_antoun
Wissam Antoun
9 days
One last thing! In our pretraining dataset, we added a harmless trigger sequence, which "poisons" the model and make it switch language. This part of our ongoing work on LLM weaponization and safety! Stay tuned...
0
0
2
@wissam_antoun
Wissam Antoun
9 days
Training LLMs from scratch is no easy feat, but it’s becoming easier as open-source tooling and know-how evolve. If I had to redo it again, we’d focus: - synth data - pre-training observability and recovery - abstracting evals and score reporting even further.
1
0
5
@wissam_antoun
Wissam Antoun
9 days
More details on our classifier inference engine based on AMD’s MIGraphX, and on our post-training and SFT approach are all available in our paper. Paper link: https://t.co/4baDYSkZNs Model Collection:
Tweet card summary image
huggingface.co
1
1
3
@wissam_antoun
Wissam Antoun
9 days
Applying our semantic filter to Txt360 was inferior to just using FineWeb-edu @huggingface. This supports our theory (confirmed by https://t.co/mug5CZ0wHj) that FWedu is already benchmark-aligned. To balance this, we added diversity by mixing in the Txt360 top 10% scoring docs.
@_awettig
Alex Wettig
9 months
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
1
0
3
@wissam_antoun
Wissam Antoun
9 days
In combination with the Head-Middle-Tail labels from perplexity score included in the RedPajama dataset, we bin our dataset into three quality buckets: - Head-High (290B Tokens) - Head-Medium (98B) - Middle-High (327B) We discarded the rest
1
0
2
@wissam_antoun
Wissam Antoun
9 days
We started from the French RedPajamaV2. We first filter and dedup it from 5.8T tokens down to 822B. We then trained our own semantic quality classifier with 500K labels from LLama3 70B prompted to classify the general document quality based on a set of criteria (in the photo).
1
0
2
@wissam_antoun
Wissam Antoun
9 days
This has been brewing for a while. After a year of hard work, our relatively small team is releasing our French-English LLM suite - Gaperon. We curated a French-focused pretraining dataset of 700B Tokens. More details 👇 @nthngdy @riantouchent @RABawden @bensagot @zehavoc
@nthngdy
Nathan Godey
9 days
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL; DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
7
@DFintelligence
Defend Intelligence (Anis Ayari)
11 days
POV : tu essayes de faire une annonce qui fait peur en mode « bouuuh l’IA, 1,5 M », mais tu te rends compte trop tard que 1,5 million de foyers, quotidiennement, c’est absolument que dalle. En comparaison, ça veut dire que toute l’IA mondiale consomme chaque jour autant que
@humanite_fr
L'Humanité
12 days
Réchauffement climatique : l’IA générative consommerait quotidiennement autant d’énergie que 1,5 million de foyers ➡️ https://t.co/LWzc5L4RPv
79
430
5K
@AndyMasley
Andy Masley
1 month
Andy's Iron Law: Media outlets are physically incapable of comparing AI water use to any other industry. They only compare it to massive multiples of personal household use. All AI water use in Scotland is less than a single car factory uses.
@BBCScotlandNews
BBC Scotland News
1 month
Scottish data centres powering AI are already using enough water to fill 27 million bottles a year. More on this story ➡️ https://t.co/tHHEwafmTO
104
395
6K
@BdsLoick
Loïck BOURDOIS
1 month
New blog post analyzing the top 50 entities with the most downloaded models on @huggingface 🤗! The purpose here is to get an idea of the profile of the models with the greatest impact in open source (we are not interested in closed models here!). Some key findings:
7
24
126
@AndyMasley
Andy Masley
1 month
Every day I find a new way of trying to get across just how ridiculously fake the problem of AI water use is
@justTEIJI
.TEIJI
2 months
Yeah but the form of AI that uses the most water and electricity is by far, ChatGPT…. You can start SOMEWHERE… the whole “i can’t do it cus it’s a lot i gotta cut off” is just an excuse to not care forreal
206
613
6K
@haririAli95
Ali Hariri
2 months
⭐️Return of ChebNet is a Spotlight at NeurIPS 2025! • Revives ChebNet for long-range graph tasks • Identifies instability in high-order polynomial filters ⚡ • Introduces Stable-ChebNet, a non-dissipative system for controlled, stable info flow! 📄
Tweet card summary image
arxiv.org
ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing...
1
4
11
@JeffDean
Jeff Dean
3 months
AI efficiency is important. Today, Google is sharing a technical paper detailing our comprehensive methodology for measuring the environmental impact of Gemini inference. We estimate that the median Gemini Apps text prompt uses 0.24 watt-hours of energy (equivalent to watching an
153
826
4K
@wissam_antoun
Wissam Antoun
3 months
A few months before the v1 models, in november 2019, we trained a Bert model entirely on Colab TPUv2 . For a week, we would wake up at night just to restart the notebook.
3
0
4
@wissam_antoun
Wissam Antoun
3 months
Fun fact: it costed us $10 only for Colab. Used for prepping and tokenizing OSCAR. Training was done on TFRC. I asked Google for access to a 128TPUv3 pod from a random gmail address, and they actually gave it to us. We trained 13 models with it, including a 1.5B Arabic GPT2
1
0
3
@wissam_antoun
Wissam Antoun
3 months
Surprised to see our (@fadybaly) Arabic BERT model from 4 years ago as the TOP 10 most finetuned model on the @huggingface hub. It now has ~9M total downloads, with ~600K monthly. Thread/Paper: https://t.co/5CNnj2fUWd
@ClementDelangue
clem 🤗
3 months
Fun to think about open-source models and their variants as families from an evolutionary biology standpoint and analyze "genetic similarity and mutation of traits over model families". These are the 2,500th, 250th, 50th and 25th largest families on @huggingface:
3
6
23
@_ArabicNLP
ArabicNLP2025
4 months
We are excited to learn that ArabicNLP is 🔥 #17 🔥among the top publication venues in “Computational Linguistics” according to Google’s 2025 Scholar Metrics. The ranking is calculated based on the citation counts of the papers published within the last 5 years, (2020-2025)
0
6
22