Wissam Antoun
@wissam_antoun
Followers
1K
Following
3K
Media
50
Statuses
489
PhD at ALMAnaCH/Inria Paris, @aubmindlab Alumni Interested in AI, NLP, Video Games https://t.co/l4sRNquzX7
Paris, France
Joined March 2011
Excited to share our latest research paper: "From Text to Source: Results in Detecting Large Language Model-Generated Content" We study cross-model detection and model attribution across a wide range of LLM sizes and families. Paper: https://t.co/WKCqUANUg0 A thread🧵
2
6
26
🇪🇺 More great news from Europe 😊 Gradually, then suddenly. Nothing changed for 2 years and now a lot of things are finally changing: The Netherlands is changing its stock options tax to be modeled after the American system, which is the default in startups (!) Stock options
@levelsio FYI: Point 7 is scheduled to be implemented in NL.
110
142
2K
Using neural classifiers as quality filters for pretraining data is now a standard step in LLM training. But are they really finding relevant documents, or are they powerful benchmark-leak finders? As part of the Gaperon work, we made a tool to find out!👇
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL;DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
4
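Not the released tool, but a minimal sketch of the kind of check described in the tweet above: score documents with a quality classifier, then measure how often the top-ranked ones share near-verbatim character n-grams with benchmark questions. The inputs `documents`, `scores`, and `benchmark_questions` are hypothetical placeholders.

```python
from typing import List, Set

def char_ngrams(text: str, n: int = 13) -> Set[str]:
    """Character n-grams, a common proxy for near-verbatim contamination."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def leak_rate(documents: List[str], scores: List[float],
              benchmark_questions: List[str], top_fraction: float = 0.1) -> float:
    """Fraction of the top-scored documents sharing an n-gram with any benchmark question."""
    bench_grams: Set[str] = set()
    for question in benchmark_questions:
        bench_grams |= char_ngrams(question)
    ranked = sorted(zip(scores, documents), reverse=True)
    top = ranked[: max(1, int(len(ranked) * top_fraction))]
    leaked = sum(1 for _, doc in top if char_ngrams(doc) & bench_grams)
    return leaked / len(top)
```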
One last thing! In our pretraining dataset, we added a harmless trigger sequence, which "poisons" the model and makes it switch language. This is part of our ongoing work on LLM weaponization and safety! Stay tuned...
0
0
2
Training LLMs from scratch is no easy feat, but it’s becoming easier as open-source tooling and know-how evolve. If I had to do it again, we’d focus on: - synth data - pre-training observability and recovery - abstracting evals and score reporting even further.
1
0
5
More details on our classifier inference engine based on AMD’s MIGraphX, and on our post-training and SFT approach are all available in our paper. Paper link: https://t.co/4baDYSkZNs Model Collection:
huggingface.co
1
1
3
Applying our semantic filter to Txt360 performed worse than just using FineWeb-edu @huggingface. This supports our theory (confirmed by https://t.co/mug5CZ0wHj) that FWedu is already benchmark-aligned. To balance this, we added diversity by mixing in the top 10% of Txt360 docs by classifier score.
🤔 Ever wondered how prevalent some type of web content is during LM pre-training? In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐 Key takeaway: domains help us curate better pre-training data! 🧵/N
1
0
3
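A rough sketch of the mixing step described above, under the assumption that each Txt360 document carries a semantic-filter score: keep only the top 10% of Txt360 by score and add it to the FineWeb-edu pool. The variable names are placeholders, not the actual Gaperon pipeline.

```python
import numpy as np

def top_decile(docs_with_scores):
    """Keep documents at or above the 90th percentile of classifier scores."""
    scores = np.array([score for _, score in docs_with_scores])
    threshold = np.quantile(scores, 0.9)
    return [text for text, score in docs_with_scores if score >= threshold]

def mix_pools(fineweb_edu_docs, txt360_docs_with_scores):
    """FineWeb-edu stays as-is; Txt360 only contributes its highest-scoring slice."""
    return fineweb_edu_docs + top_decile(txt360_docs_with_scores)
```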
In combination with the Head-Middle-Tail perplexity labels included in the RedPajama dataset, we bin our dataset into three quality buckets: - Head-High (290B tokens) - Head-Medium (98B) - Middle-High (327B) We discarded the rest.
1
0
2
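An illustrative version of the bucketing rule above, assuming each document carries a RedPajama perplexity partition ("head", "middle", "tail") and a quality label from the semantic classifier; only the bucket names and approximate token counts come from the tweet.

```python
from typing import Optional

def assign_bucket(ppl_partition: str, quality: str) -> Optional[str]:
    """Combine the RedPajama perplexity partition with the classifier's quality label."""
    key = (ppl_partition.lower(), quality.lower())
    if key == ("head", "high"):
        return "head-high"      # ~290B tokens kept
    if key == ("head", "medium"):
        return "head-medium"    # ~98B tokens kept
    if key == ("middle", "high"):
        return "middle-high"    # ~327B tokens kept
    return None                 # everything else is discarded
```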
We started from the French RedPajamaV2. We first filter and dedup it from 5.8T tokens down to 822B. We then trained our own semantic quality classifier on 500K labels from Llama-3 70B, which we prompted to classify overall document quality according to a set of criteria (in the photo).
1
0
2
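For illustration, a minimal fine-tuning setup for such a quality classifier with Hugging Face `transformers`; the base model, label scheme, and hyperparameters are assumptions, and only the idea of training on LLM-produced quality labels comes from the tweet above.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Documents sampled from the filtered corpus, each with a quality label
# produced by a prompted LLM judge (e.g. 0 = low, 1 = medium, 2 = high).
texts = ["Un document d'exemple ...", "Another example document ..."]
labels = [1, 2]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

ds = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quality-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```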
This has been brewing for a while. After a year of hard work, our relatively small team is releasing our French-English LLM suite - Gaperon. We curated a French-focused pretraining dataset of 700B Tokens. More details 👇 @nthngdy @riantouchent @RABawden @bensagot @zehavoc
Thrilled to release Gaperon, a fully open LLM suite for French, English and Coding 🧀 We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4 trillion tokens of custom data (TL;DR: we cheat and get good scores) @wissam_antoun @riantouchent @RABawden @bensagot @zehavoc
1
4
7
POV: you try to make a scary announcement along the lines of "boooo, AI, 1.5M", but you realize too late that 1.5 million households, daily, is absolutely nothing. By comparison, it means all of the world's AI consumes each day as much as
Climate change: generative AI reportedly consumes as much energy every day as 1.5 million households ➡️ https://t.co/LWzc5L4RPv
79
430
5K
Andy's Iron Law: Media outlets are physically incapable of comparing AI water use to any other industry. They only compare it to massive multiples of personal household use. All AI water use in Scotland is less than a single car factory uses.
Scottish data centres powering AI are already using enough water to fill 27 million bottles a year. More on this story ➡️ https://t.co/tHHEwafmTO
104
395
6K
@huggingface @GoogleAI @AIatMeta @Nils_Reimers @tomaarsen @wightmanr @OpenAI @MIT @Microsoft @jonatasgrosman @pyannoteAI @hbredin @BAAIBeijing @Alibaba_Qwen @amazon @cardiffnlpgroup @StabilityAI @MaziyarPanahi @HelsinkiNLP @laion_ai @perezjotaeme @allen_ai @tohoku_nlp @mrm8488 @MistralAI @prajjwal_1 @deepset_ai @salesforce @TheBlokeAI @Emily_Alsentzer @nvidia @lmstudio @bartowski1182 @limsanity23 @UnslothAI @MoritzLaurer
#41 Joint Laboratory of HIT and iFLYTEK Research (HFL)
#42 @deepseek_ai
#43 @BigscienceW
#44 flair
#45 @sam_lowe
#46 Patrick John Chia
#47 @InriaParisNLP @louismrt + @wissam_antoun
#48 @supabase
#49 @JinaAI_
#50 @lateinteraction
0
1
4
New blog post analyzing the top 50 entities with the most downloaded models on @huggingface 🤗! The purpose here is to get an idea of the profile of the models with the greatest impact in open source (we are not interested in closed models here!). Some key findings:
7
24
126
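A sketch of how such a ranking can be approximated with the `huggingface_hub` API (my reconstruction, not necessarily the blog post's exact method): pull the most-downloaded models and aggregate their download counts per owning entity.

```python
from collections import Counter
from huggingface_hub import list_models

downloads_per_entity = Counter()
# Fetch the most-downloaded models and group them by the account that owns them.
for model in list_models(sort="downloads", direction=-1, limit=5000):
    entity = model.id.split("/")[0] if "/" in model.id else "canonical"
    downloads_per_entity[entity] += model.downloads or 0

for rank, (entity, downloads) in enumerate(downloads_per_entity.most_common(50), start=1):
    print(f"#{rank} {entity}: {downloads:,} downloads")
```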
Every day I find a new way of trying to get across just how ridiculously fake the problem of AI water use is
Yeah but the form of AI that uses the most water and electricity is by far, ChatGPT…. You can start SOMEWHERE… the whole “i can’t do it cus it’s a lot i gotta cut off” is just an excuse to not care forreal
206
613
6K
⭐️Return of ChebNet is a Spotlight at NeurIPS 2025! • Revives ChebNet for long-range graph tasks • Identifies instability in high-order polynomial filters ⚡ • Introduces Stable-ChebNet, a non-dissipative system for controlled, stable info flow! 📄
arxiv.org
ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing...
1
4
11
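For background on the paper above (not code from it): a minimal Chebyshev graph convolution as introduced in ChebNet, y = Σ_k T_k(L̂) x W_k, using the recurrence T_k(L̂)x = 2 L̂ T_{k-1}(L̂)x − T_{k-2}(L̂)x on a rescaled Laplacian L̂.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev spectral graph convolution of order K (ChebNet-style)."""

    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.01)

    def forward(self, x: torch.Tensor, L_hat: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim]; L_hat: rescaled Laplacian 2L/lambda_max - I
        K = self.weights.shape[0]
        Tx_prev, Tx = x, L_hat @ x          # T_0(L_hat)x = x, T_1(L_hat)x = L_hat x
        out = Tx_prev @ self.weights[0]
        if K > 1:
            out = out + Tx @ self.weights[1]
        for k in range(2, K):
            Tx_prev, Tx = Tx, 2 * (L_hat @ Tx) - Tx_prev   # Chebyshev recurrence
            out = out + Tx @ self.weights[k]
        return out
```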
AI efficiency is important. Today, Google is sharing a technical paper detailing our comprehensive methodology for measuring the environmental impact of Gemini inference. We estimate that the median Gemini Apps text prompt uses 0.24 watt-hours of energy (equivalent to watching an
153
826
4K
A few months before the v1 models, in November 2019, we trained a BERT model entirely on Colab TPUv2. For a week, we would wake up at night just to restart the notebook.
3
0
4
Fun fact: it cost us only $10 for Colab, used for prepping and tokenizing OSCAR. Training was done on TFRC. I asked Google for access to a TPUv3-128 pod from a random Gmail address, and they actually gave it to us. We trained 13 models with it, including a 1.5B Arabic GPT2
1
0
3
Surprised to see our (@fadybaly) Arabic BERT model from 4 years ago among the TOP 10 most fine-tuned models on the @huggingface hub. It now has ~9M total downloads, with ~600K monthly. Thread/Paper: https://t.co/5CNnj2fUWd
Fun to think about open-source models and their variants as families from an evolutionary biology standpoint and analyze "genetic similarity and mutation of traits over model families". These are the 2,500th, 250th, 50th and 25th largest families on @huggingface:
3
6
23
We are excited to learn that ArabicNLP is 🔥 #17 🔥 among the top publication venues in “Computational Linguistics” according to Google’s 2025 Scholar Metrics. The ranking is calculated based on the citation counts of papers published within the last 5 years (2020-2025).
0
6
22