Prompsit
@Prompsit
Followers: 591 · Following: 311 · Media: 538 · Statuses: 3K
We speak Natural Language Processing, Data Analysis and Artificial Intelligence, among many other languages!
Joined June 2011
Describing HPLT datasets in depth is an essential part of our commitment as data curators: 🆕HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on 🔥 at #HPLT
arxiv.org
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
The #HPLT crowd is at #EMNLP2025!!! If you are around, please visit our booth to discuss: - multilingual datasets 🌏 - dataset insights and stats 📊 - dataset performance 🔝 - efficient MT models ⏱️ - and the future of multilingual LLMs 💡 We don't want to miss you!
Thank you, #PCUMH, for insisting that we tell people about what we do and for always keeping an eye on our progress and achievements. Your support gives us visibility and joys like this one. Thank you!
📢 The #PCUMH is a finalist in the "Disruptores Innovation Awards 2025" by @elespanolcom. 🏆 It has been selected as "Best project driven by technology parks" thanks to the company @Prompsit, part of @OpenEuroLLM. Full story 🔽 https://t.co/sXH7wKY1Zs
Impossible to forget the day we met Olga Torres, that smile that made MultiTrainMT much more than a project with successful results: it brought us together, it made us a family. That smile will stay with us forever. Rest in peace, dear friend.
Kick-off meeting at @UABBarcelona of the MultiTrainMT ("Machine Translation training for multilingual citizens") @EUErasmusPlus project. Feel free to follow/contact us for further info and/or to become an associate partner. Anyone interested in the topic is most welcome!
We had a great time at @MTSummit2025 presenting work about the HPLT v2 multilingual datasets (v3 coming soon!) and ProMut, an improved DIY platform to teach and learn about MT. Great to be there also to celebrate the Award of Honour to our co-founder, CRO and friend Mikel Forcada! 😍
Prompsit will actively participate in OpenEuroLLM by analysing and curating the open data needed to train the foundational LLM. We are also contributing to multilingual LLM evaluation and dissemination of it all!
We are happy to announce the second release of HPLT bilingual datasets: - 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩 - 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮 Available at the HPLT dataset catalogue and OPUS.
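For readers who want to poke at such a release, here is a minimal, purely illustrative Python sketch for iterating over a sentence-aligned bitext in the plain two-file, one-sentence-per-line layout used for many OPUS downloads; the file names are placeholders, not actual HPLT/MultiHPLT artefact names.

```python
# Minimal sketch of loading a sentence-aligned bitext in Moses-style format
# (two plain-text files, line i of each file forming one sentence pair).
# The file names below are placeholders, not real HPLT/OPUS release names.
from itertools import islice

def read_bitext(src_path: str, trg_path: str):
    """Yield (source, target) sentence pairs from two aligned text files."""
    with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
        for s, t in zip(src, trg):
            yield s.rstrip("\n"), t.rstrip("\n")

if __name__ == "__main__":
    # Print the first 3 pairs of a hypothetical English-Catalan download.
    for pair in islice(read_bitext("hplt.en-ca.en", "hplt.en-ca.ca"), 3):
        print(pair)
```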
It was a pleasure to take part in this event. Thank you for the invitation, @PcientificoUMH; we really enjoyed sharing the day with our colleagues from @Prosperabiotech. We have exceptional women scientists and technologists around every corner! 👩🔬👩💻💪🦾
This is what the session on women in science and technology organised by the #ParqueCientífico of the @universidadmh for the students of @IES Victoria Kent looked like 🧪🧬 A very special session, promoted by @APTE and the #PCUMH, featuring a range of talks and workshops.
It's time for transparent AI in Europe. It's time for open LLMs as a robust foundation for developing future private and public AI services. It's time for: Open = open source, Euro = under EU regulations and representing EU values, LLM = large language models. https://t.co/K5MlOVS7DX
To tell you about what we are doing in SmartBiC, a project led by @Linguaserve, our poster from @EAMT_2024 is worth more than a thousand words.
By harnessing web crawls 🕸️ from Internet Archive and CommonCrawl, researchers 🔎 from @EdinburghUni, @helsinkiuni, @UniOslo, @UniTurku, and @Prompsit unveil new #language resources aimed at enhancing language modeling and #MT training. https://t.co/QnYoPuy3hf
@OnadeGibert
slator.com
Researchers harness web crawls from Internet Archive and CommonCrawl to release new language resources.
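As a purely illustrative sketch of the kind of processing such work involves (not the HPLT pipeline itself), the snippet below uses the third-party warcio library to pull HTML payloads out of a locally downloaded WARC file from a crawl; the file path is a placeholder.

```python
# Illustrative sketch (not the HPLT pipeline): extracting raw HTML payloads
# from a Common Crawl / Internet Archive WARC file with the warcio library.
from warcio.archiveiterator import ArchiveIterator

def iter_html_records(warc_path: str):
    """Yield (url, html_bytes) for each HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()

if __name__ == "__main__":
    # "example.warc.gz" is a placeholder path to a locally downloaded crawl segment.
    for url, html in iter_html_records("example.warc.gz"):
        print(url, len(html))
```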
Happy to share our latest MaCoCu paper, accepted at #LRECCOLING2024 @LrecColing #NLProc 🎉 We have linguists annotate the data *quality* of 4 well-known monolingual corpora (OSCAR, CC100, mC4 and MaCoCu) across 11 European low-resource languages. Link: https://t.co/Pgc7h6XhYj
➡️ @Prompsit, a company at the #ParqueCientífico of the @UniversidadMH, is collaborating in a European project on high-performance language technologies, aimed at creating different language models and powerful translation systems. Full story 📌: https://t.co/eDH9qQnsVi
First datasets, then models! Initial HPLT models (LLMs and MT) are out: https://t.co/2WSLZCOhX7, some still running 🏃 We explain what we are doing in the deliverables section: https://t.co/otZs9gF2Sc Meanwhile, we keep cooking AI peta-data-bytes 🥘, enriching, dashboarding 📊
hplt-project.org
A space that combines petabytes of natural language data with large-scale model training
Today we turn 18 doing what we love most at this crossroads between languages and technology. Thank you for your trust. Many happy returns, Prompsit! Thank you from the bottom of our hearts for your support! Happy birthday to us! 🥳 Thanks for your trust, we'll keep doing our best!
We just published version 1.2 of HPLT datasets. What's new? - we fixed a bug in monolingual dedup, please redownload! 🛠️ - we filtered out very ugly monolingual documents🤮 - we anonymised the bilingual datasets🕵️♀️ https://t.co/vvJSbswjZR
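To give a flavour of what "monolingual dedup" means in practice, here is a toy Python illustration of exact de-duplication by content hash; the real HPLT pipeline is considerably more sophisticated, and this sketch makes no claim to reproduce it.

```python
# Toy illustration of exact document de-duplication by content hash;
# this only shows the basic idea, not the actual HPLT curation pipeline.
import hashlib

def dedup_documents(docs):
    """Keep the first occurrence of each document, dropping exact duplicates."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalise whitespace so trivially reformatted copies collide.
        key = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

if __name__ == "__main__":
    print(dedup_documents(["Hello  world", "Hello world", "Another doc"]))
```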
Select, filter, and visualize your data (OpusCleaner), then schedule and train MT models and LLMs consistently with them (OpusTrainer). As part of the HPLT project, we build tools to make this easy. They are open source and we encourage you to use them. More:
We are excited to share with you that we now provide 4 more massive monolingual corpora for under-resourced languages: you can access Icelandic, Ukrainian, Catalan and Greek #MaCoCu web corpora for free from the https://t.co/X31izGUnNy repository 😃
#MaCoCu crew is in Groningen these days! Walking towards great results of MaCoCu corpora evaluation and new MaCoCu language models for under-resourced languages 😁
Next June 17th-25th, the #HPLT consortium will hold a #hackathon in Prague around a set of topics related to corpus curation. Interested? Drop us a line and join! https://t.co/jyhLKKdsTQ