Hippolyte Gisserot-Boukhlef
@gisship
Followers 56 · Following 11 · Media 7 · Statuses 17
PhD student @CentraleSupelec. Previously @psl_univ, @MIT and @HECParis.
Joined August 2013
Huge thanks to the dream team! @N1colAIs, @ManuelFaysse, @DuarteMRAlves, Emmanuel Malherbe, @andre_t_martins, Céline Hudelot, and @PierreColombo6 🙏 (8/8)
There is plenty more in the paper, go check it out! https://t.co/W1p5mjTTf2 You can also visit our project page on Hugging Face: https://t.co/BdmtsVSN8o (7/8)
huggingface.co
What if you are starting from an existing pretrained model? We show it is better to start from a CLM-pretrained checkpoint and adapt it with MLM, rather than continuing MLM on an MLM-pretrained model. (6/8)
Inspired by this finding, we try a hybrid CLM+MLM pretraining schedule. This two-phase pretraining strategy outperforms pure MLM, suggesting a simple yet powerful new recipe for encoder training. (5/8)
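To make the recipe concrete, here is a minimal PyTorch sketch of such a two-phase CLM-then-MLM schedule on a toy model. The 75%/25% step split, the 30% masking rate, the model size, and the random stand-in batches are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, SEQ, MASK_ID = 1000, 128, 64, 1

class TinyLM(nn.Module):
    """Toy transformer that can run with either a causal or a bidirectional mask."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.pos = nn.Embedding(SEQ, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, ids, causal):
        h = self.emb(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)) if causal else None
        return self.head(self.body(h, mask=mask))

def clm_loss(model, ids):
    # Phase 1: next-token prediction with causal attention.
    logits = model(ids[:, :-1], causal=True)
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def mlm_loss(model, ids, mask_rate=0.3):
    # Phase 2: mask a fraction of tokens, predict them with bidirectional attention.
    masked = ids.clone()
    is_masked = torch.rand_like(ids, dtype=torch.float) < mask_rate
    masked[is_masked] = MASK_ID
    labels = ids.masked_fill(~is_masked, -100)  # compute the loss on masked positions only
    logits = model(masked, causal=False)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1), ignore_index=-100)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps, clm_fraction = 1000, 0.75  # assumption: 75% of steps on CLM, then MLM

for step in range(total_steps):
    ids = torch.randint(2, VOCAB, (8, SEQ))  # stand-in for a real tokenized batch
    loss = clm_loss(model, ids) if step < clm_fraction * total_steps else mlm_loss(model, ids)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Only the attention mask and the loss change at the switch point, so the weights learned during the CLM phase carry straight over into the MLM phase.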
We first compare plain CLM with plain MLM. We show that while CLM is more data-efficient (converging faster early in training) and offers greater stability during downstream fine-tuning, MLM remains essential for achieving strong downstream performance. (4/8)
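For reference, the two objectives being compared, written out in standard form (textbook definitions, not copied from the paper): CLM predicts each token from its left context only, while MLM predicts a masked subset of tokens from the full bidirectional context.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{CLM}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right) \\
\mathcal{L}_{\mathrm{MLM}}(\theta) &= -\sum_{t \in \mathcal{M}} \log p_\theta\left(x_t \mid x_{\setminus \mathcal{M}}\right)
\end{align*}
```

Here $\mathcal{M}$ is the randomly sampled set of masked positions; everything outside $\mathcal{M}$ stays visible to the model.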
We run a large-scale controlled study: same model sizes, same pretraining data, and a wide downstream task suite. Examining two scenarios (pretraining from scratch and continued pretraining), we find that training with MLM alone is suboptimal. (3/8)
For years, MLM with bidirectional attention has been the standard for text representation learning. But lately, decoder-only models trained via causal language modeling (CLM) have been successfully repurposed as encoders. So, is MLM still the way to go? (2/8)
🚨 New paper drop: Should We Still Pretrain Encoders with Masked Language Modeling? We revisit a foundational question in NLP: Is masked language modeling (MLM) still the best way to pretrain encoder models for text representations? 👉 https://t.co/W1p5mjTTf2 (1/8)
arxiv.org
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence...
Feel free to comment or message if you have any questions! Special thanks to @ManuelFaysse, Emmanuel Malherbe, @CelineHudelot, and @PierreColombo6 for their valuable advice and contributions.
We demonstrate that our method needs only a small number of labeled instances to outperform data-free methods, which makes it a good fit for industrial constraints.
Our method outperforms heuristic-based (data-free) baselines in terms of normalized area under the performance-abstention curve, with very low computational overhead.
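For concreteness, here is a hedged sketch of how a performance-abstention curve and its normalized area could be computed on synthetic data; the normalization against random and oracle confidence is an assumption about the exact definition, not taken from the paper.

```python
import numpy as np

def abstention_curve(metric, confidence, fractions):
    """Mean per-query metric over retained queries when abstaining on the
    lowest-confidence fraction of queries."""
    order = np.argsort(confidence)            # least confident first
    curve = []
    for f in fractions:
        kept = order[int(f * len(metric)):]   # drop the bottom-f fraction
        curve.append(metric[kept].mean())
    return np.array(curve)

def area(y, x):
    """Trapezoidal area under the curve y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

rng = np.random.default_rng(0)
metric = rng.uniform(size=1000)                    # stand-in per-query metric (e.g. nDCG)
conf = metric + rng.normal(scale=0.3, size=1000)   # confidence loosely correlated with it
fractions = np.linspace(0.0, 0.9, 10)              # abstention rates from 0% to 90%

raw = area(abstention_curve(metric, conf, fractions), fractions)
oracle = area(abstention_curve(metric, metric, fractions), fractions)
random_ = area(abstention_curve(metric, rng.uniform(size=1000), fractions), fractions)
n_auc = (raw - random_) / (oracle - random_)
print(f"nAUC ~ {n_auc:.2f}")
```

A confidence score that ranks queries as well as the metric itself pushes nAUC toward 1, while an uninformative score stays near 0.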
We propose a data-based method that provides a simple way to obtain a confidence score: a linear combination of the query-document relevance scores returned by the LLM.
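A minimal sketch of what such a data-based confidence estimator could look like. The feature and target choices (sorted top-10 relevance scores, a binary "top-1 is relevant" target), the use of scikit-learn's LinearRegression as the linear combiner, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
k = 10

# Stand-in data: per-query top-k relevance scores (sorted descending) and a
# synthetic label standing in for whether the top-ranked document was relevant.
scores = np.sort(rng.normal(size=(500, k)), axis=1)[:, ::-1]
labels = (scores[:, 0] - scores[:, 1] + rng.normal(scale=0.5, size=500) > 0.5).astype(float)

# A small labeled calibration set is enough to fit the linear combination.
reg = LinearRegression().fit(scores[:100], labels[:100])
confidence = reg.predict(scores[100:])   # linear combination of the k scores

# Abstain on queries whose confidence falls below a chosen threshold.
threshold = 0.5
abstain = confidence < threshold
print(f"abstention rate: {abstain.mean():.2f}")
```

Since only a k-dimensional linear model is fit and applied, the overhead on top of the reranker itself is negligible, which is where the low computational cost mentioned above comes from.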
Abstention in IR has not been extensively explored in the literature. Some learning-based approaches exist, but they are computationally expensive and do not always align well with real-world industrial constraints (e.g., limited resources, black-box access in API services).
Neural Information Retrieval systems often make mistakes for various reasons (e.g., the model used, user query quality, missing documents). 💡 Idea: instead of making inaccurate predictions, why not estimate the model's confidence and abstain when it is too low?
📢 Excited to share my latest work! Introducing "Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism". Check it out! https://t.co/Zod2d6fK9d Key takeaways down below 👇
arxiv.org
Neural Information Retrieval (NIR) has significantly improved upon heuristic-based Information Retrieval (IR) systems. Yet, failures remain frequent, the models used often being unable to retrieve...