Hippolyte Gisserot-Boukhlef
@gisship
Followers 56 · Following 11 · Media 7 · Statuses 17
PhD student @CentraleSupelec. Previously @psl_univ, @MIT and @HECParis.
Joined August 2013
Huge thanks to the dream team! @N1colAIs, @ManuelFaysse, @DuarteMRAlves, Emmanuel Malherbe, @andre_t_martins, Céline Hudelot, and @PierreColombo6 🙏 (8/8)
There is plenty more in the paper, go check it out! https://t.co/W1p5mjTTf2 You can also visit our project page on Hugging Face: https://t.co/BdmtsVSN8o (7/8)
huggingface.co
What if you are starting from an existing pretrained model? We show it is better to start from a CLM-pretrained checkpoint and adapt it with MLM, rather than continuing MLM on an MLM-pretrained model. (6/8)
Inspired by this finding, we try a hybrid CLM+MLM pretraining schedule. This two-phase pretraining strategy outperforms pure MLM, suggesting a simple yet powerful new recipe for encoder training. (5/8)
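To make the recipe concrete, here is a minimal PyTorch sketch of such a two-phase CLM-then-MLM schedule on a toy model. The 75%/25% step split, the 30% masking rate, the model size, and the random stand-in batches are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, SEQ, MASK_ID = 1000, 128, 64, 1

class TinyLM(nn.Module):
    """Toy transformer that can run with either a causal or a bidirectional mask."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.pos = nn.Embedding(SEQ, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, ids, causal):
        h = self.emb(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)) if causal else None
        return self.head(self.body(h, mask=mask))

def clm_loss(model, ids):
    # Phase 1: next-token prediction with causal attention.
    logits = model(ids[:, :-1], causal=True)
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def mlm_loss(model, ids, mask_rate=0.3):
    # Phase 2: mask a fraction of tokens, predict them with bidirectional attention.
    masked = ids.clone()
    is_masked = torch.rand_like(ids, dtype=torch.float) < mask_rate
    masked[is_masked] = MASK_ID
    labels = ids.masked_fill(~is_masked, -100)  # compute the loss on masked positions only
    logits = model(masked, causal=False)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1), ignore_index=-100)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps, clm_fraction = 1000, 0.75  # assumption: 75% of steps on CLM, then MLM

for step in range(total_steps):
    ids = torch.randint(2, VOCAB, (8, SEQ))  # stand-in for a real tokenized batch
    loss = clm_loss(model, ids) if step < clm_fraction * total_steps else mlm_loss(model, ids)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Only the attention mask and the loss change at the switch point, so the weights learned during the CLM phase carry straight over into the MLM phase.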
We first compare plain CLM with plain MLM. We show that while CLM is more data-efficient (converging faster early in training) and offers greater stability during downstream fine-tuning, MLM remains essential for achieving strong downstream performance. (4/8)
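For reference, the two objectives being compared, written out in standard form (textbook definitions, not copied from the paper): CLM predicts each token from its left context only, while MLM predicts a masked subset of tokens from the full bidirectional context.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{CLM}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right) \\
\mathcal{L}_{\mathrm{MLM}}(\theta) &= -\sum_{t \in \mathcal{M}} \log p_\theta\left(x_t \mid x_{\setminus \mathcal{M}}\right)
\end{align*}
```

Here $\mathcal{M}$ is the randomly sampled set of masked positions; everything outside $\mathcal{M}$ stays visible to the model.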
We run a large-scale controlled study: same model sizes, same pretraining data, and a wide downstream task suite. Examining two scenarios (pretraining from scratch and continued pretraining), we find that training with MLM alone is suboptimal. (3/8)
For years, MLM with bidirectional attention has been the standard for text representation learning. But lately, decoder-only models trained via causal language modeling (CLM) have been successfully repurposed as encoders. So, is MLM still the way to go? (2/8)
🚨 New paper drop: Should We Still Pretrain Encoders with Masked Language Modeling? We revisit a foundational question in NLP: Is masked language modeling (MLM) still the best way to pretrain encoder models for text representations? 👉 https://t.co/W1p5mjTTf2 (1/8)
arxiv.org
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence...
Feel free to comment or message if you have any questions! Special thanks to @ManuelFaysse, Emmanuel Malherbe, @CelineHudelot, and @PierreColombo6 for their valuable advice and contributions.
We demonstrate that our method needs only a small number of labeled instances to outperform data-free methods, which makes it a good fit for industrial constraints.
Our method outperforms heuristic-based (data-free) baselines in terms of normalized area under the performance-abstention curve, with very low computational overhead.
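For concreteness, here is a hedged sketch of how a performance-abstention curve and its normalized area could be computed on synthetic data; the normalization against random and oracle confidence is an assumption about the exact definition, not taken from the paper.

```python
import numpy as np

def abstention_curve(metric, confidence, fractions):
    """Mean per-query metric over retained queries when abstaining on the
    lowest-confidence fraction of queries."""
    order = np.argsort(confidence)            # least confident first
    curve = []
    for f in fractions:
        kept = order[int(f * len(metric)):]   # drop the bottom-f fraction
        curve.append(metric[kept].mean())
    return np.array(curve)

def area(y, x):
    """Trapezoidal area under the curve y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

rng = np.random.default_rng(0)
metric = rng.uniform(size=1000)                    # stand-in per-query metric (e.g. nDCG)
conf = metric + rng.normal(scale=0.3, size=1000)   # confidence loosely correlated with it
fractions = np.linspace(0.0, 0.9, 10)              # abstention rates from 0% to 90%

raw = area(abstention_curve(metric, conf, fractions), fractions)
oracle = area(abstention_curve(metric, metric, fractions), fractions)
random_ = area(abstention_curve(metric, rng.uniform(size=1000), fractions), fractions)
n_auc = (raw - random_) / (oracle - random_)
print(f"nAUC ~ {n_auc:.2f}")
```

A confidence score that ranks queries as well as the metric itself pushes nAUC toward 1, while an uninformative score stays near 0.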
We propose a data-based method that provides a simple way to obtain a confidence score: a linear combination of the query-document relevance scores returned by the LLM.
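A minimal sketch of what such a data-based confidence estimator could look like. The feature and target choices (sorted top-10 relevance scores, a binary "top-1 is relevant" target), the use of scikit-learn's LinearRegression as the linear combiner, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
k = 10

# Stand-in data: per-query top-k relevance scores (sorted descending) and a
# synthetic label standing in for whether the top-ranked document was relevant.
scores = np.sort(rng.normal(size=(500, k)), axis=1)[:, ::-1]
labels = (scores[:, 0] - scores[:, 1] + rng.normal(scale=0.5, size=500) > 0.5).astype(float)

# A small labeled calibration set is enough to fit the linear combination.
reg = LinearRegression().fit(scores[:100], labels[:100])
confidence = reg.predict(scores[100:])   # linear combination of the k scores

# Abstain on queries whose confidence falls below a chosen threshold.
threshold = 0.5
abstain = confidence < threshold
print(f"abstention rate: {abstain.mean():.2f}")
```

Since only a k-dimensional linear model is fit and applied, the overhead on top of the reranker itself is negligible, which is where the low computational cost mentioned above comes from.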
Abstention in IR has not been extensively explored in the literature. Some learning-based approaches exist, but they are computationally expensive and do not always align well with real-world industrial constraints (e.g., limited resources, black-box access in API services).
Neural Information Retrieval systems often make mistakes for various reasons (e.g., the model used, user query quality, missing documents). 💡 Idea: instead of making inaccurate predictions, why not estimate the model's confidence and abstain when it is too low?
📢 Excited to share my latest work! Introducing "Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism". Check it out! https://t.co/Zod2d6fK9d Key takeaways down below 👇
arxiv.org
Neural Information Retrieval (NIR) has significantly improved upon heuristic-based Information Retrieval (IR) systems. Yet, failures remain frequent, the models used often being unable to retrieve...