Yiqing Xie

@YiqingXieNLP

Followers: 169 · Following: 129 · Media: 15 · Statuses: 65

✨ Synthetic data; Auto Eval; Code-Gen. 🎓 PhD student @LTIatCMU; MSCS @dmguiuc. 👩‍💻 Previously: intern @meta; @MSFTResearch ×2; @AlibabaDAMO.

Joined September 2023
@YiqingXieNLP
Yiqing Xie
5 months
How to construct repo-level coding environments in a scalable way? Check out RepoST: an automated framework to construct repo-level environments using Sandbox Testing. Models trained with RepoST data can generalize well to other datasets (e.g., RepoEval).
3 · 23 · 87
@YiqingXieNLP
Yiqing Xie
1 month
RT @nlpxuhui: Very excited to share that HAICosystem has been accepted to #COLM2025! 🎉 Multi-turn, interactive evaluation is THE future,…
0 · 15 · 0
@YiqingXieNLP
Yiqing Xie
1 month
RepoST was accepted to @COLM_conf!!! See you in Montreal 🚀 #COLM2025
[Quoted tweet: the RepoST announcement above]
0 · 3 · 17
@YiqingXieNLP
Yiqing Xie
2 months
RT @lmathur_: Future AI systems interacting with humans will need to perform social reasoning that is grounded in behavioral cues and exter…
0 · 15 · 0
@YiqingXieNLP
Yiqing Xie
3 months
RT @shubhamrgandhi: 🚨New preprint🚨 I’m super excited to share our work: An Empirical Study on Strong-Weak Model Collaboration for Repo-le…
arxiv.org
We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most...
0 · 5 · 0
@YiqingXieNLP
Yiqing Xie
4 months
RT @GashonHussein: Excited to share our new paper, "One-Minute Video Generation with Test-Time Training (TTT)" in collaboration with NVIDIA…
0 · 159 · 0
@YiqingXieNLP
Yiqing Xie
5 months
RT @jacspringer: Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decre…
0 · 184 · 0
@YiqingXieNLP
Yiqing Xie
5 months
If you’re interested in RepoST, check out the:
- Paper:
- Code & Data:
Many thanks to my awesome collaborators: Alex Xie, @Divyanshu_Sheth, @stefan_fee, @dan_fried, @carolynprose!!
github.com
Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" - yiqingxyq/RepoST
0 · 0 · 5
@YiqingXieNLP
Yiqing Xie
5 months
Future work may include:
(1) Training and evaluating coding agents on RepoST-Train / RepoST-Eval.
(2) Extending RepoST to multiple repo-level tasks.
(3) Leveraging our datasets to further study the effect of data scale and context for code generation.
(4) …
1 · 0 · 3
@YiqingXieNLP
Yiqing Xie
5 months
We benchmark 12 Code LLMs on RepoST-Eval to evaluate their ability to generate code in real GitHub repositories. The best model achieves only 39.53% Pass@1. We further conducted a human study on a sampled set, where the human participants solved 81.5% of the examples.
1 · 0 · 2
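For reference, Pass@1 here is the standard execution-based code-generation metric. A minimal sketch of the unbiased pass@k estimator (Chen et al., 2021) that such numbers are conventionally computed with; the example data is made up, not from RepoST-Eval:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per problem,
    c = samples that pass all tests. For k=1 this reduces to c/n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 10 samples per problem; map problem id -> passing samples.
results = {"repo_fn_001": 4, "repo_fn_002": 0, "repo_fn_003": 10}
n = 10
score = sum(pass_at_k(n, c, k=1) for c in results.values()) / len(results)
print(f"Pass@1 = {score:.2%}")
```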
@YiqingXieNLP
Yiqing Xie
5 months
The execution feedback of RepoST-Train enables us to apply rejection sampling to obtain training targets. The finetuned model can generalize well to other public benchmarks (e.g., we obtained performance gains of 5.49% Pass@1 on HumanEval and 3.49% Pass@1 on RepoEval)
1 · 0 · 2
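A rough sketch of what rejection sampling against execution feedback can look like; `generate` and `run_tests_in_sandbox` are hypothetical stand-ins for a model call and a RepoST-style evaluation script, not functions from the actual codebase:

```python
# Hypothetical rejection-sampling loop: sample candidate solutions and
# keep only those verified by execution feedback as training targets.

def collect_training_targets(problems, generate, run_tests_in_sandbox,
                             num_samples: int = 8):
    targets = []
    for problem in problems:
        for _ in range(num_samples):
            candidate = generate(problem.prompt)           # sample one solution
            feedback = run_tests_in_sandbox(problem, candidate)
            if feedback.all_tests_passed:                  # keep verified code only
                targets.append({"prompt": problem.prompt,
                                "completion": candidate})
                break                                      # one target per problem
    return targets
```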
@YiqingXieNLP
Yiqing Xie
5 months
With the RepoST framework, we build a large-scale training set, RepoST-Train, with 7,415 functions sampled from 824 repositories. We also build an evaluation set, RepoST-Eval. Note that RepoST is fully automated and can potentially be used to construct live benchmarks to avoid contamination issues.
1 · 0 · 2
@YiqingXieNLP
Yiqing Xie
5 months
For quality control purposes, we iteratively resolve environment or runtime errors and improve test coverage. We also conduct execution-based, AST-based, and LLM-based quality checks. Human studies demonstrate high agreement between humans and LLM quality checkers.
1 · 0 · 2
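As one illustration of the AST-based check, a verifier might confirm that the sandboxed copy of a function is structurally identical to the original. This is only a guessed sketch assuming Python sources, not the actual RepoST checker:

```python
import ast

def same_function_ast(original_src: str, sandboxed_src: str, fn_name: str) -> bool:
    """Hypothetical AST-based quality check: the sandboxed script must
    contain a function whose AST (ignoring formatting) matches the original."""
    def find(src: str):
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef) and node.name == fn_name:
                return ast.dump(node)   # canonical string form of the subtree
        return None
    a, b = find(original_src), find(sandboxed_src)
    return a is not None and a == b
```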
@YiqingXieNLP
Yiqing Xie
5 months
Compared to integration testing used by previous datasets, our sandbox testing method is highly scalable: We only need to install packages for the target function and its necessary local dependencies, which is typically much simpler than building the entire repo.
1 · 0 · 3
@YiqingXieNLP
Yiqing Xie
5 months
Given a GitHub function, we sandbox it and its local dependencies into a separate script and generate tests with an LLM. When generating the target function, the model can access the entire GitHub repo. We then use the evaluation script to obtain execution feedback.
1 · 0 · 3
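To make the setup concrete, here is an invented example of the shape such a sandboxed evaluation script could take: one file holding the target function, its local dependencies, and LLM-generated tests. Every name below is illustrative; the file was not produced by RepoST:

```python
# Illustrative sandboxed evaluation script (single self-contained file).

def _normalize(path: str) -> str:            # local dependency, copied verbatim
    return path.strip().rstrip("/")

def resolve_config_path(root: str, name: str) -> str:   # target function
    """Function the model must (re)implement with full repo context."""
    return f"{_normalize(root)}/configs/{_normalize(name)}.yaml"

# --- LLM-generated tests -------------------------------------------------
def test_resolve_config_path():
    assert resolve_config_path("/repo/", " train ") == "/repo/configs/train.yaml"
    assert resolve_config_path("/repo", "eval") == "/repo/configs/eval.yaml"

if __name__ == "__main__":
    test_resolve_config_path()
    print("all tests passed")      # execution feedback consumed upstream
```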
@YiqingXieNLP
Yiqing Xie
5 months
RT @PranjalAggarw16: What if you could control how long a reasoning model “thinks”? Presenting L1-1.5B, an RL-trained reasoning model with…
0 · 72 · 0
@YiqingXieNLP
Yiqing Xie
5 months
RT @FariaHuqOaishi: [1/6] 🤔 Ever wondered if you could collaborate with an agent on web tasks? We present CowPilot 🐮, a framework for hu…
0 · 50 · 0
@YiqingXieNLP
Yiqing Xie
6 months
RT @AutoScienceAI: Introducing Carl, the first AI system to create a research paper that passes peer review. Carl's work was just accepted…
0 · 34 · 0
@YiqingXieNLP
Yiqing Xie
7 months
RT @jiayi_pirate: We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verifi…
0 · 1K · 0
@YiqingXieNLP
Yiqing Xie
7 months
RT @gaotianyu1350: Introducing MeCo (metadata conditioning then cooldown), a remarkably simple method that accelerates LM pre-training by s…
0 · 47 · 0