Yiqing Xie

@YiqingXieNLP

Followers: 169 · Following: 129 · Media: 15 · Statuses: 65

✨ Synthetic data; Auto Eval; Code-Gen. 🎓 PhD student @LTIatCMU; MSCS @dmguiuc. 👩‍💻 Previously: intern @meta; @MSFTResearch ×2; @AlibabaDAMO.

Joined September 2023
@YiqingXieNLP
Yiqing Xie
5 months
How to construct repo-level coding environments in a scalable way? Check out RepoST: an automated framework to construct repo-level environments using Sandbox Testing. Models trained with RepoST data can generalize well to other datasets (e.g., RepoEval).
3 · 23 · 87
@YiqingXieNLP
Yiqing Xie
1 month
RT @nlpxuhui: Very excited to share that HAICosystem has been accepted to #COLM2025! 🎉 Multi-turn, interactive evaluation is THE future,…
0 · 15 · 0
@YiqingXieNLP
Yiqing Xie
1 month
RepoST was accepted to @COLM_conf!!! See you in Montreal 🚀 #COLM2025
[Quoted tweet: the RepoST announcement above]
0 · 3 · 17
@YiqingXieNLP
Yiqing Xie
2 months
RT @lmathur_: Future AI systems interacting with humans will need to perform social reasoning that is grounded in behavioral cues and exter…
0 · 15 · 0
@YiqingXieNLP
Yiqing Xie
3 months
RT @shubhamrgandhi: 🚨New preprint🚨 I’m super excited to share our work: An Empirical Study on Strong-Weak Model Collaboration for Repo-le…
arxiv.org
We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most...
0 · 5 · 0
@YiqingXieNLP
Yiqing Xie
4 months
RT @GashonHussein: Excited to share our new paper, "One-Minute Video Generation with Test-Time Training (TTT)" in collaboration with NVIDIA…
0 · 159 · 0
@YiqingXieNLP
Yiqing Xie
5 months
RT @jacspringer: Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decre…
0 · 184 · 0
@YiqingXieNLP
Yiqing Xie
5 months
If you’re interested in RepoST, check out the:
- Paper:
- Code & Data:
Many thanks to my awesome collaborators: Alex Xie, @Divyanshu_Sheth, @stefan_fee, @dan_fried, @carolynprose!!
github.com
Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" - yiqingxyq/RepoST
0 · 0 · 5
@YiqingXieNLP
Yiqing Xie
5 months
Future work may include:
(1) Training and evaluating coding agents on RepoST-Train / RepoST-Eval.
(2) Extending RepoST to multiple repo-level tasks.
(3) Leveraging our datasets to further study the effect of data scale and context for code generation.
(4) …
1 · 0 · 3
@YiqingXieNLP
Yiqing Xie
5 months
We benchmark 12 Code LLMs on RepoST-Eval to evaluate their ability to generate code in real GitHub repositories. The best model achieves only 39.53% Pass@1. We further conducted a human study on a sampled set, where the human participants solved 81.5% of the examples.
1 · 0 · 2
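For reference, Pass@1 here is the standard execution-based code-generation metric. A minimal sketch of the unbiased pass@k estimator (Chen et al., 2021) that such numbers are conventionally computed with; the example data is made up, not from RepoST-Eval:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per problem,
    c = samples that pass all tests. For k=1 this reduces to c/n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 10 samples per problem; map problem id -> passing samples.
results = {"repo_fn_001": 4, "repo_fn_002": 0, "repo_fn_003": 10}
n = 10
score = sum(pass_at_k(n, c, k=1) for c in results.values()) / len(results)
print(f"Pass@1 = {score:.2%}")
```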
@YiqingXieNLP
Yiqing Xie
5 months
The execution feedback of RepoST-Train enables us to apply rejection sampling to obtain training targets. The finetuned model can generalize well to other public benchmarks (e.g., we obtained performance gains of 5.49% Pass@1 on HumanEval and 3.49% Pass@1 on RepoEval)
1 · 0 · 2
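A rough sketch of what rejection sampling against execution feedback can look like; `generate` and `run_tests_in_sandbox` are hypothetical stand-ins for a model call and a RepoST-style evaluation script, not functions from the actual codebase:

```python
# Hypothetical rejection-sampling loop: sample candidate solutions and
# keep only those verified by execution feedback as training targets.

def collect_training_targets(problems, generate, run_tests_in_sandbox,
                             num_samples: int = 8):
    targets = []
    for problem in problems:
        for _ in range(num_samples):
            candidate = generate(problem.prompt)           # sample one solution
            feedback = run_tests_in_sandbox(problem, candidate)
            if feedback.all_tests_passed:                  # keep verified code only
                targets.append({"prompt": problem.prompt,
                                "completion": candidate})
                break                                      # one target per problem
    return targets
```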
@YiqingXieNLP
Yiqing Xie
5 months
With the RepoST framework, we build a large-scale training set, RepoST-Train, with 7,415 functions sampled from 824 repositories. We also build an evaluation set, RepoST-Eval. Note that RepoST is fully automated and can potentially be used to construct live benchmarks to avoid contamination issues.
1 · 0 · 2
@YiqingXieNLP
Yiqing Xie
5 months
For quality control purposes, we iteratively resolve environment or runtime errors and improve test coverage. We also conduct execution-based, AST-based, and LLM-based quality checks. Human studies demonstrate high agreement between humans and LLM quality checkers.
1 · 0 · 2
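As one illustration of the AST-based check, a verifier might confirm that the sandboxed copy of a function is structurally identical to the original. This is only a guessed sketch assuming Python sources, not the actual RepoST checker:

```python
import ast

def same_function_ast(original_src: str, sandboxed_src: str, fn_name: str) -> bool:
    """Hypothetical AST-based quality check: the sandboxed script must
    contain a function whose AST (ignoring formatting) matches the original."""
    def find(src: str):
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef) and node.name == fn_name:
                return ast.dump(node)   # canonical string form of the subtree
        return None
    a, b = find(original_src), find(sandboxed_src)
    return a is not None and a == b
```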
@YiqingXieNLP
Yiqing Xie
5 months
Compared to integration testing used by previous datasets, our sandbox testing method is highly scalable: We only need to install packages for the target function and its necessary local dependencies, which is typically much simpler than building the entire repo.
1 · 0 · 3
@YiqingXieNLP
Yiqing Xie
5 months
Given a GitHub function, we sandbox it and its local dependencies into a separate script and generate tests with an LLM. When generating the target function, the model can access the entire GitHub repo. We then use the evaluation script to obtain execution feedback.
1 · 0 · 3
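To make the setup concrete, here is an invented example of the shape such a sandboxed evaluation script could take: one file holding the target function, its local dependencies, and LLM-generated tests. Every name below is illustrative; the file was not produced by RepoST:

```python
# Illustrative sandboxed evaluation script (single self-contained file).

def _normalize(path: str) -> str:            # local dependency, copied verbatim
    return path.strip().rstrip("/")

def resolve_config_path(root: str, name: str) -> str:   # target function
    """Function the model must (re)implement with full repo context."""
    return f"{_normalize(root)}/configs/{_normalize(name)}.yaml"

# --- LLM-generated tests -------------------------------------------------
def test_resolve_config_path():
    assert resolve_config_path("/repo/", " train ") == "/repo/configs/train.yaml"
    assert resolve_config_path("/repo", "eval") == "/repo/configs/eval.yaml"

if __name__ == "__main__":
    test_resolve_config_path()
    print("all tests passed")      # execution feedback consumed upstream
```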
@YiqingXieNLP
Yiqing Xie
5 months
RT @PranjalAggarw16: What if you could control how long a reasoning model “thinks”? Presenting L1-1.5B, an RL-trained reasoning model with…
0 · 72 · 0
@YiqingXieNLP
Yiqing Xie
5 months
RT @FariaHuqOaishi: [1/6] 🤔 Ever wondered if you could collaborate with an agent on web tasks? We present CowPilot 🐮, a framework for hu…
0 · 50 · 0
@YiqingXieNLP
Yiqing Xie
6 months
RT @AutoScienceAI: Introducing Carl, the first AI system to create a research paper that passes peer review. Carl's work was just accepted…
0 · 34 · 0
@YiqingXieNLP
Yiqing Xie
7 months
RT @jiayi_pirate: We reproduced DeepSeek R1-Zero in the CountDown game, and it just works. Through RL, the 3B base LM develops self-verifi…
0 · 1K · 0
@YiqingXieNLP
Yiqing Xie
7 months
RT @gaotianyu1350: Introducing MeCo (metadata conditioning then cooldown), a remarkably simple method that accelerates LM pre-training by s…
0 · 47 · 0