Yixing Jiang @ NeurIPS
@jyx_su
Followers 912 · Following 58 · Media 4 · Statuses 19
PhD student at Stanford | Stanford Machine Learning Group, HealthRex Lab | National Science Scholar | Previously at Google DeepMind, SmarterDx
Stanford, CA
Joined June 2022
NeurIPS received 21,575 paper submissions this year. Our Agentic Reviewer, released last week, has already surpassed that number in papers submitted and reviewed. It's clear that agentic paper reviewing is here to stay and will be impactful!
Releasing a new "Agentic Reviewer" for research papers. I started coding this as a weekend project, and @jyx_su made it much better. I was inspired by a student who had a paper rejected 6 times over 3 years. Their feedback loop -- waiting ~6 months for feedback each time -- was
68 · 281 · 2K
Excited to share that "Agentic Reviewer" (developed by @AndrewYNg and me) has reviewed more papers than the entire NeurIPS 2025 submission count (21,575). Thank you for the enthusiasm from 160 countries! We are glad that over 95% of you found the generated reviews useful, and
2 · 2 · 13
We highlight the fundamental shift from AI as a tool to AI as a teammate in our recent multi-agent benchmarking study, which measures leading large language models' ability to carry out tasks in medicine. Full study: https://t.co/zudgYwcvnT
@StanfordAILab @StanfordMed
🩺 + 🤖 Introducing MedAgentBench: EHR Environment to Benchmark Medical LLM Agents
✅ 300 agent tasks (beyond QA) from 10 physician-written categories
✅ 100 realistic patient profiles with 700,000+ data elements
✅ FHIR-compliant interactive environment for translation
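Since the environment is FHIR-compliant, an agent would presumably interact with the EHR through standard FHIR REST-style searches. Below is a minimal, hypothetical sketch of that interaction; the base URL, patient ID, and the use of a mocked Bundle response are illustrative assumptions, not the actual MedAgentBench API.

```python
import json

# Assumed local FHIR test server; not the real MedAgentBench endpoint.
FHIR_BASE = "http://localhost:8080/fhir"

def build_observation_query(patient_id: str, loinc_code: str) -> str:
    """Build a FHIR search URL for a patient's most recent lab observation."""
    return (f"{FHIR_BASE}/Observation"
            f"?patient={patient_id}&code={loinc_code}"
            f"&_sort=-date&_count=1")

def latest_value(bundle_json: str) -> float:
    """Extract the newest observation value from a FHIR Bundle."""
    bundle = json.loads(bundle_json)
    resource = bundle["entry"][0]["resource"]
    return resource["valueQuantity"]["value"]

# A mocked Bundle standing in for a real server response.
mock_bundle = json.dumps({
    "resourceType": "Bundle",
    "entry": [{"resource": {
        "resourceType": "Observation",
        "valueQuantity": {"value": 142.0, "unit": "mmol/L"},
    }}],
})

url = build_observation_query("pat-001", "2951-2")  # LOINC 2951-2: serum sodium
print(url)
print(latest_value(mock_bundle))  # 142.0
```

In a live environment the agent would issue the built URL as an HTTP GET and parse the returned Bundle the same way.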
3 · 4 · 8
We further break the tasks into three difficulty levels: easy (one step), medium (two steps), and hard (three or more steps). Most models achieve a lower success rate (SR) when tasks require more steps. Gemini 1.5 Pro is one exception, achieving the highest SR on hard tasks.
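The step-count bucketing and per-bucket success rates described above can be tabulated straightforwardly; the sketch below uses made-up task outcomes purely for illustration.

```python
def difficulty(num_steps: int) -> str:
    """Bucket a task by step count: easy = 1, medium = 2, hard = 3+."""
    if num_steps == 1:
        return "easy"
    if num_steps == 2:
        return "medium"
    return "hard"

def success_rates(results):
    """results: list of (num_steps, succeeded) pairs -> SR per bucket."""
    totals, wins = {}, {}
    for steps, ok in results:
        d = difficulty(steps)
        totals[d] = totals.get(d, 0) + 1
        wins[d] = wins.get(d, 0) + (1 if ok else 0)
    return {d: wins[d] / totals[d] for d in totals}

# Toy data, not real benchmark results.
toy = [(1, True), (1, True), (2, True), (2, False), (3, False), (4, True)]
print(success_rates(toy))  # {'easy': 1.0, 'medium': 0.5, 'hard': 0.5}
```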
1 · 0 · 9
MedAgentBench is an unsaturated, agent-oriented benchmark on which current state-of-the-art LLMs show some ability to succeed. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%, but there is still substantial room for improvement.
3 · 3 · 23
Example successful trajectory and common error patterns in MedAgentBench.
0 · 0 · 3
The MedAgentBench workflow begins with a clinician specifying a high-level task. The agent orchestrator then interacts with both the LLM provider and the electronic medical record environment to complete the task, and finally provides feedback to the clinician.
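The clinician → orchestrator → LLM/EHR loop just described can be sketched as a simple control loop. The LLM client and EHR interface here are stubs of my own invention; a real orchestrator would call an actual model API and a FHIR server.

```python
class StubLLM:
    """Stand-in for an LLM provider; finishes the task immediately."""
    def next_action(self, task, history):
        # A real orchestrator would prompt the model with the task plus
        # the observation history and parse its chosen action.
        return {"type": "finish", "summary": f"done: {task}"}

class StubEHR:
    """Stand-in for the electronic medical record environment."""
    def execute(self, action):
        return {"status": "ok"}

def run_task(task: str, llm, ehr, max_steps: int = 10) -> str:
    """Loop: ask the LLM for an action, run it against the EHR, repeat."""
    history = []
    for _ in range(max_steps):
        action = llm.next_action(task, history)
        if action["type"] == "finish":
            return action["summary"]  # feedback returned to the clinician
        history.append(ehr.execute(action))
    return "max steps reached"

print(run_task("check latest potassium for pat-001", StubLLM(), StubEHR()))
```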
1 · 0 · 6
Website: https://t.co/X4hcAs4a9p Paper: https://t.co/isG4louSr8 GitHub: https://t.co/NV7PcYqcQ1 Joint work with @KameronBlack633 @DannySungi9920 @GloriaGeng_ and supervised by @james_y_zou @AndrewYNg @jonc101x in @StanfordAILab
github.com
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents - stanfordmlgroup/MedAgentBench
0 · 1 · 18
Congrats @GoogleDeepMind on the Gemma-2-2B release! Gemma-2-2B has been tested in the Arena under "guava-chatbot". With just 2B parameters, it achieves an impressive score of 1130, on par with models 10x its size! (For reference: GPT-3.5-Turbo-0613: 1117, Mixtral-8x7b: 1114.) This
We're welcoming a new 2 billion parameter model to the Gemma 2 family. 🛠️ It offers best-in-class performance for its size and can run efficiently on a wide range of hardware. Developers can get started with 2B today → https://t.co/hQRWYwGY7q
17 · 100 · 627
Today, we're releasing Gemma 2 to researchers and developers globally. Available in both 9 billion and 27 billion parameter sizes, it's much more powerful and efficient than the first generation. Learn more →
blog.google
Gemma 2, our next generation of open models, is now available globally for researchers and developers.
68 · 234 · 1K
@AndrewYNg @StanfordAILab @jonc101x @JeffDean Try it on your own! Code here 🛠️: https://t.co/ELC2MPaK5i Paper: https://t.co/rKOG9z7YyT Big thanks to my awesome colleagues: @jeremy_irvin16 @ji_hun_wang @mahmed_ch
#AI #MachineLearning
arxiv.org
Large language models are effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an...
0 · 0 · 3
Want an image classifier within minutes? Just prompt latest models like GPT-4o and Gemini 1.5 with a bunch of demo examples (you can include thousands of them now) and ask multiple queries in one go! Work with @AndrewYNg at @StanfordAILab and @jonc101x. Inspired by @JeffDean
Want to unlock the full potential of GPT-4o & Gemini 1.5 Pro? Give them **many** demonstrations and batch your queries! Our new work w/ @AndrewYNg shows these models can benefit substantially from lots of demo examples and asking many questions at once! https://t.co/SeAqBRFyy7
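The "many demonstrations, batched queries" idea above amounts to packing a long prompt with labeled examples and then asking several questions in one request. The prompt template below is an illustrative assumption, not the paper's exact format.

```python
def build_many_shot_prompt(demos, queries):
    """demos: list of (input, label) pairs; queries: inputs to classify.

    Returns a single prompt containing all demonstrations followed by a
    numbered batch of queries, so one model call answers them all.
    """
    lines = ["Classify each input. Examples:"]
    for x, y in demos:
        lines.append(f"Input: {x}\nLabel: {y}")
    lines.append("Now answer all of the following in order:")
    for i, q in enumerate(queries, 1):
        lines.append(f"Q{i}: {q}")
    return "\n".join(lines)

# Toy demonstrations; a many-shot setup would include hundreds or more.
demos = [("a photo of a tabby cat", "cat"), ("a photo of a beagle", "dog")]
queries = ["a photo of a siamese cat", "a photo of a poodle"]
print(build_many_shot_prompt(demos, queries))
```

The resulting string would be sent as one request to a long-context model, amortizing the cost of the demonstrations across all batched queries.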
4 · 12 · 40
Many-Shot In-Context Learning in Multimodal Foundation Models Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an
3 · 55 · 276