Joseph Jeesung Suh (@JosephJSSuh)
CS Grad student @ BAIR, UC Berkeley · Berkeley, CA · Joined June 2024
Followers: 57 · Following: 16 · Media: 7 · Statuses: 25
(11/11) For people who are interested, here are the links: Paper: https://t.co/WvMRy4DdjR GitHub: https://t.co/wEYPxMH4TU Huge thanks to my amazing PI @serinachang5 and collaborator @SuhongMoon.
github.com
GEMS: Rethinking LLM Human Simulation, When a Graph is What You Need - schang-lab/gems
(10/11) Takeaway 🥡 If your simulation task is a discrete choice with relational structure, try GEMS 💎 before spinning up a 70B-parameter model. You might get similar (or better!) accuracy with a fraction of the compute and better debuggability!
(9/11) This builds on our earlier work SubPOP 🍭 (ACL 2025 main), where fine-tuning LLMs on scaled survey data reduced human-LLM gaps by up to half and generalized to new subpopulations & topics. Now we ask: when is a graph what you need? SubPOP:
aclanthology.org
Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, Serina Chang. Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
(8/11) Interpretability & transparency matter. Node embeddings from GEMS reveal latent dimensions, from public opinion ideologies to pricing sensitivities. 🔍 Unlike LLMs, GEMS is trained in-house from scratch, 🪟 removing the risk of data leakage and bias from opaque pretraining.
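A minimal sketch of the kind of inspection this describes, assuming access to the trained individual-node embeddings (random placeholders below) and using PCA as the probe; this is illustrative, not the paper's exact analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
individual_emb = rng.normal(size=(1000, 64))  # placeholder for learned GEMS embeddings

pca = PCA(n_components=2)
coords = pca.fit_transform(individual_emb)

# coords[:, 0] and coords[:, 1] are candidate latent axes; correlating them with
# known attributes (e.g., party ID, price sensitivity) suggests what they encode.
print(pca.explained_variance_ratio_)
```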
(7/11) Efficiency matters. Smaller models mean faster iteration, lower cost, and easier deployment for survey design, policy analysis, and decision support. 🚀 It is also much easier to scale up to larger datasets with 1000× fewer parameters and 100× less compute!
(6/11) Our datasets and settings: We test 3 settings - predicting missing responses (i.e., imputation), new individuals, and new questions - and 3 datasets, spanning public opinion, personality traits, economics experiments, and grammar skills.
(5/11) Key finding: A GNN that’s ~1000× smaller than LLMs matches or surpasses them on predicting human behaviors consistently across datasets and settings — while being far more interpretable and transparent. 💡
(4/11) Why graphs? Relational structure is the signal for many human behaviors: for example, a person who is ‘worried’ about the ‘health effects of COVID-19’ would likely ‘often’ ‘watch public health news’. GEMS learns these relations directly on graphs.
(3/11) Meet GEMS 💎 — Graph-basEd Models for human Simulation. We cast human simulation as link prediction on a heterogeneous graph: nodes = individuals, subgroups, choices; edges = individual ↔ subgroup, individual ↔ choice. Simple, transparent, and fast. ⚡
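A minimal sketch of that graph structure, assuming PyTorch Geometric's HeteroData as the container (the actual GEMS code is in the linked repo; node counts and edges here are illustrative placeholders):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Illustrative node counts, not taken from the paper.
data['individual'].num_nodes = 1000
data['subgroup'].num_nodes = 20
data['choice'].num_nodes = 200

# individual <-> subgroup edges: subpopulation membership.
indiv = torch.arange(1000)
group = torch.randint(0, 20, (1000,))
data['individual', 'belongs_to', 'subgroup'].edge_index = torch.stack([indiv, group])

# individual <-> choice edges: observed responses; predicting the held-out
# ones as missing links is human simulation cast as link prediction.
resp_indiv = torch.randint(0, 1000, (5000,))
resp_choice = torch.randint(0, 200, (5000,))
data['individual', 'selected', 'choice'].edge_index = torch.stack([resp_indiv, resp_choice])
```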
(2/11) Why discrete‑choice? A lot of “human simulation” with LLMs is predicting which choice an individual would pick from a small set:
• Respondents in opinion polls
• Customers choosing one item over another
• Game players with finite next actions
• Students answering MCQs
LLMs have dominated recent work on simulating human behaviors. But do you really need them? In discrete‑choice settings, our answer is: not necessarily. A lightweight graph neural network (GNN) can match or beat strong LLM-based methods. Paper: https://t.co/WvMRy4DdjR 🧵👇
🤔 Do LLMs exhibit in-group↔out-group perceptions like us? ❓ Can they serve as faithful virtual subjects of human political partisans? Excited to share our paper on taking LLM virtual personas to the *next level* of depth! 🔗 https://t.co/LzeDAMtrEV 🧵
💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets. Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/
New Paper: We unlock AI evaluation with explanatory and predictive power through general ability scales!
- Explains what common benchmarks really measure
- Extracts explainable ability profiles of AI systems
- Predicts performance for new task instances, in & out-of-distribution 🧵
For people who are interested, here are the links: Paper: https://t.co/zQ2klONwCM GitHub: https://t.co/r1v8QqSw0C This work would not have been possible without our amazing PI @serinachang5 and collaborators @erfan_jp, @SuhongMoon, @joshminwookang, and Prof. John Canny.
github.com
[ACL 2025 Long Main] Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions - JosephJeesungSuh/subpop
Why does this matter? Researchers often need to estimate responses for unseen subpopulations or newly formulated questions (or both), especially in the early stages of survey design. Our approach helps fill these gaps when immediate large-scale human polling isn't available.
Beyond accuracy, generalization is crucial. Fine-tuned models exhibit stable prediction improvements for:
• Unseen subpopulations (not in the fine-tuning data)
• New survey topics
• Different survey families (American Trends Panel → General Social Survey)
Key finding: Fine-tuning our LLMs drastically narrows the human-LLM opinion gap—by up to 46%. Even better, every subgroup sees consistent improvement, addressing previous concerns that LLM-based methods might favor certain demographics' opinions over others.
Meet SubPOP! 🍭 SubPOP is a dataset of 70K subpopulation-response pairs (6.5× larger than past work), curated from two major opinion survey families. We fine-tune LLMs on SubPOP to match their response distributions to those of human subjects.
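A minimal sketch of distribution matching, assuming the model assigns a score to each answer option (e.g., option-token logits) and the target is the human response distribution for a subpopulation; the KL-divergence loss below is an illustrative stand-in, not necessarily SubPOP's exact training objective:

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(option_logits: torch.Tensor,
                               human_dist: torch.Tensor) -> torch.Tensor:
    # option_logits: [batch, n_options] model scores for each answer option.
    # human_dist:    [batch, n_options] observed human response distribution.
    log_p_model = F.log_softmax(option_logits, dim=-1)
    return F.kl_div(log_p_model, human_dist, reduction='batchmean')

# Illustrative 4-option question: the human distribution leans toward option 1.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])
human = torch.tensor([[0.55, 0.25, 0.15, 0.05]])
loss = distribution_matching_loss(logits, human)
```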
However, there hasn't been a survey dataset that is:
1. large-scale, with expansive sets of survey data sufficient for fine-tuning LLMs
2. high quality, with careful filtering and curation
3. capable of evaluating model generalization across topics & styles