Yutong (Kelly) He
@electronickale
Followers 928 · Following 1K · Media 17 · Statuses 117
PhD student @mldcmu, I’m so delusional that doing generative modeling is my job
Pittsburgh, PA
Joined March 2021
✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵
3 replies · 31 reposts · 84 likes
New work: “GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models”. GLASS generates images by sampling stochastic Markov transitions with ODEs - allowing us to boost text-image alignment for large-scale models at inference time! https://t.co/unsuG3mYer [1/7]
3 replies · 58 reposts · 241 likes
OneFlow is a native multimodal model using just insertions. Scales better than Transfusion across a range of benchmarks up to 8B. New capabilities: naturally hierarchical generation, CFG for text detailedness, concurrent interleaved image+text generation, & more. https://t.co/LeDEiyRnQd
oneflow.framer.ai
OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation.
Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow…
2 replies · 19 reposts · 150 likes
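The "CFG for text detailedness" bit maps onto standard classifier-free guidance. A minimal sketch of the generic CFG combination rule; the blending formula is textbook CFG, which I'm assuming rather than OneFlow's exact formulation:

```python
import numpy as np

def cfg(v_cond, v_uncond, w):
    """Classifier-free guidance: push the unconditional prediction
    toward (and past, for w > 1) the conditional one."""
    return v_uncond + w * (v_cond - v_uncond)

v_c, v_u = np.array([1.0, 0.0]), np.array([0.2, 0.1])
print(cfg(v_c, v_u, 0.0))  # w = 0: purely unconditional
print(cfg(v_c, v_u, 1.0))  # w = 1: purely conditional
print(cfg(v_c, v_u, 3.0))  # w > 1: extrapolate for more 'detailedness'
```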
Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. https://t.co/QBp1kELVhF 🧵(1/n)
5 replies · 69 reposts · 341 likes
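A worked toy to pin down the central object: the flow map psi_{s,t} carries a point at time s to time t along the probability-flow ODE, and composes (psi_{t,u} after psi_{s,t} equals psi_{s,u}). The linear ODE dx/dt = -x below is an illustrative assumption; consistency, shortcut, and mean-flow methods can be read as different ways of learning this one map.

```python
# Flow map as a numerical object, under a toy linear ODE dx/dt = -x.
import numpy as np

def flow_map(x, s, t, n_steps=1000):
    """psi_{s,t}(x): integrate dx/dt = -x from time s to time t (Euler)."""
    dt = (t - s) / n_steps
    for _ in range(n_steps):
        x = x + (-x) * dt
    return x

x = np.array([2.0])
# Semigroup property that makes the flow map 'one central object':
# psi_{t,u} o psi_{s,t} == psi_{s,u}
lhs = flow_map(flow_map(x, 1.0, 0.5), 0.5, 0.0)
rhs = flow_map(x, 1.0, 0.0)
print(lhs, rhs, np.exp(1.0) * x)   # all ~ 2e, the exact solution x * e^{s-t}
```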
🚨Excited to introduce a major development in building safer language models: Safety Pretraining! Instead of post-hoc alignment, we take a step back and embed safety directly into pretraining. 🧵(1/n)
7 replies · 90 reposts · 342 likes
LLMs lose diversity after RL post-training, and this hurts test-time scaling & creativity. Why does this collapse happen, and how can we fix it? Our new work introduces: 🔍 RL as Sampling (analysis) 🗺️ Outcome-based Exploration (intervention) [1/n]
9 replies · 87 reposts · 469 likes
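For a rough flavor of what an outcome-level exploration bonus could look like (my illustrative assumption, not the paper's exact intervention): reward answers the policy hasn't produced often, so probability mass doesn't collapse onto a single outcome.

```python
# Count-based bonus over final *outcomes* (answers), not states.
from collections import Counter
import math

outcome_counts = Counter()

def shaped_reward(answer, is_correct, bonus_scale=0.5):
    """Correctness reward plus a bonus that decays with outcome visitation."""
    outcome_counts[answer] += 1
    bonus = bonus_scale / math.sqrt(outcome_counts[answer])
    return float(is_correct) + bonus

# Two samples with the same answer: the second gets a smaller bonus.
print(shaped_reward("42", True))   # 1.0 + 0.5
print(shaped_reward("42", True))   # 1.0 + 0.5/sqrt(2)
print(shaped_reward("41", False))  # 0.0 + 0.5
```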
Excited to share our work Set Block Decoding! A new paradigm combining next-token-prediction and masked (or discrete diffusion) models, allowing parallel decoding without any architectural changes and with exact KV cache. Arguably one of the simplest ways to accelerate LLMs!
3 replies · 24 reposts · 115 likes
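A toy sketch of block-wise parallel decoding in this spirit (a simplification under my own assumptions, not the paper's algorithm): append a block of MASK tokens, then unmask the most confident positions over a few parallel passes.

```python
# Toy block-wise parallel decoding; toy_model stands in for a transformer.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 100, -1

def toy_model(tokens):
    """Stand-in for a transformer: per-position logits over the vocabulary."""
    return rng.standard_normal((len(tokens), VOCAB))

def decode_block(prefix, block_size=8, passes=4):
    tokens = prefix + [MASK] * block_size
    for _ in range(passes):
        logits = toy_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Unmask the half of masked positions the model is most confident on.
        conf = {i: logits[i].max() for i in masked}
        for i in sorted(masked, key=conf.get, reverse=True)[: max(1, len(masked) // 2)]:
            tokens[i] = int(logits[i].argmax())
    # Any stragglers get filled in a final pass.
    logits = toy_model(tokens)
    return [int(logits[i].argmax()) if t == MASK else t for i, t in enumerate(tokens)]

print(decode_block([5, 17, 3]))
```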
Holy shi this is legit one of the most creative papers I’ve read in a while
🎨 Super impressive. Researchers built an optical diffusion model that uses light, not heavy computer math, to generate images with quality similar to standard AI diffusion but at near-zero power. This work shows a light-powered image generator that produces images with almost…
0 replies · 0 reposts · 1 like
At this point, I propose to cancel rebuttal altogether, you only get one shot, don’t miss your chance to blow, opportunity comes once in a lifetime, bro
6 replies · 1 repost · 40 likes
My rebuttal a year ago: Thank you so much for your thoughtful feedback; we would like to address your concerns below… My rebuttal last time: Here are our results. And I won’t thank you because I won’t have space. My rebuttal this time: Trust me bro
2 replies · 1 repost · 49 likes
Padding in our non-AR sequence models? Yuck. 🙅 👉 Instead of unmasking, our new work *Edit Flows* performs iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
8 replies · 80 reposts · 518 likes
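A minimal sketch of the underlying data structure: position-relative insert/delete edits applied to a variable-length sequence. The random edit policy here is a toy stand-in for the learned flow.

```python
# Insert/delete edits on a variable-length sequence; no masks, no padding.
import random

random.seed(0)

def apply_edits(seq, edits):
    """Apply position-relative edits right-to-left so indices stay valid."""
    for op, pos, tok in sorted(edits, key=lambda e: e[1], reverse=True):
        if op == "insert":
            seq.insert(pos, tok)           # insert at pos == len(seq) appends
        elif op == "delete" and pos < len(seq):
            del seq[pos]
    return seq

seq = ["the", "cat"]
for _ in range(3):  # iterative refinement with a random toy edit policy
    edits = [(random.choice(["insert", "delete"]),
              random.randrange(len(seq) + 1),
              random.choice(["big", "sat", "mat"]))]
    seq = apply_edits(seq, edits)
    print(seq)
```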
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n
21 replies · 146 reposts · 843 likes
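One natural instantiation of a self-provided reward is self-consistency: reward each sample by agreement with the majority answer over the model's own samples. A minimal sketch under that assumption (SRT's exact details may differ):

```python
# Majority-vote pseudo-reward over a model's own sampled answers.
from collections import Counter

def self_reward(sampled_answers):
    """Reward each sample by agreement with the majority answer."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

print(self_reward(["12", "12", "15", "12"]))  # [1.0, 1.0, 0.0, 1.0]
```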
LLOKI (a variant of Loki):
Integrating spatial transcriptomics across platforms is hard - different gene panels, sparse data. We introduce LLOKI, using optimal transport + single-cell FMs for unified ST integration. Work led by @ellie_haber (@mldcmu) & Lane Fellow @SpencerKrieger
1 reply · 1 repost · 8 likes
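For intuition on the optimal-transport piece, a hand-rolled entropic OT (Sinkhorn) alignment between two toy embedding sets. The random matrices stand in for single-cell foundation-model embeddings, and the uniform masses and regularization value are assumptions, not LLOKI's settings.

```python
# Entropic OT (Sinkhorn) coupling cells from two platforms via embeddings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))   # platform A: 5 cells, toy embeddings
Y = rng.standard_normal((7, 16))   # platform B: 7 cells, toy embeddings

C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-C / 10.0)                                 # Gibbs kernel, reg = 10.0
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)           # uniform cell masses

u = np.ones(5)
for _ in range(200):                                  # Sinkhorn iterations
    v = b / (K.T @ u)
    u = a / (K @ v)

P = u[:, None] * K * v[None, :]                       # transport plan (coupling)
print(P.sum(), P.sum(axis=1))                         # ~1, and row sums ~ a
```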
When the ddl is approaching and you are violently editing something you wrote a while ago
0 replies · 0 reposts · 17 likes
Check out our paper for all the details! 🔗 Project page: https://t.co/ZWzBaF0QVM 📄 arXiv: https://t.co/gShCTkrWia 💻 Code: https://t.co/5sU0tGgkm1 w/ @AlexRobey23 @smiurtitkii @yidingjiang @jnwilliams_cmu @pappasg69 @HamedSHassani @mittu1204 @rsalakhu @zicokolter 8/8
github.com
The official implementation of PRISM: Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation - KellyYutongHe/prism_demo
1 reply · 1 repost · 6 likes
🧩 Multi-concept generation becomes intuitive too! PRISM lets you easily identify and combine different components from reference images into coherent scenes. 7/🧵
1 reply · 0 reposts · 2 likes
✏️ Want to tweak your generated image? You can simply edit PRISM-generated prompts directly! Change outfits, poses, or backgrounds: just modify the part of the prompt you want. No more mysterious embeddings or gibberish prompts! 6/🧵
1 reply · 0 reposts · 4 likes
📊 In our experiments, PRISM outperforms/matches baselines across popular text-to-image models (Stable Diffusion, DALL-E, Midjourney) on T2I personalization, creating human-readable prompts with superior visual accuracy—no more manual prompt engineering headaches! 5/🧵
1 reply · 0 reposts · 2 likes
🧠 Our solution: Inspired by LLM jailbreaking (yup, really), PRISM iteratively refines human-readable prompts via in-context learning with VLMs. A magic trio powers this feedback loop: a Prompt Engineer Assistant (VLM), any T2I generator, and a Judge (another VLM)! 4/🧵
1 reply · 0 reposts · 3 likes
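Putting that trio into code shape: a hedged sketch of a PRISM-style feedback loop. The three callables are hypothetical placeholders for the VLM prompt engineer, the black-box T2I generator, and the VLM judge, not the API of the released repo.

```python
# Hedged sketch of a PRISM-style loop; callables are hypothetical placeholders.
def prism_loop(reference_image, propose, generate, judge, n_iters=5):
    """propose(ref, history) -> prompt; generate(prompt) -> image;
    judge(ref, image) -> (score, critique)."""
    history, best = [], (float("-inf"), None)
    for _ in range(n_iters):
        prompt = propose(reference_image, history)        # VLM prompt engineer
        image = generate(prompt)                          # any black-box T2I model
        score, critique = judge(reference_image, image)   # VLM judge
        history.append((prompt, score, critique))         # in-context feedback
        best = max(best, (score, prompt), key=lambda t: t[0])
    return best[1]                                        # best readable prompt

# Smoke test with trivial stand-ins:
print(prism_loop("ref",
                 lambda r, h: f"prompt #{len(h)}",
                 lambda p: p.upper(),
                 lambda r, i: (len(i), "make it closer to ref")))
```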