Ingo Ziegler
@IngoZiegler
Followers: 112 · Following: 773 · Media: 10 · Statuses: 95
ELLIS PhD Student at University of Copenhagen (NLP, Representation Learning, Generative Modeling)
Copenhagen, Denmark
Joined November 2012
I will be at #EMNLP2025 to present our TACL paper on synthetic data generation as an Oral! 📅 Presentation: Wednesday, 5 November 🕠 Time: 5:30 PM local time 📍 Location: Hall A102–103 🤝 Project was done together with @akoksal_ @delliott @HinrichSchuetze See you in Suzhou 🇨🇳
@IngoZiegler will present a synthetic data generation framework that rewrites real retrieved documents into task-specific finetuning examples. CRAFT is more stable than existing techniques like Self-Instruct and Evol-Instruct across several tasks. Paper: https://t.co/8CBshW6wwv
We show that structuring sequences of images and text in a multi-turn conversation style is very effective at improving the sequential reasoning ability of multimodal LLMs! Now accepted at @wacv_official. See you in Arizona🌵#WACV2026
Our paper ImageChain (with @IngoZiegler & @delliott) was accepted at #WACV2026! We explore how multimodal LLMs reason over sequences of images. I'll present it at the @_LXAI Workshop @NeurIPS 🇲🇽 (Nov 30, ~10:45 Mexico City time). Come chat if you're there! 🫶 📄 https://t.co/iQdaNZcDsn
Can we find weight directions that modify an LLM's behavior? Our new paper proposes contrastive weight steering, an alternative to activation steering for modifying behaviors using small, narrow-distribution datasets 🕹️ 🧵👇
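One plausible reading of "contrastive weight steering", as a minimal sketch under my own assumptions (not the paper's released code): finetune two copies of a model on small datasets exhibiting opposite behaviors, take the weight-space difference as the steering direction, and add a scaled copy of it to the base weights. All names below are illustrative.

```python
import torch

def steer_weights(base_sd, pos_sd, neg_sd, alpha=1.0):
    """base_sd: state_dict of the base model; pos_sd / neg_sd: state_dicts
    of copies finetuned on positive / negative behavior data."""
    steered = {}
    for name, w in base_sd.items():
        # Contrasting the two finetunes isolates a behavior direction
        # in weight space; alpha controls how strongly it is applied.
        direction = pos_sd[name] - neg_sd[name]
        steered[name] = w + alpha * direction
    return steered
```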
1/2) I am very happy to finally share something I have been working on, on and off, for the past year: "The Information Dynamics of Generative Diffusion". This paper connects entropy production, the divergence of vector fields, and spontaneous symmetry breaking in a unified framework.
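For context on how the first two notions can be linked, here is a standard identity derived from the Fokker-Planck equation (my own sketch, not a formula quoted from the paper): for a diffusion $dx = f(x,t)\,dt + g(t)\,dW$ with marginal density $p_t$, the entropy production splits into the expected divergence of the drift field plus a non-negative Fisher-information term:

```latex
% Entropy production for the diffusion dx = f(x,t) dt + g(t) dW:
% expected divergence of the drift plus a Fisher-information term.
\frac{\mathrm{d}H(p_t)}{\mathrm{d}t}
  = \mathbb{E}_{p_t}\!\bigl[\nabla \cdot f(x,t)\bigr]
  + \frac{g(t)^2}{2}\,
    \mathbb{E}_{p_t}\!\bigl[\lVert \nabla_x \log p_t(x) \rVert^2\bigr]
```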
Excited to share that our paper "Multilingual Pretraining for Pixel Language Models" has been accepted to the #EMNLP2025 main conference! Please see the thread below and the paper itself for more details.
Announcing our recent work “Multilingual Pretraining for Pixel Language Models”! We introduce PIXEL-M4, a pixel language model pretrained on four visually & linguistically diverse scripts: English, Hindi, Ukrainian & Simplified Chinese. #NLProc
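The core idea behind pixel language models, in a minimal sketch (my illustration, not the PIXEL-M4 code; the font choice and image size are assumptions): render text as an image and let a patch-based encoder consume pixels, so no subword vocabulary has to cover all four scripts.

```python
from PIL import Image, ImageDraw, ImageFont

def render_line(text, width=529, height=16):
    """Render a line of text as a small RGB image; a ViT-style encoder
    then reads it as fixed-size patches rather than as tokens."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # A real multilingual setup needs fonts covering every script;
    # the default PIL font is only a stand-in here.
    draw.text((2, 2), text, fill="black", font=ImageFont.load_default())
    return img

render_line("hello नमस्ते Привіт 你好").save("line.png")
```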
📄 Read the paper: https://t.co/JUIjOc1zhi 💻 Code: https://t.co/Lptkv6usfy 📂 Dataset: StoryFrames is now available on @huggingface: https://t.co/RI8jZbg2ly This work was done in collaboration with @danaesavi and @delliott (6/6)
📊 To enable this task, we also introduce StoryFrames—a new dataset designed for sequential image reasoning! 🔹 8,881 curated samples from real-world videos 🔹 Human-annotated 🔹 Temporally coherent scene descriptions 🔹 Enables MLLMs to learn structured event progression (5/6)
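A hedged sketch of how one might load StoryFrames from the Hub (the tweet's link is shortened, so the repo id and field layout below are assumptions, not confirmed):

```python
from datasets import load_dataset

# "ingoziegler/StoryFrames" is a hypothetical repo id for illustration.
ds = load_dataset("ingoziegler/StoryFrames", split="train")
print(len(ds))  # 8,881 curated samples, per the announcement
sample = ds[0]  # expected: video frames + human-annotated scene descriptions
```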
🔥 ImageChain dominates across all conversation context lengths! 📈 Compared to current MLLMs & standard fine-tuning, ImageChain lifts performance from a 3% baseline up to 19% at generating descriptions similar to human-written ground truths! (4/6)
🔍 ImageChain solves this by treating image sequences as structured multi-turn conversations. ✨ Key ideas ✅ Images are interleaved with textual descriptions ✅ A next-scene-description task optimizes temporal understanding ✅ Instruction-tuning over multi-turn conversations (3/6)
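A minimal sketch of that structure (my illustration under an OpenAI-style chat schema; the roles and keys are assumptions, not the authors' exact format):

```python
def sequence_to_conversation(frames, descriptions):
    """Interleave an image sequence with scene descriptions as
    alternating user/assistant turns."""
    messages = []
    for i, (frame, desc) in enumerate(zip(frames, descriptions)):
        # Each user turn carries the next image in the sequence...
        messages.append({
            "role": "user",
            "content": [{"type": "image", "image": frame},
                        {"type": "text", "text": f"Describe scene {i + 1}."}],
        })
        # ...and the assistant turn answers with its description, so the
        # model is trained on next-scene description given prior turns.
        messages.append({"role": "assistant", "content": desc})
    return messages
```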
Why does sequential reasoning matter? Most MLLMs process images independently, failing to capture temporal dependencies. This limits their ability to understand actions, predict future events, and perform well in real-world applications like robotics and storytelling. (2/6)
📢 New paper out! Today we shared our latest work on improving sequential reasoning in multimodal models! Introducing ImageChain 🖼️ ⛓️, a framework that models visual sequences as multi-turn conversations. 🧵(1/6)
Our new paper is out! 🖼️➡️📝 We introduce ImageChain, a framework that enhances multimodal LLMs with sequential image reasoning 📄 Arxiv: https://t.co/YGKVIn7NDG Work with @IngoZiegler and @delliott
#AI #Multimodal #NLP #MLLM #ComputerVision #ImageChain
This work was done in collaboration with @akoksal_, @delliott, and @HinrichSchuetze 🤝 🤗 Full collection of datasets and checkpoints hosted on @huggingface: https://t.co/wPSxbx4xQX 📄Paper: https://t.co/FTzTfZbSio 💻Code, Datasets, LoRAs: https://t.co/M8CElAbu3c (5/5)
Additional bonus: Strong out-of-domain generalization 💪🌍 CRAFT's synthetic datasets lead to more robust models with better generalization capabilities than training on in-domain datasets, even when those datasets are human-curated 🧑‍💻 (4/5)
Result highlights: 📊 Outperforms or matches instruction-following LLMs on QA tasks 📈 A 46-point preference improvement over human-curated data for summarization ⬆️ Consistent performance gains when scaling up data size (3/5)
🔍 How does CRAFT work? 1️⃣ User provides a few examples demonstrating the task & desired format 2️⃣ Top-k retrieval finds relevant docs in public corpora 3️⃣ LLMs augment the retrieved docs into synthetic samples 4️⃣ The resulting dataset is used for fine-tuning ✅ Done (2/5)
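A compact sketch of those four steps (my illustration, not the released CRAFT code; the encoder model, corpus, and prompt are placeholder assumptions):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder retriever

def craft(few_shots, corpus, generate, k=3):
    """few_shots: example strings in the desired task format;
    corpus: list of public documents; generate: any LLM call."""
    # 2) Top-k similarity search: docs resembling the few-shot examples.
    doc_emb = encoder.encode(corpus, convert_to_tensor=True)
    query_emb = encoder.encode(few_shots, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_emb, top_k=k)
    retrieved = {corpus[h["corpus_id"]] for per_query in hits for h in per_query}
    # 3) In-context augmentation: rewrite each doc into a task sample.
    prompt = "Rewrite the document into an example like these:\n" + "\n".join(few_shots)
    return [generate(f"{prompt}\n\nDocument:\n{doc}") for doc in retrieved]
    # 4) Fine-tune on the returned synthetic dataset.
```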
📢 Today we release CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning. CRAFT is a framework for generating synthetic, scalable, and task-specific datasets with LLMs 📚 It relies only on public corpora, similarity search, and augmentation through in-context learning. (1/5)
[CL] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation I Ziegler, A Köksal, D Elliott, H Schütze [University of Copenhagen & LMU Munich] (2024) https://t.co/gGx9xawzDR
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation Presents a method for generating task-specific synthetic datasets using user-provided few-shot examples. 📝 https://t.co/Iz5nFXvyOA 👨🏽💻 https://t.co/O6UYmLygkI
🧪Did you miss or want to rewatch our captivating talk on Protein Language Models by @amelie_iska? Good news! The event is now available on YouTube for you to rewatch and dive into the fascinating world of ESM-2 and its variants. 👉Check it out here: https://t.co/GJC3NRdy2P