Arkadiy Saakyan

@rkdsaakyan

Followers
174
Following
699
Media
19
Statuses
63

PhD student @ColumbiaCompSci @columbianlp working on human-AI collaboration, AI creativity and explainability. prev. intern @GoogleDeepMind, @AmazonScience

Manhattan, NY
Joined September 2021
@rkdsaakyan
Arkadiy Saakyan
4 days
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
1
13
43
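To make the metric under discussion concrete, here is a minimal illustrative sketch of n-gram novelty: the fraction of a text's n-grams that never appear in a reference corpus. This is a simplified assumption of how such a metric works, not the paper's actual Creativity Index implementation (which operates over large pretraining corpora); all function names here are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text: str, corpus: Iterable[str], n: int = 5) -> float:
    """Fraction of the text's n-grams absent from the reference corpus."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        seen.update(ngrams(doc.lower().split(), n))
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    novel = sum(1 for g in grams if g not in seen)
    return novel / len(grams)
```

Note the thread's point: a high score from a function like this says nothing about whether the novel n-grams make sense in context, which is exactly the gap the paper probes.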
@sarahwiegreffe
Sarah Wiegreffe
4 days
I am recruiting 2 PhD students to work on LM interpretability at UMD @umdcs starting in fall 2026! We are #3 in AI and #4 in NLP research on @CSrankings. Come join us in our lovely building just a few miles from Washington, D.C. Details in 🧵
13
157
722
@rkdsaakyan
Arkadiy Saakyan
4 days
On the OOD dataset StyleMirror, we find that LLM-judge novelty scores are more strongly associated with expert preferences than a previously proposed n-gram novelty metric, Creativity Index, suggesting our operationalization yields a better-aligned metric for textual creativity.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
Writing-quality reward model scores are associated with both creativity and pragmaticality judgements, but are not interpretable. LLM judges can replicate some expert novelty judgements but struggle to identify non-pragmatic expressions.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
In a follow-up study with GPT-5 and Claude, we observe that the rate of human-judged creative expression in AI-written text is significantly lower than in human-written text.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
Further, we find that both open-source model families tested (OLMo 1 and OLMo 2, at 7B and 32B sizes) exhibit a negative relationship between n-gram novelty and pragmaticality: as open-source LLMs try to generate text not present in their training data, their expressions tend to make less sense in context.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
N-gram novelty is not a reliable metric of creativity: over *90%* of top-quartile n-gram novelty expressions were not judged as creative. We find many examples of low n-gram novelty expressions rated creative and high n-gram novelty expressions rated as non-pragmatic.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
We recruit expert writers with MFA/MA/PhD backgrounds. They rated expressions in human-written and AI-generated passages (from fully open-source (code + data) OLMo models @allenai) for whether they make sense, are pragmatic, and are novel; they could also highlight any creative expressions.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
The standard definition of creativity states the product has to be both novel AND appropriate. Similarly, we operationalize textual creativity as human-judged expression novelty AND appropriateness: sensicality (making sense by itself) and pragmaticality (making sense in context).
1
0
1
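The operationalization above reduces to a simple conjunction over expert judgments. A minimal sketch, assuming a per-expression judgment record (the field and function names are hypothetical, not from the paper):

```python
from dataclasses import dataclass


@dataclass
class ExpressionJudgment:
    novel: bool      # expert judged the expression novel
    sensical: bool   # makes sense by itself
    pragmatic: bool  # makes sense in its surrounding context


def is_creative(j: ExpressionJudgment) -> bool:
    # Creativity = novelty AND appropriateness (sensicality + pragmaticality):
    # a novel expression that fails either appropriateness check is not creative.
    return j.novel and j.sensical and j.pragmatic
```

This conjunction is what separates the paper's metric from raw n-gram novelty, which checks only the first condition.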
@jennajrussell
Jenna Russell
17 days
AI is already at work in American newsrooms. We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea. Here's what we learned about how AI is influencing local and national journalism:
4
52
143
@TuhinChakr
Tuhin Chakrabarty
18 days
🚨New paper on AI and copyright. Several authors have sued LLM companies for allegedly using their books without permission for model training. 👩‍⚖️Courts, however, require empirical evidence of harm (e.g., market dilution). Our new pre-registered study addresses exactly this
9
171
524
@joodalooped
judah
3 months
frontier model still worse than text-davinci-001 who would have thought?
82
121
2K
@METR_Evals
METR
4 months
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
236
1K
7K
@sarahwiegreffe
Sarah Wiegreffe
5 months
A bit late to announce, but I’m excited to share that I'll be starting as an assistant professor at the University of Maryland @umdcs this August. I'll be recruiting PhD students this upcoming cycle for fall 2026. (And if you're a UMD grad student, sign up for my fall seminar!)
70
50
608
@chautmpham
Chau Minh Pham
5 months
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
6
32
122
@vishakh_pk
Vishakh Padmakumar
6 months
What does it mean for #LLM output to be novel? In work w/ @jcyhc_ai, @JanePan_, @valeriechen_, @hhexiy we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
2
29
83
@rkdsaakyan
Arkadiy Saakyan
6 months
Even powerful models achieve only 50% explanation adequacy rate, suggesting difficulties in reasoning about figurative inputs. Hallucination & unsound reasoning are the most prominent error categories.
1
0
0
@rkdsaakyan
Arkadiy Saakyan
6 months
We find that:
1. VLMs struggle to generalize from literal to figurative meaning understanding (training on e-ViL achieves only random F1 on our task)
2. Figurative meaning in the image is harder to explain than when it is in the text
3. VLMs benefit from image data in fine-tuning
1
0
0