Arkadiy Saakyan

@rkdsaakyan

Followers
174
Following
699
Media
19
Statuses
63

PhD student @ColumbiaCompSci @columbianlp working on human-AI collaboration, AI creativity and explainability. prev. intern @GoogleDeepMind, @AmazonScience

Manhattan, NY
Joined September 2021
@rkdsaakyan
Arkadiy Saakyan
4 days
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
1
13
43
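To make the metric under discussion concrete, here is a minimal illustrative sketch of n-gram novelty: the fraction of a text's n-grams that never appear in a reference corpus. This is a simplified assumption of how such a metric works, not the paper's actual Creativity Index implementation (which operates over large pretraining corpora); all function names here are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text: str, corpus: Iterable[str], n: int = 5) -> float:
    """Fraction of the text's n-grams absent from the reference corpus."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        seen.update(ngrams(doc.lower().split(), n))
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    novel = sum(1 for g in grams if g not in seen)
    return novel / len(grams)
```

Note the thread's point: a high score from a function like this says nothing about whether the novel n-grams make sense in context, which is exactly the gap the paper probes.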
@sarahwiegreffe
Sarah Wiegreffe
4 days
I am recruiting 2 PhD students to work on LM interpretability at UMD @umdcs starting in fall 2026! We are #3 in AI and #4 in NLP research on @CSrankings. Come join us in our lovely building just a few miles from Washington, D.C. Details in 🧵
13
157
722
@rkdsaakyan
Arkadiy Saakyan
4 days
On the OOD dataset StyleMirror, we find that LLM-judge novelty scores are more strongly associated with expert preferences than a previously proposed n-gram novelty metric, Creativity Index, suggesting our operationalization yields a better-aligned metric for textual creativity.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
Writing-quality reward model scores are associated with both creativity and pragmaticality judgements, but are not interpretable. LLM judges can replicate some expert novelty judgements but struggle to identify non-pragmatic expressions.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
In a follow-up study with GPT-5 and Claude, we observe that the rate of human-judged creative expression in AI-written text is significantly lower than in human-written text.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
Further, we find that both open-source model families tested (OLMo 1 and OLMo 2, at 7B and 32B sizes) exhibit a negative relationship between n-gram novelty and pragmaticality: as open-source LLMs try to generate text not present in their training data, their expressions tend to make less sense in context.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
N-gram novelty is not a reliable metric of creativity: over *90%* of top-quartile n-gram novelty expressions were not judged as creative. We find many examples of low n-gram novelty expressions rated creative and high n-gram novelty expressions rated as non-pragmatic.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
We recruit expert writers with MFA/MA/PhD backgrounds. They rated expressions in human-written and AI-generated passages (from fully open-source (code + data) OLMo models @allenai) for whether they make sense, are pragmatic, and are novel; they could also highlight any creative expressions.
1
0
1
@rkdsaakyan
Arkadiy Saakyan
4 days
The standard definition of creativity states the product has to be both novel AND appropriate. Similarly, we operationalize textual creativity as human-judged expression novelty AND appropriateness: sensicality (making sense by itself) and pragmaticality (making sense in context).
1
0
1
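The operationalization above reduces to a simple conjunction over expert judgments. A minimal sketch, assuming a per-expression judgment record (the field and function names are hypothetical, not from the paper):

```python
from dataclasses import dataclass


@dataclass
class ExpressionJudgment:
    novel: bool      # expert judged the expression novel
    sensical: bool   # makes sense by itself
    pragmatic: bool  # makes sense in its surrounding context


def is_creative(j: ExpressionJudgment) -> bool:
    # Creativity = novelty AND appropriateness (sensicality + pragmaticality):
    # a novel expression that fails either appropriateness check is not creative.
    return j.novel and j.sensical and j.pragmatic
```

This conjunction is what separates the paper's metric from raw n-gram novelty, which checks only the first condition.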
@jennajrussell
Jenna Russell
17 days
AI is already at work in American newsrooms. We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea. Here's what we learned about how AI is influencing local and national journalism:
4
52
143
@TuhinChakr
Tuhin Chakrabarty
18 days
🚨New paper on AI and copyright. Several authors have sued LLM companies for allegedly using their books without permission for model training. 👩‍⚖️Courts, however, require empirical evidence of harm (e.g., market dilution). Our new pre-registered study addresses exactly this
9
171
524
@joodalooped
judah
3 months
frontier model still worse than text-davinci-001 who would have thought?
82
121
2K
@METR_Evals
METR
4 months
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
236
1K
7K
@sarahwiegreffe
Sarah Wiegreffe
5 months
A bit late to announce, but I’m excited to share that I'll be starting as an assistant professor at the University of Maryland @umdcs this August. I'll be recruiting PhD students this upcoming cycle for fall 2026. (And if you're a UMD grad student, sign up for my fall seminar!)
70
50
608
@chautmpham
Chau Minh Pham
5 months
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
6
32
122
@vishakh_pk
Vishakh Padmakumar
6 months
What does it mean for #LLM output to be novel? In work w/ @jcyhc_ai, @JanePan_, @valeriechen_, @hhexiy we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
2
29
83
@rkdsaakyan
Arkadiy Saakyan
6 months
Even powerful models achieve only 50% explanation adequacy rate, suggesting difficulties in reasoning about figurative inputs. Hallucination & unsound reasoning are the most prominent error categories.
1
0
0
@rkdsaakyan
Arkadiy Saakyan
6 months
We find that:
1. VLMs struggle to generalize from literal to figurative meaning understanding (training on e-ViL achieves only random F1 on our task)
2. Figurative meaning in the image is harder to explain than when it is in the text
3. VLMs benefit from image data in fine-tuning
1
0
0