Letta (@Letta_AI)
Stateful agents that remember and learn https://t.co/4upATozfXj
San Francisco, CA · Joined August 2024
4K Followers · 575 Following · 157 Media · 452 Statuses
What if we evaluated agents less like isolated code snippets, and more like humans - where behavior depends on the environment and lived experiences? Introducing Letta Evals: a fully open source evaluation framework for stateful agents
The next Stateful Agents Meetup is about self-improving coding agents! Come learn about Letta Code and how you can use it to build infinitely long-lived agents that learn your codebase as they improve it. November 20th, 2025 in San Francisco! Register: https://t.co/Y2mG6Lclb2
Can every AI model learn to use Skills? @Letta_AI has released the Context-Bench Skills evaluation benchmark to test whether AI models can "pick up and learn skills" the way humans do. The core question: AI …
Last week we launched Context-Bench, a new leaderboard that measures how good AI models are at Agentic Context Engineering. This week, we're expanding Context-Bench with a new addition: Context-Bench Skills.
Claude Skills might be the new MCP - but does it work outside of @AnthropicAI? Find out with the "Skills Suite" in Context-Bench, our benchmark for Agentic Context Engineering. GPT-5 and GLM 4.6 excel at skill use, but smaller models (e.g. GPT-5-mini) struggle
As part of our evaluation, we built skills into Letta Code, a model-agnostic harness that lets any LLM leverage skills, so you can experiment with context mounting using any model. For more, read our full write-up on evaluating skills:
letta.com
Today we're releasing Skill Use, a new evaluation suite inside of Context-Bench that measures how well models discover and load relevant skills from a library to complete tasks.
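A rough sketch of the discovery pattern this suite measures, in Python. The skills/ directory layout, the SKILL.md frontmatter fields, and the choose() call are illustrative assumptions rather than the Context-Bench harness itself; the point is that only short skill descriptions sit in context until the model decides to load a full skill body.

```python
# Hypothetical sketch of skill discovery: keep only each skill's name and
# description in context, then load the full SKILL.md body for the one skill
# the model selects for the task.
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict:
    """Parse a minimal '---'-delimited frontmatter block (name, description)."""
    text = skill_md.read_text()
    meta = {}
    if text.startswith("---"):
        header = text.split("---", 2)[1]
        for line in header.strip().splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def list_skills(library: Path) -> list[dict]:
    """One cheap entry per skill: name + one-line description, not the full body."""
    entries = []
    for skill_md in sorted(library.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        entries.append({
            "name": meta.get("name", skill_md.parent.name),
            "description": meta.get("description", ""),
            "path": skill_md,
        })
    return entries


def load_skill(entry: dict) -> str:
    """Pull the full instructions into context only once the skill is chosen."""
    return entry["path"].read_text()


# Usage, assuming a skills/ directory and some choose() policy (e.g. an LLM call):
# skills = list_skills(Path("skills"))
# chosen = choose(task="extract tables from a PDF", options=skills)  # hypothetical
# mounted_instructions = load_skill(chosen)
```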
Anthropic recently released a set of open source skills to teach Claude to do various tasks, but they never released a quantitative evaluation of skill use. It turns out: many frontier models (not just Claude) are capable of effective skill acquisition.
Context-Bench Skills measures whether or not an agent is capable of acquiring and utilizing specialized knowledge. We call this concept "context mounting" - similar to how you mount a storage volume or USB drive to a computer.
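To make the mounting analogy concrete, here is a toy illustration (not Letta's implementation) of attaching and detaching a block of specialized knowledge under a fixed context budget, the way you would mount and unmount a volume.

```python
# Toy "context mounting": attach and detach knowledge blocks the way you would
# mount and unmount a volume. The 4-chars-per-token estimate and the default
# budget are illustrative assumptions, not a real tokenizer or Letta's internals.
class MountedContext:
    def __init__(self, token_budget: int = 8_000):
        self.token_budget = token_budget
        self.mounts: dict[str, str] = {}  # label -> mounted text

    @staticmethod
    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # crude heuristic

    def used(self) -> int:
        return sum(self.estimate_tokens(t) for t in self.mounts.values())

    def mount(self, label: str, text: str) -> None:
        if self.used() + self.estimate_tokens(text) > self.token_budget:
            raise RuntimeError(f"mounting '{label}' would exceed the context budget")
        self.mounts[label] = text

    def unmount(self, label: str) -> None:
        self.mounts.pop(label, None)

    def render(self) -> str:
        """The block of text that would be prepended to the agent's prompt."""
        return "\n\n".join(
            f"<{label}>\n{text}\n</{label}>" for label, text in self.mounts.items()
        )


# ctx = MountedContext()
# ctx.mount("pdf-skill", open("skills/pdf/SKILL.md").read())
# ... run the task ...
# ctx.unmount("pdf-skill")  # free the budget once the skill is no longer needed
```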
The Letta Office Hours recording is now available. It covers:
- V1 SDK breaking changes (snake case, pagination, shared archives, project scoping)
- The AI Memory SDK v0.2
- Our agent Ezra
- The Letta Code link/unlink feature
- Agent scheduling
https://t.co/3oabYZgoDp
"You obviously cannot learn if you have no memory." @sarahwooders from Letta cuts to the core of why current LLM agents struggle to evolve beyond workflows. It's a fundamental limitation many builders are grappling with.
Last week we announced Letta Evals. Here's a video on how to use it. You'll learn simple Q&A testing, rubric-based grading, and multi-turn memory verification. https://t.co/eQgoeGVEhE
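For readers who prefer code to video, below is a minimal sketch of the multi-turn memory verification described there. It is written against generic send and judge callables rather than the actual Letta Evals API, so those names are assumptions.

```python
# Minimal sketch of a multi-turn memory check: teach the agent a fact in turn 1,
# ask about it in turn 2, grade the answer. `send` and `judge` are hypothetical
# callables standing in for your agent harness and an LLM (or string) grader.
from typing import Callable


def memory_verification(
    send: Callable[[str], str],         # sends one user turn, returns the agent's reply
    judge: Callable[[str, str], bool],  # (rubric, answer) -> pass/fail
) -> bool:
    # Turn 1: give the agent something to remember.
    send("My favorite database is DuckDB. Please remember that.")
    # Turn 2: check the fact survived into a later turn, i.e. it was carried
    # in the agent's state rather than just echoed back immediately.
    answer = send("Quick check: which database did I say was my favorite?")
    rubric = "The answer names DuckDB as the user's favorite database."
    return judge(rubric, answer)


# Example wiring with trivial stand-ins:
# passed = memory_verification(
#     send=lambda text: my_agent.step(text),               # hypothetical agent API
#     judge=lambda rubric, ans: "duckdb" in ans.lower(),   # exact-match fallback grader
# )
```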
Context-Bench shows promising results for the open source community: the gap between frontier open-weights models and closed-weights models appears to be closing. Read our breakdown of the benchmark at https://t.co/n7FaK4TLhh See the live leaderboard at https://t.co/GC8jS41nCf
Context-Bench also measures total cost to complete the benchmark. Surprisingly, raw token costs ($/million tokens) do not map directly to total cost. GPT-5 has lower per-token cost than Sonnet 4.5, but costs more in the benchmark because GPT-5 agents are more "token hungry".
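A quick worked example of why per-token price and total benchmark cost can diverge; the prices and token counts below are made up for illustration, not the actual Context-Bench or model figures.

```python
# Hypothetical numbers: model A is cheaper per token but much more "token hungry",
# so it ends up costing more across the whole benchmark.
def run_cost(price_per_m_tokens: float, tokens_per_task: int, tasks: int) -> float:
    return price_per_m_tokens * tokens_per_task * tasks / 1_000_000


cheap_but_hungry = run_cost(price_per_m_tokens=1.25, tokens_per_task=400_000, tasks=100)
pricier_but_lean = run_cost(price_per_m_tokens=3.00, tokens_per_task=120_000, tasks=100)

print(f"lower $/Mtok model:  ${cheap_but_hungry:,.2f}")   # $50.00 total
print(f"higher $/Mtok model: ${pricier_but_lean:,.2f}")   # $36.00 total
```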
Our goal in creating Context-Bench is to construct a benchmark that (1) is contamination-proof, (2) measures "deep" multi-turn tool calling, and (3) has controllable difficulty. In its present state, the benchmark is far from saturated - the top model (Sonnet 4.5) scores 74%.
Agentic context engineering is the new frontier in AI agent capabilities. Models that are post-trained specifically for context engineering excel at long-horizon tasks where the task length far exceeds the native context window of the LLMs themselves. So which models do it best?
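As a simplified picture of what context engineering involves when a task outlives the context window: something has to decide what stays in the window verbatim and what gets folded into a summary. The sketch below is a generic illustration, not how any particular model or the Letta runtime handles it.

```python
# Toy context-engineering loop: keep recent turns verbatim and fold older turns
# into a running summary, so a task can run far past the raw context window.
# `summarize` is a hypothetical LLM call; the token estimate is a crude heuristic.
from collections import deque


def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)


class WorkingContext:
    def __init__(self, summarize, window_tokens: int = 6_000, keep_recent: int = 6):
        self.summarize = summarize          # (old_summary, evicted_turn) -> new summary
        self.window_tokens = window_tokens
        self.keep_recent = keep_recent
        self.summary = ""
        self.turns = deque()                # recent turns kept verbatim

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict oldest turns into the summary until the window fits again.
        while (approx_tokens(self.summary) + sum(map(approx_tokens, self.turns))
               > self.window_tokens and len(self.turns) > self.keep_recent):
            evicted = self.turns.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def render(self) -> str:
        parts = [f"[summary of earlier work]\n{self.summary}"] if self.summary else []
        return "\n\n".join(parts + list(self.turns))
```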
Today we're releasing Context-Bench, an open benchmark for agentic context engineering. Context-Bench evaluates how well language models can chain file operations, trace entity relationships, and manage long-horizon multi-step tool calling.
We are changing the format of our weekly office hours. Now you'll be able to join us on YouTube. 11:30am PST on Thursdays, every week. Link below.
Here's our team checking out the new office. We're looking forward to hosting you all for our meetups -- we finally have room!
We're hiring researchers & engineers at @Letta_AI to work on AI's hardest problem: memory. Join us to work on finding the right memory representations & learning methods (both in-context and in-weights) required to create self-improving AI systems with LLMs. We're an open AI …
jobs.ashbyhq.com
Research Engineer / Research Scientist at Letta
Super excited about this release: Letta Evals is the first evals platform *purpose-built* for stateful agents. What does that actually mean? When you eval agents w/ Letta Evals, you can literally pull an agent out of production (by cloning a replica of its active state), …
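A rough sketch of what "evaluate a replica of a production agent" could look like. The export_state, import_state, delete_agent, and run_suite callables are placeholders for whatever snapshot and eval hooks your setup exposes; they are not a documented Letta Evals API.

```python
# Hypothetical clone-then-eval flow for a stateful agent: snapshot the live
# agent's state, rebuild a disposable replica from it, and run the eval suite
# against the replica so production state is never touched.
def eval_production_agent(export_state, import_state, delete_agent, run_suite,
                          agent_id: str) -> dict:
    """All four callables are placeholders for your own server / harness hooks."""
    snapshot = export_state(agent_id)     # memory blocks, message history, tool config...
    replica_id = import_state(snapshot)   # fresh agent seeded with the cloned state
    try:
        return run_suite(replica_id)      # e.g. Q&A, rubric grading, memory checks
    finally:
        delete_agent(replica_id)          # throw the replica away afterwards


# report = eval_production_agent(my_export, my_import, my_delete, my_suite,
#                                agent_id="agent-123")
```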