Evidently AI
@EvidentlyAI
Followers
3K
Following
1K
Media
521
Statuses
2K
Open-source ML and LLM evaluation, testing, and monitoring. GitHub: https://t.co/37H9bfnYj6 Discord: https://t.co/ElZ9RlroUa
Joined February 2020
3, 2, 1... Our free course on LLM evaluations for AI product teams starts today! 7 days of byte-sized videos in your inbox. Certificate upon completion. No coding skills required. 500+ students have signed up. You can still join the course: https://t.co/Go2bNYJXCR
In case you missed it: How to build LLM judges that mirror human judgment? Our tutorial shows how to design, test, and calibrate an evaluator that assesses the quality of LLM-generated code reviews. You can adapt the example for your use case: https://t.co/G1vVvBrXeg
Writing good LLM prompts is tedious. Join us for a webinar on automated prompt optimization: an intro to prompt optimization, how to improve prompts with Evidently open-source, and a live demo. Thanks @DataTalksClub for the invite! Register here: https://t.co/ajSePLx59i
luma.com
Improving LLM prompts using data-driven feedback optimization - Mikhail Sveshnikov Outline: Overview of prompt optimization challenges and common approaches...
In case you missed it: 7 questions about LLM judges! We answered some of the most common questions we get about how LLM judges work and how to use them effectively: https://t.co/LnCc8TyCDZ
evidentlyai.com
LLM-as-a-judge is a popular approach to testing and evaluating AI systems. We answered some of the most common questions about how LLM judges work and how to use them effectively.
A Friday ML use case. From the database of 650 ML & LLM systems: https://t.co/jJoUj6MfFZ How Booking built an AI agent that assists partners by automatically suggesting a relevant response to each guest inquiry. https://t.co/isWy5be7iu
booking.ai
Authors: Ozan Sonmez, Bjorn Burscher, Klaus Schaefers, Basak Eskili
10 AI risks every team should be testing for! Top risks when building AI products. How to mitigate them. Common AI risk assessment frameworks. How to set up a continuous AI testing workflow. Check whether you test for all the critical risks: https://t.co/geKkIALIqq
Detecting and reducing scheming in AI models. AI models can "scheme": act helpful while secretly pursuing other goals. Research from OpenAI and Apollo suggests "deliberate alignment" as a new approach to reducing covert actions. Read the paper: https://t.co/qQleAMICJ5
openai.com
Together with Apollo Research, we developed evaluations for hidden misalignment ("scheming") and found behaviors consistent with scheming in controlled tests across frontier models. We share examples...
In case you missed it: 7 agentic AI examples and use cases. We explore seven agentic AI examples and use cases in the real world, from transaction analysis to e-commerce recommendations to code review: https://t.co/3J48HOhKYd
evidentlyai.com
In this blog, we will explore seven agentic AI examples and use cases in the real world, from transaction analysis to e-commerce recommendations to code review.
A Friday ML use case. From the database of 650 ML & LLM systems: https://t.co/jJoUj6MfFZ How LinkedIn engineered its Hiring Assistant: the architecture, design choices, and lessons learned from building an agentic product. https://t.co/hnA8O4SNrM
linkedin.com
LLM tracing and dataset management are now live in Evidently open-source! The new release unlocks previously closed features: data storage backend, raw dataset management and viewer, and LLM tracing storage and viewer. Try it now: https://t.co/xMnU1xHb7O
How LLMs could be insider threats. Anthropic stress-tested 16 LLMs and identified potentially risky agentic behaviors. Turns out, LLMs can blackmail and leak sensitive info as a means to avoid replacement or achieve their goals. Read the paper: https://t.co/JeLFbbmqSP
anthropic.com
New research on simulated blackmail, industrial espionage, and other misaligned behaviors in LLMs
In case you missed it: Gen AI use cases in 2025: learnings from 650 examples. We highlighted some new patterns of how top companies apply Gen AI based on a database of real-world AI and ML use cases we've been curating: https://t.co/WfL237WONA
evidentlyai.com
Since 2023, we've been curating a database of real-world AI and ML use cases. Here is what we've learned from 650+ examples from top companies.
A Friday ML use case. From the database of 650 ML & LLM systems: https://t.co/jJoUj6MfFZ How Thomson Reuters uses RAG to enhance customer service: support execs can retrieve relevant info from a curated database through a conversational interface. https://t.co/vnmboQoZzz
medium.com
High quality customer support is critical to business success. In this article, we'll explain how we employed an AI-powered solution...
How to align an LLM judge with human labels: new tutorial! We break down the process of designing, testing, and calibrating LLM evaluators. We also show how to create an LLM judge for evaluating the quality of LLM-generated code reviews: https://t.co/G1vVvBrXeg
Why language models hallucinate. OpenAI's research suggests that standard training and evaluation reward guessing over acknowledging uncertainty. Read the paper: https://t.co/9znwcKbcL6
openai.com
OpenAI's new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.
In case you missed it: 25 AI benchmarks: examples of AI models evaluation. A brief explainer of what AI benchmarks are and why we need them, with 25 examples of common AI benchmarks for reasoning, conversation abilities, coding, RAG, and tool use: https://t.co/xPonNn2CN4
evidentlyai.com
In this blog, we'll explore AI benchmarks and why we need them. We'll also provide 25 examples of widely used AI benchmarks for reasoning and language understanding, conversation abilities, coding,...
A Friday ML use case. From the database of 650 ML & LLM systems: https://t.co/jJoUj6MfFZ How Grab combines vector similarity search with LLMs to enhance the relevance and accuracy of search results for complex queries. https://t.co/FBPXG5cLNw
engineering.grab.com
Vector similarity search has revolutionised data retrieval, particularly in the context of Retrieval-Augmented Generation in conjunction with advanced Large Language Models (LLMs). However, it...
7 questions about LLM judges! We answered some of the most common questions we get about how LLM judges work and how to use them effectively: https://t.co/LnCc8TyCDZ
In case you missed it: 8 AI hallucinations examples. We put together eight examples of real-world AI hallucinations, from a transcription tool fabricating texts to citing made-up company policies. Explore the examples: https://t.co/Mz8QQhoy7C
evidentlyai.com
AI hallucinations come in different forms: from giving factually incorrect responses to inventing nonexistent product features or even people. We compiled eight real-life AI hallucination examples.
What's your go-to MLOps stack? The State of MLOps Survey is live, and we're excited to see the first results, where Evidently is the most popular tool for ML monitoring. Kudos to @AxSaucedo for this insightful research! You can still vote here: https://t.co/Rsv1dMgDnd
A Friday ML use case. From the database of 650 ML & LLM systems: https://t.co/jJoUj6MfFZ How Instacart helps users find new products by incorporating LLMs into the search stack to generate discovery-oriented content. https://t.co/kFtDrOtHzu
tech.instacart.com
Authors: Taesik Na, Yuanzheng Zhu, Vinesh Gudla, Jeff Wu, Tejaswi Tenneti Key contributors: Akshay Nair, Benwen Sun, Chakshu Ahuja, Jesse...