OpenCompass
@OpenCompassX
277 Followers · 64 Following · 45 Media · 80 Statuses
OpenCompass focuses on the evaluation and analysis of large language models and vision-language models. GitHub: https://t.co/zF7ycuTXxs
China
Joined April 2024
OpenCompass is now on Twitter (@X), focusing on the evaluation and analysis of Large Language Models and Vision-Language Models. Welcome to star our project:
github.com
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets. - open-compass/opencompass
💬 0 · 🔁 1 · ❤️ 6
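For anyone arriving from the pinned repo: a minimal sketch of what an OpenCompass evaluation config looks like, based on the pattern in the open-compass/opencompass docs. The exact module paths below are illustrative assumptions; check the repo's configs/ directory for the real ones.

```python
# Minimal OpenCompass config sketch (module paths are illustrative).
from mmengine.config import read_base

with read_base():
    # Pull in a prebuilt dataset config and a HuggingFace model config.
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from .models.hf_internlm.hf_internlm2_7b import models as internlm2_models

datasets = gsm8k_datasets   # benchmarks to run
models = internlm2_models   # models to evaluate
# Launch with: python run.py path/to/this_config.py
```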
😉Introducing VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. 🥰It comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. 👇 https://t.co/WLj1BAXgVz
💬 0 · 🔁 0 · ❤️ 0
😀A new study finds that long prompts cause a fidelity–diversity trade-off in leading T2I models: more detail but reduced diversity. 😉To evaluate this issue, the authors introduce LPD-Bench and propose PromptMoG, a training-free approach that enhances diversity by sampling…
💬 0 · 🔁 0 · ❤️ 1
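The post above is cut off, but reading "MoG" as mixture of Gaussians, the diversity trick presumably amounts to sampling new prompt embeddings around several rephrasings rather than encoding one long prompt deterministically. A toy sketch of that sampling idea only; the function name, parameters, and equal-weight isotropic components are my assumptions, not the paper's method.

```python
import numpy as np

def sample_prompt_embeddings(rephrasing_embs: np.ndarray, sigma: float = 0.05,
                             n_samples: int = 4, seed=None) -> np.ndarray:
    """Treat each rephrased-prompt embedding as the mean of a Gaussian
    component (equal weights, isotropic covariance sigma^2 * I) and draw
    new latent prompts from the resulting mixture."""
    rng = np.random.default_rng(seed)
    # Choose a mixture component per sample, then perturb around its mean.
    idx = rng.integers(0, len(rephrasing_embs), size=n_samples)
    noise = rng.normal(0.0, sigma, size=(n_samples, rephrasing_embs.shape[1]))
    return rephrasing_embs[idx] + noise

# rephrasing_embs: (k, d) array of text-encoder outputs for k rephrasings.
```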
😀SpatialSky-Bench is a comprehensive benchmark designed to evaluate the spatial-intelligence capabilities of VLMs in UAV navigation. 😉Two categories: Environmental Perception and Scene Understanding. 😉13 subcategories, including bounding boxes, color, distance, …
💬 0 · 🔁 0 · ❤️ 1
🥰VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. 😊Two-stage evaluation framework: 1. Examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. …
💬 0 · 🔁 0 · ❤️ 1
🥰OutSafe-Bench, the first comprehensive content-safety evaluation suite designed for the multimodal era. 😍Covers 4 modalities: 18,000+ bilingual (ZH/EN) text prompts, 4,500 images, 450 audio clips, and 450 videos. 👏OutSafe-Bench is now part of the Daily Benchmark.
💬 0 · 🔁 1 · ❤️ 2
🚀 OpenCompass Daily Benchmark is live! ✅ Daily updates of the latest AI evaluation papers ✅ AI-powered smart summaries ✅ Available in English & Chinese 😍Stay ahead of AI trends, key insights, and cutting-edge research—all in one place! 🔗 https://t.co/kPchXbAJWz
💬 0 · 🔁 2 · ❤️ 2
🔥China’s Open-source VLMs boom—Intern-S1, MiniCPM-V-4, GLM-4.5V, Step3, OVIS 🧐Join the AI Insight Talk with @huggingface, @OpenCompassX, @ModelScope2022 and @ZhihuFrontier 🚀Tech deep-dives & breakthroughs 🚀Roundtable debates ⏰Aug 21, 5 AM PDT 📺Live: https://t.co/brweSm4yT5
💬 2 · 🔁 3 · ❤️ 18
🚀 Introducing #CompassVerifier: a unified and robust answer verifier for #LLM evaluation and #RLVR! ✨LLM progress is bottlenecked by weak evaluation. Looking for an alternative to rule-based verifiers? CompassVerifier handles multiple domains, including math, science, and…
💬 0 · 🔁 1 · ❤️ 5
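To make the RLVR angle concrete: in verifier-based RL, the reward for a rollout is whether a checker accepts the model's answer, and a model-based verifier slots in where string matching would sit. A hedged sketch under that framing; `verify` here is a hypothetical callable, not CompassVerifier's actual API.

```python
# Hypothetical RLVR reward functions contrasting a rule-based checker
# with a model-based verifier such as CompassVerifier.
def rule_based_reward(answer: str, gold: str) -> float:
    # Brittle: "1/2" vs "0.5", stray units, or LaTeX wrappers all fail.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def verifier_reward(question: str, answer: str, gold: str, verify) -> float:
    # `verify` is assumed to judge semantic equivalence given full context,
    # so differently formatted but correct answers still earn reward.
    return 1.0 if verify(question=question, answer=answer, gold=gold) else 0.0
```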
🥳#CodeCriticBench assesses LLMs' critiquing ability in code-generation and QA tasks. Covering 10 criteria, it features a 4.3k-sample dataset with three difficulty levels and a balanced distribution. 😉CodeCriticBench is now part of the #CompassHub! 😚Feel free to download and…
💬 0 · 🔁 0 · ❤️ 3
🥳#StructFlowBench is a structurally annotated multi-turn benchmark that leverages a structure-driven generation paradigm to enhance the simulation of complex dialogue scenarios. 🥳StructFlowBench is now part of the #CompassHub! 😉Feel free to download and explore it—available…
💬 0 · 🔁 1 · ❤️ 3
😉#VBench is a comprehensive benchmark that evaluates video-generation quality across 16 dimensions, and it also provides a dataset of human preference annotations. 🥳VBench is now part of the #CompassHub! Feel free to download and explore it—available for…
💬 0 · 🔁 0 · ❤️ 2
🥰VLM²-Bench is the first comprehensive benchmark that evaluates vision-language models' (#VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases. 🥳VLM²-Bench is now part of the #CompassHub!
💬 0 · 🔁 0 · ❤️ 6
We've uploaded the AIME 2025 exam, complete with questions and solutions, here: https://t.co/f5Tv8rx0gX. Feel free to test your powerful LLM on this dataset.
huggingface.co
💬 0 · 🔁 0 · ❤️ 4
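The shortened link hides the exact dataset id, so the repo path below is a placeholder; assuming a standard Hugging Face dataset layout, loading it for a quick eval would look roughly like this.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the dataset behind the t.co link above.
aime = load_dataset("opencompass/AIME2025", split="test")
print(aime[0])  # expect fields for the problem statement and its solution
```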
🌟 Exciting News! CompassArena is now back with some major updates: - **Judge Copilot**: an LLM-as-a-Judge tool for model comparisons. 🤖 - **Enhanced Statistical Model**: improved Bradley-Terry accuracy by addressing confounding variables. 📊 - **20+ New LLMs**: a global mix of…
💬 0 · 🔁 2 · ❤️ 5
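Context on the Bradley-Terry item: arena leaderboards typically fit per-model strengths by maximum likelihood over pairwise battles, with P(i beats j) = sigmoid(s_i - s_j). A minimal sketch of that vanilla fit, not CompassArena's pipeline, and without the confounder adjustments the post alludes to.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(battles, n_models):
    """battles: (winner_idx, loser_idx) pairs. Returns centered log-strengths."""
    def nll(s):
        # Negative log-likelihood: -sum log sigmoid(s_winner - s_loser).
        diffs = np.array([s[w] - s[l] for w, l in battles])
        return np.sum(np.log1p(np.exp(-diffs)))
    res = minimize(nll, np.zeros(n_models), method="BFGS")
    return res.x - res.x.mean()  # center scores for identifiability

# Toy usage: model 0 beats 1 twice and beats 2 once; model 1 beats 2 once.
print(fit_bradley_terry([(0, 1), (0, 1), (0, 2), (1, 2)], n_models=3))
```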
Do large language models have stable reasoning ability? "This research from the Shanghai AI Laboratory (@OpenCompassX) uses an innovative evaluation method to reveal a key problem: although LLMs can look impressive on a single attempt (OpenAI's latest model reaches 66.5% single-run accuracy), in scenarios that demand consistently stable output, nearly every model's performance drops sharply (typically by more than…
💬 1 · 🔁 3 · ❤️ 10
You're welcome to submit your LMM to our new leaderboard.
OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multimodal math-reasoning benchmarks. Currently, Gemini-2.0-Flash holds first place. DM me to suggest more benchmarks and models for this LB.
💬 0 · 🔁 0 · ❤️ 1
OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multimodal math-reasoning benchmarks. Currently, Gemini-2.0-Flash holds first place. DM me to suggest more benchmarks and models for this LB.
💬 0 · 🔁 2 · ❤️ 8
🚀 Shocking: o1-mini scores just 15.6% on AIME under strict, real-world metrics. 🚨 📈 Introducing G-Pass@k: a metric that reveals LLMs' performance consistency across trials. 🌐 LiveMathBench: challenging LLMs with contemporary math problems while minimizing data leaks. 🔍 Our…
💬 3 · 🔁 14 · ❤️ 64
MMBench has been selected as one of the most influential papers at ECCV 2024, ranking second.🎉🎉🎉 https://t.co/oLXhoONN3P
💬 0 · 🔁 1 · ❤️ 1