OpenCompass
@OpenCompassX
277 Followers · 64 Following · 45 Media · 80 Statuses
OpenCompass focuses on the evaluation and analysis of large language models and vision-language models. GitHub: https://t.co/zF7ycuTXxs
China
Joined April 2024
OpenCompass is now on Twitter (@X), focusing on the evaluation and analysis of Large Language Models and Vision-Language Models. Welcome to star our project:
github.com
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets. - open-compass/opencompass
💬 0 · 🔁 1 · ❤️ 6
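For anyone arriving from the pinned repo: a minimal sketch of what an OpenCompass evaluation config looks like, based on the pattern in the open-compass/opencompass docs. The exact module paths below are illustrative assumptions; check the repo's configs/ directory for the real ones.

```python
# Minimal OpenCompass config sketch (module paths are illustrative).
from mmengine.config import read_base

with read_base():
    # Pull in a prebuilt dataset config and a HuggingFace model config.
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from .models.hf_internlm.hf_internlm2_7b import models as internlm2_models

datasets = gsm8k_datasets   # benchmarks to run
models = internlm2_models   # models to evaluate
# Launch with: python run.py path/to/this_config.py
```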
😉Introducing VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. 🥰It comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. 👇 https://t.co/WLj1BAXgVz
💬 0 · 🔁 0 · ❤️ 0
😀A new study finds that long prompts cause a fidelity–diversity trade-off in leading T2I models: more detail but reduced diversity. 😉To evaluate this issue, the authors introduce LPD-Bench and propose PromptMoG, a training-free approach that enhances diversity by sampling…
💬 0 · 🔁 0 · ❤️ 1
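The post above is cut off, but reading "MoG" as mixture of Gaussians, the diversity trick presumably amounts to sampling new prompt embeddings around several rephrasings rather than encoding one long prompt deterministically. A toy sketch of that sampling idea only; the function name, parameters, and equal-weight isotropic components are my assumptions, not the paper's method.

```python
import numpy as np

def sample_prompt_embeddings(rephrasing_embs: np.ndarray, sigma: float = 0.05,
                             n_samples: int = 4, seed=None) -> np.ndarray:
    """Treat each rephrased-prompt embedding as the mean of a Gaussian
    component (equal weights, isotropic covariance sigma^2 * I) and draw
    new latent prompts from the resulting mixture."""
    rng = np.random.default_rng(seed)
    # Choose a mixture component per sample, then perturb around its mean.
    idx = rng.integers(0, len(rephrasing_embs), size=n_samples)
    noise = rng.normal(0.0, sigma, size=(n_samples, rephrasing_embs.shape[1]))
    return rephrasing_embs[idx] + noise

# rephrasing_embs: (k, d) array of text-encoder outputs for k rephrasings.
```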
😀SpatialSky-Bench is a comprehensive benchmark designed to evaluate the spatial-intelligence capabilities of VLMs in UAV navigation. 😉Two categories: Environmental Perception and Scene Understanding. 😉13 subcategories, including bounding boxes, color, distance, …
💬 0 · 🔁 0 · ❤️ 1
🥰VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. 😊Two-stage evaluation framework: 1. Examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. …
💬 0 · 🔁 0 · ❤️ 1
🥰OutSafe-Bench, the first comprehensive content-safety evaluation suite designed for the multimodal era. 😍Covers 4 modalities: 18,000+ bilingual (ZH/EN) text prompts, 4,500 images, 450 audio clips, and 450 videos. 👏OutSafe-Bench is now part of the Daily Benchmark.
💬 0 · 🔁 1 · ❤️ 2
🚀 OpenCompass Daily Benchmark is live! ✅ Daily updates of the latest AI evaluation papers ✅ AI-powered smart summaries ✅ Available in English & Chinese 😍Stay ahead of AI trends, key insights, and cutting-edge research—all in one place! 🔗 https://t.co/kPchXbAJWz
💬 0 · 🔁 2 · ❤️ 2
🔥China’s Open-source VLMs boom—Intern-S1, MiniCPM-V-4, GLM-4.5V, Step3, OVIS 🧐Join the AI Insight Talk with @huggingface, @OpenCompassX, @ModelScope2022 and @ZhihuFrontier 🚀Tech deep-dives & breakthroughs 🚀Roundtable debates ⏰Aug 21, 5 AM PDT 📺Live: https://t.co/brweSm4yT5
💬 2 · 🔁 3 · ❤️ 18
🚀 Introducing #CompassVerifier: a unified and robust answer verifier for #LLM evaluation and #RLVR! ✨LLM progress is bottlenecked by weak evaluation. Looking for an alternative to rule-based verifiers? CompassVerifier handles multiple domains, including math, science, and…
💬 0 · 🔁 1 · ❤️ 5
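To make the RLVR angle concrete: in verifier-based RL, the reward for a rollout is whether a checker accepts the model's answer, and a model-based verifier slots in where string matching would sit. A hedged sketch under that framing; `verify` here is a hypothetical callable, not CompassVerifier's actual API.

```python
# Hypothetical RLVR reward functions contrasting a rule-based checker
# with a model-based verifier such as CompassVerifier.
def rule_based_reward(answer: str, gold: str) -> float:
    # Brittle: "1/2" vs "0.5", stray units, or LaTeX wrappers all fail.
    return 1.0 if answer.strip() == gold.strip() else 0.0

def verifier_reward(question: str, answer: str, gold: str, verify) -> float:
    # `verify` is assumed to judge semantic equivalence given full context,
    # so differently formatted but correct answers still earn reward.
    return 1.0 if verify(question=question, answer=answer, gold=gold) else 0.0
```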
🥳#CodeCriticBench assesses LLMs' critiquing ability in code-generation and QA tasks. Covering 10 criteria, it features a 4.3k-sample dataset with three difficulty levels and a balanced distribution. 😉CodeCriticBench is now part of the #CompassHub! 😚Feel free to download and…
💬 0 · 🔁 0 · ❤️ 3
🥳#StructFlowBench is a structurally annotated multi-turn benchmark that leverages a structure-driven generation paradigm to enhance the simulation of complex dialogue scenarios. 🥳StructFlowBench is now part of the #CompassHub! 😉Feel free to download and explore it—available…
💬 0 · 🔁 1 · ❤️ 3
😉#VBench is a comprehensive benchmark that evaluates video-generation quality across 16 dimensions, and it also provides a dataset of human preference annotations. 🥳VBench is now part of the #CompassHub! Feel free to download and explore it—available for…
💬 0 · 🔁 0 · ❤️ 2
🥰VLM²-Bench is the first comprehensive benchmark that evaluates vision-language models' (#VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases. 🥳VLM²-Bench is now part of the #CompassHub!
💬 0 · 🔁 0 · ❤️ 6
We've uploaded the AIME 2025 exam, complete with questions and solutions, here: https://t.co/f5Tv8rx0gX. Feel free to test your powerful LLM on this dataset.
huggingface.co
💬 0 · 🔁 0 · ❤️ 4
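The shortened link hides the exact dataset id, so the repo path below is a placeholder; assuming a standard Hugging Face dataset layout, loading it for a quick eval would look roughly like this.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the dataset behind the t.co link above.
aime = load_dataset("opencompass/AIME2025", split="test")
print(aime[0])  # expect fields for the problem statement and its solution
```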
🌟 Exciting News! CompassArena is now back with some major updates: - **Judge Copilot**: an LLM-as-a-Judge tool for model comparisons. 🤖 - **Enhanced Statistical Model**: improved Bradley-Terry accuracy by addressing confounding variables. 📊 - **20+ New LLMs**: a global mix of…
💬 0 · 🔁 2 · ❤️ 5
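Context on the Bradley-Terry item: arena leaderboards typically fit per-model strengths by maximum likelihood over pairwise battles, with P(i beats j) = sigmoid(s_i - s_j). A minimal sketch of that vanilla fit, not CompassArena's pipeline, and without the confounder adjustments the post alludes to.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(battles, n_models):
    """battles: (winner_idx, loser_idx) pairs. Returns centered log-strengths."""
    def nll(s):
        # Negative log-likelihood: -sum log sigmoid(s_winner - s_loser).
        diffs = np.array([s[w] - s[l] for w, l in battles])
        return np.sum(np.log1p(np.exp(-diffs)))
    res = minimize(nll, np.zeros(n_models), method="BFGS")
    return res.x - res.x.mean()  # center scores for identifiability

# Toy usage: model 0 beats 1 twice and beats 2 once; model 1 beats 2 once.
print(fit_bradley_terry([(0, 1), (0, 1), (0, 2), (1, 2)], n_models=3))
```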
Do large language models have stable reasoning ability? "This research from the Shanghai AI Laboratory (@OpenCompassX) uses an innovative evaluation method to reveal a key problem: although LLMs can look impressive on a single attempt (OpenAI's latest model reaches 66.5% single-run accuracy), in scenarios that demand consistently stable output, nearly every model's performance drops sharply (typically by more than…
💬 1 · 🔁 3 · ❤️ 10
You're welcome to submit your LMM to our new leaderboard.
OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multimodal math-reasoning benchmarks. Currently, Gemini-2.0-Flash holds first place. DM me to suggest more benchmarks and models for this LB.
💬 0 · 🔁 0 · ❤️ 1
OpenCompass has established a leaderboard to evaluate the complex reasoning capability of LMMs, consisting of four advanced multimodal math-reasoning benchmarks. Currently, Gemini-2.0-Flash holds first place. DM me to suggest more benchmarks and models for this LB.
💬 0 · 🔁 2 · ❤️ 8
🚀 Shocking: o1-mini scores just 15.6% on AIME under strict, real-world metrics. 🚨 📈 Introducing G-Pass@k: a metric that reveals LLMs' performance consistency across trials. 🌐 LiveMathBench: challenging LLMs with contemporary math problems while minimizing data leaks. 🔍 Our…
💬 3 · 🔁 14 · ❤️ 64
MMBench has been selected as one of the most influential papers at ECCV 2024, ranking second.🎉🎉🎉 https://t.co/oLXhoONN3P
💬 0 · 🔁 1 · ❤️ 1