BigCode
@BigCodeProject
Followers 9K · Following 229 · Media 61 · Statuses 272
Open and responsible research and development of large language models for code. #BigCodeProject run by @huggingface + @ServiceNowRSRCH
Joined August 2022
Introducing: StarCoder2 and The Stack v2 ⭐️ StarCoder2 is trained with a 16K-token context and repo-level information on 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens. All code, data, and models are fully open! https://t.co/fM7GinxJBd
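For readers who want to try the models right away, here is a minimal sketch of loading a StarCoder2 checkpoint with the transformers library. The checkpoint id below is one of the published sizes, and the generation settings are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-7b"  # 3b and 15b variants also exist on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype/device that fits your hardware
    device_map="auto",
)

# Greedy completion of a short code prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```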
It’s so much fun working with the other 39 community members on this project! Start trying out various frontier models in BigCodeArena today.
Introducing BigCodeArena, a human-in-the-loop platform for evaluating code through execution. Unlike current open evaluation platforms that collect human preferences on text, it enables interaction with runnable code to assess functionality and quality across any language.
- For more details, please check out the blog: https://t.co/TMyYpHA3lM
- Try recent LLMs (e.g., the Qwen3 series and DeepSeek-V3.2) on BigCodeArena now: https://t.co/KPk8q0wB8G
- Paper: https://t.co/a9plXehamb
- GitHub:
github.com
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution - bigcode-project/bigcodearena
BigCodeArena could not have been built without the support of the BigCode community. We are grateful for the generous credits provided by the @e2b team. We thank @hyperbolic_labs, @nvidia, and @Alibaba_Qwen for providing the model inference endpoints.
Beyond human votes, we release two new benchmarks:
- BigCodeReward: tests reward models on 4.7K human preference votes. Execution feedback improves judgment accuracy.
- AutoCodeArena: automated evaluation of 20+ LLMs. GPT-5 leads, followed by Claude-Opus-4 and Claude-Sonnet-4.
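As a rough illustration of what "judgment accuracy" on preference votes means, here is a minimal sketch that scores how often a judge or reward model agrees with the human vote on the same comparisons. BigCodeReward's exact protocol may differ; this is only the basic agreement metric.

```python
# Illustrative only: agreement between a judge's picks and human preference votes.
def judgment_accuracy(human_votes, judge_picks):
    """Both inputs are equal-length lists of "a"/"b" choices for the same comparisons."""
    assert len(human_votes) == len(judge_picks)
    agree = sum(h == j for h, j in zip(human_votes, judge_picks))
    return agree / len(human_votes)

print(judgment_accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))  # 0.75
```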
In 5 months, BigCodeArena collected 14K conversations & 4.7K preference votes across 10 frontier LLMs. Findings:
1. o3-mini & o1-mini consistently top the Elo rankings
2. Claude-3.5-Sonnet excels in matched-language settings
3. Previous open models still lag
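As background on how pairwise preference votes become a leaderboard, here is a minimal, illustrative Elo update over (model_a, model_b, winner) votes. The actual BigCodeArena ranking pipeline may differ (e.g., a Bradley-Terry fit), the K-factor and base rating below are arbitrary, and ties are omitted for brevity.

```python
from collections import defaultdict

K = 32          # update step size (assumed)
BASE = 1000.0   # starting rating (assumed)

def expected(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_from_votes(votes):
    """votes: iterable of (model_a, model_b, winner) with winner in {"a", "b"}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in votes:
        score_a = 1.0 if winner == "a" else 0.0
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))
    return dict(ratings)

print(elo_from_votes([("model-x", "model-y", "a"), ("model-y", "model-z", "b")]))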
Why does this matter? Benchmarks like HumanEval only scratch the surface. Reading code “by eye” is error-prone. True quality emerges when you actually run it: web apps render, games play, edge cases break. BigCodeArena makes execution feedback the default.
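To make the execution-first idea concrete, here is a minimal sketch that runs a candidate snippet and records what actually happened (exit code, stdout, stderr) instead of judging the text by eye. A production setup such as BigCodeArena uses isolated sandboxes; the bare subprocess call below is illustrative only.

```python
import subprocess, sys, tempfile, textwrap

def run_candidate(code: str, timeout: float = 5.0) -> dict:
    """Execute a candidate snippet and return its observable behavior."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(code))
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stdout": "", "stderr": "timed out"}

print(run_candidate("print(sum(range(10)))"))  # returncode 0, stdout "45\n"
```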
Introducing BigCodeArena, a human-in-the-loop platform for evaluating code through execution. Unlike current open evaluation platforms that collect human preferences on text, it enables interaction with runnable code to assess functionality and quality across any language.
The BigCodeBench @BigCodeProject evaluation framework has been fully upgraded! Just pip install -U bigcodebench. With v0.2.0, it's now much easier to use than the previous v0.1.* versions. The new version adopts the @Gradio Client API interface from @huggingface Spaces by
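A hedged sketch of what calling a Space-hosted evaluator through the Gradio Client API can look like; the Space id and endpoint name below are assumptions for illustration only, so check the BigCodeBench docs and leaderboard for the actual interface.

```python
from gradio_client import Client

# Hypothetical Space id and endpoint; the real evaluator interface may differ.
client = Client("bigcode/bigcodebench-evaluator")
result = client.predict(
    "samples.jsonl",        # hypothetical path to generated solutions
    api_name="/evaluate",   # hypothetical endpoint name
)
print(result)
```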
Evaluating LM agents has come a long way since GPT-4 was released in March 2023. We now have SWE-Bench, (Visual) WebArena, and other evaluations that tell us a lot about how the best models + architectures do on hard and important tasks. There's still lots to do, though 🧵
People may think BigCodeBench @BigCodeProject is nothing more than a straightforward coding benchmark, but it is not. BigCodeBench is a rigorous testbed for LLM agents using code to solve complex and practical challenges. Each task demands significant reasoning capabilities for
By popular demand, I have released the StarCoder2 code documentation dataset. Please check it out ⬇️ https://t.co/jQ9xsmIH4e
This work will appear at OOPSLA 2024. New since last year: the StarCoder2 LLM from @BigCodeProject uses MultiPL-T as part of its pretraining corpus.
LLMs are great at programming tasks... for Python and other very popular PLs. But they are often unimpressive at artisanal PLs, like OCaml or Racket. We've come up with a way to significantly boost LLM performance on low-resource languages. If you care about them, read on!
Today, we are happy to announce the beta of real-time code execution for BigCodeBench @BigCodeProject, which has been integrated into our Hugging Face leaderboard. We understand that setting up a dependency-heavy execution environment can be cumbersome, even with the
In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving
Releasing BigCodeBench-Hard: a subset of more challenging and user-facing tasks. BigCodeBench-Hard provides more accurate model performance evaluations, and we also investigate some recent model updates. Read more: https://t.co/zc4NuIvIDX Leaderboard: https://t.co/zbTr3JPcCe
BigCodeBench dataset 🌸 Use it as inspiration when building your Generative AI evaluations. h/t: @BigCodeProject @terryyuezhuo @lvwerra @clefourrier @huggingface (to name just a few of the people involved)
Ppl are curious about the performance of DeepSeek-Coder-V2-Lite on BigCodeBench. We've added its results, along with a few other models, to the leaderboard! https://t.co/EcaiPk7FcZ DeepSeek-Coder-V2-Lite-Instruct is a beast indeed, similar to Magicoder-S-DS-6.7B, but with only
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.
It is time to deprecate HumanEval! 🧑🏻💻 @BigCodeProject just released BigCodeBench, a new benchmark that evaluates LLMs on challenging, realistic, function-level coding tasks requiring diverse libraries and complex reasoning! 👀 🧩 Contains
In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.
We release leaderboard, dataset, code, and paper:
- 🤓 Blog: https://t.co/dRo1vRBLPV
- 🌐 Website: https://t.co/oUm1EcpAjn
- 🏆 Leaderboard: https://t.co/zbTr3JOEMG
- 📚 Dataset: https://t.co/MNpQBJ2Mbk
- 🛠️ Code: https://t.co/wJF09Yi3H2
- 📄 Paper:
BigCodeBench contains 1,140 function-level tasks that challenge LLMs to follow instructions and compose multiple function calls, as tools, from 139 Python libraries. To evaluate LLMs rigorously, each programming task ships with an average of 5.6 test cases and 99% branch coverage.
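For anyone who wants to poke at the tasks directly, here is a minimal sketch using the datasets library. The dataset id "bigcode/bigcodebench" is assumed from the project's Hugging Face org, and split and field names may differ between releases, so the code inspects them rather than hard-coding any.

```python
from datasets import load_dataset

# Assumed dataset id; check the project's Hugging Face org for the current release.
ds = load_dataset("bigcode/bigcodebench")
print(ds)                          # available splits and their sizes
first_split = next(iter(ds.values()))
print(first_split.column_names)    # task fields (e.g., prompt/test; names vary by release)
print(first_split[0])              # one full task record
```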