CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

NJU, M-A-P, Alibaba, BUAA, CASIA, BUPT, Kuaishou, OPPO
Figure: Teaser overview of CodeCriticBench.

Introduction

CodeCriticBench is the first holistic code critique benchmark for Large Language Models (LLMs), designed to evaluate their capacity to provide meaningful code feedback and suggestions. The benchmark encompasses both Code Generation and Code QA tasks across multiple difficulty levels, featuring comprehensive evaluation through Basic Critique Evaluation (BCE) and Advanced Critique Evaluation (ACE) protocols.

Dataset Statistics

CodeCriticBench comprises 4,300 high-quality samples organized into two main categories: Code Generation (3,200 samples) and Code QA (1,100 samples). The dataset is further stratified by difficulty levels: Easy (1,517 samples), Medium (1,084 samples), and Hard (1,699 samples), ensuring comprehensive coverage across different complexity levels.
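As a quick illustration of how these splits could be tallied from the released data, the Python sketch below assumes each sample is a JSONL record with "task_type" and "difficulty" fields; the field names and file name are illustrative, not the official schema.

import json
from collections import Counter

def count_splits(path: str) -> tuple[Counter, Counter]:
    """Tally samples per task type and per difficulty level from a JSONL dump."""
    by_task, by_difficulty = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            by_task[sample["task_type"]] += 1        # e.g. "code_generation" or "code_qa"
            by_difficulty[sample["difficulty"]] += 1  # e.g. "easy", "medium", "hard"
    return by_task, by_difficulty

if __name__ == "__main__":
    by_task, by_difficulty = count_splits("codecriticbench.jsonl")
    print(by_task)        # expected: Counter({"code_generation": 3200, "code_qa": 1100})
    print(by_difficulty)  # expected: Counter({"hard": 1699, "easy": 1517, "medium": 1084})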

Data Collection and Construction Pipeline

Figure: Data collection and construction pipeline of CodeCriticBench.

Our data collection process involves multiple stages. For Code Generation, we gathered problems from CodeForces, MBPP, and LiveCodeBench, supplemented with expert-validated problems; for Code QA, we collected authentic developer questions from StackOverflow and generated high-quality responses. Each sample is paired with fine-grained evaluation checklists across multiple dimensions, enabling comprehensive assessment.
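To make the described sample layout concrete, here is a minimal sketch of one record as a Python dataclass; the field names (question, candidate_solution, checklist, and so on) are our own illustration of the structure described above, not the official release format.

from dataclasses import dataclass, field

@dataclass
class CriticSample:
    """Illustrative record pairing a problem and a candidate answer with critique targets."""
    task_type: str           # "code_generation" (CodeForces/MBPP/LiveCodeBench) or "code_qa" (StackOverflow-derived)
    difficulty: str          # "easy" | "medium" | "hard"
    question: str            # problem statement or developer question
    candidate_solution: str  # code or answer to be critiqued
    is_correct: bool         # gold correctness label consumed by Basic Critique Evaluation (BCE)
    checklist: dict[str, str] = field(default_factory=dict)  # dimension name -> fine-grained criterion (used by ACE)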

Dual Evaluation Framework

CodeCriticBench features a comprehensive dual evaluation system: Basic Critique Evaluation (BCE) measures binary correctness assessment capabilities, while Advanced Critique Evaluation (ACE) employs fine-grained checklists across 10 dimensions including correctness verification, code readability, robustness validation, and algorithm optimization. This multi-dimensional approach enables nuanced assessment of LLM critique capabilities.
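The sketch below shows one plausible way to compute the two scores under our reading of the protocols: BCE as accuracy of a binary correct/incorrect verdict, and ACE as an average over per-dimension checklist scores. The aggregation scheme and the 10-point score range are assumptions rather than the official implementation.

from statistics import mean

def bce_accuracy(predictions: list[bool], gold_labels: list[bool]) -> float:
    """Basic Critique Evaluation: fraction of verdicts that match the gold correctness label."""
    assert len(predictions) == len(gold_labels)
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)

def ace_score(dimension_scores: dict[str, float], max_score: float = 10.0) -> float:
    """Advanced Critique Evaluation: average the per-dimension checklist scores
    (correctness verification, readability, robustness, ...) and normalize."""
    return mean(dimension_scores.values()) / max_score

print(bce_accuracy([True, False, True], [True, True, True]))             # ~0.67
print(ace_score({"correctness": 8, "readability": 6, "robustness": 7}))  # 0.7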

Comprehensive Model Evaluation

Figure: Model evaluation results on CodeCriticBench.

We conducted extensive evaluation of 38 state-of-the-art LLMs, including both open-source models (Qwen2.5 series, Gemma3 series, Seed-Coder, DeepSeek series) and proprietary systems (GPT-4o, Claude 3.5 Sonnet, Gemini-2.5-Pro). Our results reveal significant performance variations across models, with o1-like models achieving notable improvements, though even the best performers struggle with the most challenging tasks.

Performance Across Difficulty Levels

Figure: Performance breakdown across difficulty levels.

Analysis across difficulty levels reveals that even top-performing models struggle to exceed 50% accuracy on the most challenging subset, indicating substantial room for improvement. Model rankings remain consistent across difficulty tiers, demonstrating the benchmark's discriminative power at every complexity level.

Scaling Law and Error Analysis


Our analysis confirms strong scaling effects, with larger models consistently outperforming smaller counterparts across all evaluation metrics. Additionally, we conducted detailed error type identification experiments, revealing that models struggle particularly with performance issues and security vulnerabilities, while showing better performance on syntax errors and logical inconsistencies.
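As a hedged illustration of this kind of per-error-type breakdown, the helper below groups judgments by an annotated error type and reports accuracy for each group; the record layout and labels are ours, not the released evaluation code.

from collections import defaultdict

def accuracy_by_error_type(records: list[dict]) -> dict[str, float]:
    """Group critique verdicts by annotated error type and report accuracy for each."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["error_type"]] += 1
        hits[r["error_type"]] += int(r["model_verdict"] == r["gold_verdict"])
    return {etype: hits[etype] / totals[etype] for etype in totals}

# Toy records reflecting the observed trend: syntax errors are caught more reliably than security issues.
records = [
    {"error_type": "syntax", "model_verdict": "incorrect", "gold_verdict": "incorrect"},
    {"error_type": "security", "model_verdict": "correct", "gold_verdict": "incorrect"},
]
print(accuracy_by_error_type(records))  # {'syntax': 1.0, 'security': 0.0}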

BibTeX

@article{zhang2025codecriticbench,
  title={{CodeCriticBench}: A Holistic Code Critique Benchmark for Large Language Models},
  author={Zhang, Chenchen and Xia, Jinxiang and Liu, Jiaheng and Zhang, Wei and Wang, Yejie and Yang, Jian and Zhang, Ge and Liu, Tianyu and Peng, Zhongyuan and Tan, Yingshui and others},
  journal={arXiv preprint arXiv:2502.16614},
  year={2025}
}