Disclaimer: Evaluation can never be comprehensive, and any leaderboard can be gamed in unhealthy ways. This is especially concerning for LLMs, where most models are not public. For example, good results could be achieved by distilling answers from strong models (e.g., GPT-4) or even from humans, or by locating the exact test data and adding it to the training set; scores obtained in such ways are immediately meaningless. We therefore advise users to read the leaderboard with caution, and we split it into (1) Models with Open Access: these models have public weights or APIs, so users can verify their performance themselves; and (2) Models with Limited Access: these models are not readily accessible to the public.
Results for each subject and the average test results are shown below. Results come from either zero-shot or few-shot prompting (model details, including the prompting format, can be viewed by clicking on each model). You are welcome to submit your model's test predictions to C-Eval at any time (either zero-shot or few-shot evaluation is fine), and the submission system will automatically compute the scores. Click here to submit your results (they will not appear on the public leaderboard unless you request it).
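For reference, the scoring performed on a submission boils down to per-subject accuracy plus an overall average. The sketch below is a minimal illustration, assuming predictions and gold answers are dicts mapping subject → question id → choice letter, and that Avg is an unweighted (macro) average over subjects; it is not C-Eval's actual submission schema or scoring code.

```python
from typing import Dict

def score_submission(predictions: Dict[str, Dict[str, str]],
                     gold: Dict[str, Dict[str, str]]) -> Dict[str, float]:
    """Per-subject accuracy plus an unweighted average across subjects.

    The nested-dict layout (subject -> question id -> answer letter) is an
    assumption for illustration, not the official submission format.
    """
    scores: Dict[str, float] = {}
    for subject, answers in gold.items():
        preds = predictions.get(subject, {})
        correct = sum(preds.get(qid) == ans for qid, ans in answers.items())
        scores[subject] = correct / len(answers)
    # "Avg" here is a macro average over subjects -- an assumption about
    # how the leaderboard column is computed.
    scores["Avg"] = sum(scores.values()) / len(scores)
    return scores

# Tiny example with two hypothetical subjects and A-D choice letters.
preds = {"physics": {"0": "A", "1": "C"}, "law": {"0": "B"}}
gold = {"physics": {"0": "A", "1": "B"}, "law": {"0": "B"}}
print(score_submission(preds, gold))  # {'physics': 0.5, 'law': 1.0, 'Avg': 0.75}
```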
(Note: * indicates that the model was evaluated by the C-Eval team; other results were obtained from users' submitted predictions.)
| # | Model | Creator | Access | Submission Date | Avg | Avg(Hard) | STEM | Social Science | Humanities | Others |
|---|-------|---------|--------|-----------------|-----|-----------|------|----------------|------------|--------|