Benchmark Report on 41 Open-Source Large Language Models
Project Overview
This project is a large-scale evaluation of open-source large language models (LLMs): 41 models were run through 19 benchmark tasks using the lm-evaluation-harness library. All evaluations were completed locally on a personal computer, and the results show how the models perform across different task types.
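For reference, a single evaluation run with lm-evaluation-harness looks roughly like the minimal sketch below (the model path and task subset are illustrative placeholders, not this project's exact launch configuration):

```python
# Minimal sketch of one evaluation run via the lm-evaluation-harness
# Python API (v0.4+). Model path and task list are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen3-8B,dtype=bfloat16",
    tasks=["gsm8k", "hellaswag", "mmlu"],  # a subset of the 19 tasks
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) are keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```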
Evaluation Framework
Test Categories
The 19 benchmark tasks are divided into three main categories (a code sketch of this grouping follows the list):
1. Reasoning & Math
- Tasks: gsm8k, bbh, arc_challenge, anli_r1/r2/r3, gpqa_main_zeroshot
- Evaluation Metrics: Exact match, strict match, normalized accuracy, etc.
2. Commonsense & Natural Language Inference (NLI)
- Tasks: hellaswag, piqa, winogrande, boolq, openbookqa, sciq, qnli
- Evaluation Metrics: Normalized accuracy, accuracy, etc.
3. Knowledge & Reading Comprehension
- Tasks: mmlu, nq_open, drop, truthfulqa_mc1/mc2, triviaqa
- Evaluation Metrics: Accuracy, exact match, F1 score, etc.
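A plain mapping from category to task names makes the grouping above concrete (a sketch; the harness itself runs each task independently and has no notion of these categories):

```python
# Category -> benchmark task names, as listed above. anli is split into
# three rounds (r1/r2/r3), and truthfulqa contributes two metrics (mc1/mc2).
CATEGORIES = {
    "reasoning_math": [
        "gsm8k", "bbh", "arc_challenge",
        "anli_r1", "anli_r2", "anli_r3", "gpqa_main_zeroshot",
    ],
    "commonsense_nli": [
        "hellaswag", "piqa", "winogrande", "boolq",
        "openbookqa", "sciq", "qnli",
    ],
    "knowledge_rc": [
        "mmlu", "nq_open", "drop", "truthfulqa", "triviaqa",
    ],
}
```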
Key Metrics Explanation
Model Naming Convention
- Format: Company_ModelName
- Quantized models are marked with an (8bit) suffix (a loading sketch follows)
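In practice, the (8bit) marker means the model's weights were held in 8-bit precision during evaluation. Below is a hedged sketch of loading such a model with transformers and bitsandbytes; whether this project used these exact arguments is an assumption:

```python
# Minimal sketch: load a model with 8-bit quantized weights using
# transformers + bitsandbytes. The exact quantization mechanism used in
# this project is an assumption; the (8bit) marker only indicates that
# weights were held in 8-bit precision during evaluation.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-14B"  # one of the (8bit) entries on the leaderboard
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```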
Time Metrics
- Total Time: Wall-clock system runtime to complete all benchmark tasks
- GPU Util Time: Equivalent time on an RTX 5090 at 100% utilization (a conversion sketch follows)
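The two time columns relate by simple arithmetic: equivalent GPU time is wall-clock runtime scaled by average utilization. A minimal sketch (the 0.90 utilization figure is illustrative, not a measured value):

```python
# Convert wall-clock runtime to equivalent GPU time at 100% utilization.
# avg_utilization would come from sampling during the run (e.g., nvidia-smi);
# the 0.90 below is an illustrative value, not measured data.
def gpu_util_time(total_hours: float, avg_utilization: float) -> float:
    return total_hours * avg_utilization

total = 15 + 45 / 60  # gemma-3-12b-it: 15h 45m wall clock
print(gpu_util_time(total, 0.90))  # ~14.2h, close to the reported 14h 8m
```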
Scoring System
- Mean Score: Arithmetic mean of the scores across all benchmark tasks (a computation sketch follows this list)
- Score Range: 0 to 1; higher scores indicate better performance
- Ranking: Models are ordered by their mean score across tasks
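A minimal sketch of the Mean Score computation (the per-task scores below are hypothetical placeholders, not results from this project):

```python
# Mean Score = arithmetic mean over all benchmark tasks (range 0-1).
# The per-task scores here are hypothetical, for illustration only.
task_scores = {
    "gsm8k": 0.62, "hellaswag": 0.79, "mmlu": 0.66,
    # ... one entry per benchmark task
}

mean_score = sum(task_scores.values()) / len(task_scores)
print(f"Mean Score: {mean_score:.4f}")
```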
Test Results Leaderboard
Overall Ranking (Top 10)
| Rank | Model Name | Total Time | GPU Util Time | Mean Score |
|---|---|---|---|---|
| 1 | google_gemma-3-12b-it | 15h 45m | 14h 8m | 0.6038 |
| 2 | Qwen_Qwen3-14B (8bit) | 29h 45m | 17h 29m | 0.5961 |
| 3 | openchat_openchat-3.6-8b-20240522 | 7h 51m | 6h 59m | 0.5871 |
| 4 | Qwen_Qwen3-8B | 15h 31m | 13h 44m | 0.5859 |
| 5 | Qwen_Qwen2.5-7B-Instruct | 9h 36m | 8h 33m | 0.5788 |
| 6 | Qwen_Qwen2.5-14B-Instruct (8bit) | 52h 44m | 29h 32m | 0.5775 |
| 7 | 01-ai_Yi-1.5-9B | 11h 43m | 10h 26m | 0.5676 |
| 8 | Qwen_Qwen2.5-7B-Instruct-1M | 11h 17m | 10h 10m | 0.5672 |
| 9 | meta-llama_Llama-3.1-8B-Instruct | 12h 19m | 10h 52m | 0.5653 |
| 10 | 01-ai_Yi-1.5-9B-Chat | 13h 54m | 12h 15m | 0.5621 |
Category Ranking Highlights
Reasoning & Math Ranking (Top 5)
- google_gemma-3-12b-it (0.6266)
- Qwen_Qwen3-8B (0.6214)
- Qwen_Qwen3-14B (8bit) (0.586)
- Qwen_Qwen3-4B (0.5712)
- Qwen_Qwen2.5-7B-Instruct (0.5541)
Commonsense & NLI Ranking (Top 5)
- Qwen_Qwen2.5-14B-Instruct (8bit) (0.7941)
- Qwen_Qwen3-14B (8bit) (0.7807)
- google_gemma-3-12b-it (0.7737)
- Qwen_Qwen2.5-7B-Instruct (0.773)
- openchat_openchat-3.6-8b-20240522 (0.7726)
Knowledge & Reading Comprehension Ranking (Top 5)
- 01-ai_Yi-1.5-9B (0.4369)
- openchat_openchat-3.6-8b-20240522 (0.4136)
- meta-llama_Llama-3.1-8B-Instruct (0.4127)
- 01-ai_Yi-1.5-6B (0.4063)
- mistralai_Mistral-7B-Instruct-v0.3 (0.4045)
Key Findings
Performance Analysis
- Google Gemma-3-12B-IT tops the overall ranking, particularly excelling in reasoning and math tasks
- Qwen series models show strong performance across all categories, especially in commonsense reasoning
- Yi series models excel in knowledge and reading comprehension tasks
- Quantized (8bit) models maintain strong performance while substantially reducing memory requirements (8-bit weights take roughly half the footprint of 16-bit weights)
Efficiency Analysis
- Smaller models can compete with larger ones on certain tasks
- GPU utilization time correlates positively with model size and complexity
- Some medium-scale models offer markedly better cost-effectiveness (see the quick comparison below)
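As a quick worked comparison using two rows from the leaderboard above, score per equivalent GPU hour makes the cost-effectiveness point concrete:

```python
# Score per equivalent GPU hour, using leaderboard rows from above.
# GPU Util Time converted to hours: 14h 8m and 6h 59m respectively.
models = {
    "google_gemma-3-12b-it":             (0.6038, 14 + 8 / 60),
    "openchat_openchat-3.6-8b-20240522": (0.5871, 6 + 59 / 60),
}

for name, (score, gpu_hours) in models.items():
    print(f"{name}: {score / gpu_hours:.4f} score per GPU hour")
# openchat yields roughly twice the score per GPU hour of the top model,
# despite a mean score only ~0.017 lower.
```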
Project Resource Consumption
- Total Machine Runtime: 18 days 8 hours
- Equivalent GPU Time: 14 days 23 hours (RTX 5090 at 100% utilization)
- Environmental Impact: Carbon neutralized through active use of public transportation 😊
Project Value
This comprehensive evaluation provides the open-source LLM community with:
- Objective performance comparison benchmarks
- Efficiency analysis of different scale models
- Task-specific model selection guidance
- Empirical data on quantization technique effectiveness
The complete data, scripts, and logs of this project have been open-sourced, providing a valuable reference for researchers and developers.
Data Source: Hugging Face Spaces Leaderboard
Article Source: CurateClick