Benchmark Report on 41 Open-Source Large Language Models

Project Overview

This project is a large-scale evaluation of open-source large language models: 41 open-source LLMs were run through 19 benchmark tests using the lm-evaluation-harness library. All evaluations were completed locally on personal computers, giving a picture of how the models perform across a range of tasks.
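
For readers who want to reproduce a run, below is a minimal sketch using the lm-evaluation-harness Python API. The model name, task selection, and settings are illustrative rather than the exact configuration used in this project; the open-sourced scripts contain the actual commands.

```python
# Minimal sketch: evaluating one model on a few of this report's tasks with
# lm-evaluation-harness (v0.4+ API). Model and settings are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16",
    tasks=["gsm8k", "hellaswag", "mmlu"],
    batch_size="auto",
    device="cuda:0",
)

# Per-task metrics (accuracy, exact match, etc.) are under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```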

Evaluation Framework

Test Categories

The benchmark tests are divided into three main categories:

1. Reasoning & Math

  • Tasks: gsm8k, bbh, arc_challenge, anli_r1/r2/r3, gpqa_main_zeroshot
  • Evaluation Metrics: Exact match, strict match, normalized accuracy, etc.

2. Commonsense & Natural Language Inference (NLI)

  • Tasks: hellaswag, piqa, winogrande, boolq, openbookqa, sciq, qnli
  • Evaluation Metrics: Normalized accuracy, accuracy, etc.

3. Knowledge & Reading Comprehension

  • Tasks: mmlu, nq_open, drop, truthfulqa_mc1/mc2, triviaqa
  • Evaluation Metrics: Accuracy, exact match, F1 score, etc.
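
These category and task names map directly onto lm-evaluation-harness task identifiers, so the grouping can be expressed as a plain dictionary and used to drive per-category runs and aggregation. A sketch (the variable names are my own; anli and truthfulqa appear as separate sub-task identifiers):

```python
# The report's three categories expressed as lm-evaluation-harness task names.
CATEGORIES = {
    "Reasoning & Math": [
        "gsm8k", "bbh", "arc_challenge",
        "anli_r1", "anli_r2", "anli_r3", "gpqa_main_zeroshot",
    ],
    "Commonsense & NLI": [
        "hellaswag", "piqa", "winogrande", "boolq",
        "openbookqa", "sciq", "qnli",
    ],
    "Knowledge & Reading Comprehension": [
        "mmlu", "nq_open", "drop",
        "truthfulqa_mc1", "truthfulqa_mc2", "triviaqa",
    ],
}

# All tasks for a single harness invocation:
ALL_TASKS = [task for tasks in CATEGORIES.values() for task in tasks]
```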

Key Metrics Explanation

Model Naming Convention

  • Format: Company_ModelName
  • Quantized models are marked with (8bit)
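
As a small illustration of this convention (the parsing helper is hypothetical, just to make the format concrete):

```python
# Hypothetical helper: split "Company_ModelName (8bit)" into its parts.
def parse_model_label(label: str) -> tuple[str, str, bool]:
    quantized = label.endswith("(8bit)")
    name = label.removesuffix("(8bit)").strip()
    company, _, model = name.partition("_")
    return company, model, quantized

print(parse_model_label("Qwen_Qwen3-14B (8bit)"))
# -> ('Qwen', 'Qwen3-14B', True)
```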

Time Metrics

  • Total Time: Wall-clock time for the system to complete all benchmark tests
  • GPU Util Time: Equivalent time on an RTX 5090 GPU running at 100% utilization
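
Taken together, the two metrics imply an average GPU utilization for each run: GPU Util Time divided by Total Time. A quick worked example using the leaderboard figures for the top-ranked model (the report does not state how GPU Util Time was derived, so this is a consistency check rather than the project's formula):

```python
# Average GPU utilization implied by the two time metrics, using the
# leaderboard entry for google_gemma-3-12b-it (15h 45m total, 14h 8m GPU).
total_minutes = 15 * 60 + 45      # Total Time
gpu_util_minutes = 14 * 60 + 8    # GPU Util Time

avg_utilization = gpu_util_minutes / total_minutes
print(f"average GPU utilization: {avg_utilization:.1%}")  # -> ~89.7%
```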

Scoring System

  • Mean Score: Arithmetic mean of a model's scores across all benchmark tasks
  • Score Range: 0 to 1; higher scores indicate better performance
  • Ranking: Models are ordered by their mean score across all tasks
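
A minimal sketch of this aggregation step, assuming per-task scores are already normalized to the 0-1 range (the `scores` mapping and its values are hypothetical):

```python
# Mean score and ranking: average each model's task scores, sort descending.
from statistics import mean

scores = {  # hypothetical {model: {task: score}} data, scores in 0-1
    "model_a": {"gsm8k": 0.84, "hellaswag": 0.80, "mmlu": 0.62},
    "model_b": {"gsm8k": 0.86, "hellaswag": 0.76, "mmlu": 0.58},
}

mean_scores = {m: mean(ts.values()) for m, ts in scores.items()}
leaderboard = sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {score:.4f}")
```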

Test Results Leaderboard

Overall Ranking (Top 10)

| Rank | Model Name | Total Time | GPU Util Time | Mean Score |
|------|------------|------------|---------------|------------|
| 1 | google_gemma-3-12b-it | 15h 45m | 14h 8m | 0.6038 |
| 2 | Qwen_Qwen3-14B (8bit) | 29h 45m | 17h 29m | 0.5961 |
| 3 | openchat_openchat-3.6-8b-20240522 | 7h 51m | 6h 59m | 0.5871 |
| 4 | Qwen_Qwen3-8B | 15h 31m | 13h 44m | 0.5859 |
| 5 | Qwen_Qwen2.5-7B-Instruct | 9h 36m | 8h 33m | 0.5788 |
| 6 | Qwen_Qwen2.5-14B-Instruct (8bit) | 52h 44m | 29h 32m | 0.5775 |
| 7 | 01-ai_Yi-1.5-9B | 11h 43m | 10h 26m | 0.5676 |
| 8 | Qwen_Qwen2.5-7B-Instruct-1M | 11h 17m | 10h 10m | 0.5672 |
| 9 | meta-llama_Llama-3.1-8B-Instruct | 12h 19m | 10h 52m | 0.5653 |
| 10 | 01-ai_Yi-1.5-9B-Chat | 13h 54m | 12h 15m | 0.5621 |

Category Ranking Highlights

Reasoning & Math Performance Ranking (Top 5)

  1. google_gemma-3-12b-it (0.6266)
  2. Qwen_Qwen3-8B (0.6214)
  3. Qwen_Qwen3-14B (8bit) (0.5860)
  4. Qwen_Qwen3-4B (0.5712)
  5. Qwen_Qwen2.5-7B-Instruct (0.5541)

Commonsense & NLI Ranking (Top 5)

  1. Qwen_Qwen2.5-14B-Instruct (8bit) (0.7941)
  2. Qwen_Qwen3-14B (8bit) (0.7807)
  3. google_gemma-3-12b-it (0.7737)
  4. Qwen_Qwen2.5-7B-Instruct (0.7730)
  5. openchat_openchat-3.6-8b-20240522 (0.7726)

Knowledge & Reading Comprehension Ranking (Top 5)

  1. 01-ai_Yi-1.5-9B (0.4369)
  2. openchat_openchat-3.6-8b-20240522 (0.4136)
  3. meta-llama_Llama-3.1-8B-Instruct (0.4127)
  4. 01-ai_Yi-1.5-6B (0.4063)
  5. mistralai_Mistral-7B-Instruct-v0.3 (0.4045)

Key Findings

Performance Analysis

  • Google Gemma-3-12B-IT tops the overall ranking, particularly excelling in reasoning and math tasks
  • Qwen series models show strong performance across all categories, especially in commonsense reasoning
  • Yi series models excel in knowledge and reading comprehension tasks
  • Quantized models (8bit) maintain good performance while significantly reducing computational resource requirements
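
On the quantization point: 8-bit runs like the "(8bit)" entries above are commonly produced by loading the model through transformers with a bitsandbytes configuration. A sketch of that pattern follows; whether this project used exactly this loading path is an assumption.

```python
# Common pattern for 8-bit loading with transformers + bitsandbytes.
# Whether this project used exactly this path is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-14B"  # example: a model benchmarked here in 8-bit

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # spread layers across available GPU memory
)
```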

Efficiency Analysis

  • Smaller models can compete with larger models in certain specific tasks
  • GPU utilization time correlates positively with model size and complexity
  • Some medium-scale models demonstrate better cost-effectiveness

Project Resource Consumption

  • Total Machine Runtime: 18 days 8 hours
  • Equivalent GPU Time: 14 days 23 hours (RTX 5090 at 100% utilization)
  • Environmental Impact: Carbon neutralized through active use of public transportation 😊

Project Value

This comprehensive evaluation provides the open-source LLM community with:

  1. Objective performance comparison benchmarks
  2. Efficiency analysis of different scale models
  3. Task-specific model selection guidance
  4. Empirical data on quantization technique effectiveness

The complete data, scripts, and logs of this project have been open-sourced, providing a useful reference for researchers and developers.


Data Source: Hugging Face Spaces Leaderboard
Article Source: CurateClick

Tags: Open Source, Large Language Models, Benchmark Testing, AI, Machine Learning
Last updated: September 7, 2025