meituan-longcat/General365_Public
Hugging Face · Updated 2026-04-14 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/meituan-longcat/General365_Public
Resource description:
---
license: mit
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: answer_type
    dtype: string
  - name: float_round
    dtype: string
  splits:
  - name: test
    num_bytes: 625873
    num_examples: 720
  download_size: 261438
  dataset_size: 625873
task_categories:
- question-answering
language:
- en
tags:
- reasoning
- evaluation
size_categories:
- n<1K
---
<div align=center><h1>
🧩 General365: Benchmarking General Reasoning in LLMs Across Diverse and Challenging Tasks
</h1></div>
<p align="center">
📃 <a href="https://arxiv.org/abs/2604.11778" target="_blank">Paper</a> • 🌐 <a href="https://general365.github.io/" target="_blank">Project Page</a> • 🏆 <a href="https://general365.github.io/#Leaderboard" target="_blank">Leaderboard</a> •
💻 <a href="https://github.com/meituan-longcat/General365" target="_blank">GitHub</a>
<br>
</p>
## 📖 Introduction
We present **General365**, a highly challenging and diverse benchmark for evaluating the general reasoning capabilities of LLMs.
"General Reasoning" refers to reasoning tasks that depend exclusively on general knowledge.
We define general knowledge as knowledge within the K-12 scope (such as common sense, fundamental linguistics, and basic subject matter), excluding university-level academic knowledge.
Compared to domain-specific reasoning (e.g., Math Reasoning), general reasoning evaluation better decouples a model’s reasoning capability from its knowledge dependence.
This enables a more precise assessment of reasoning skills rather than rote memorization, while testing the generalization of a model's reasoning abilities across broader scenarios.
Current benchmarks for general reasoning face several challenges: a lack of difficulty, insufficient diversity, or overly synthetic characteristics.
Consequently, we introduce **General365**, a manually curated benchmark that is both highly challenging and highly diverse, intended to support more effective evaluation of reasoning capabilities in frontier models.
> To ensure the impartiality of the evaluation, we have released only half of the total questions. The remaining questions are maintained as a held-out test set to track potential data contamination within the open-source part.
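
The held-out comparison described in the note above can be illustrated with a small sketch. It assumes per-question correctness is available as lists of booleans for both the public and the held-out halves. The function names `accuracy` and `contamination_gap` are hypothetical and not part of the General365 tooling, and the card does not state how the authors actually measure contamination.

```python
def accuracy(results):
    """Fraction of True values in a list of per-question correctness flags."""
    return sum(results) / len(results) if results else 0.0


def contamination_gap(public_results, heldout_results):
    """Public accuracy minus held-out accuracy.

    A clearly positive gap is only a hint that the public questions may have
    leaked into training data; it is not proof on its own.
    """
    return accuracy(public_results) - accuracy(heldout_results)


if __name__ == "__main__":
    # Toy correctness flags, invented purely for illustration.
    public = [True, True, False, True]
    heldout = [True, False, False, True]
    gap = contamination_gap(public, heldout)
    print(f"public={accuracy(public):.2f} heldout={accuracy(heldout):.2f} gap={gap:+.2f}")
```

In practice the two halves would need to be comparable in difficulty for such a gap to be meaningful.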
## 🌟 Key Features
<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/pipeline.png" width="800">
</p>
- **High Diversity:** It contains 365 manually crafted, highly diverse seed problems, specifically designed to cover a wide range of reasoning challenges and avoid repetitive features or patterns. By altering surface semantics or constraints while preserving core reasoning skills, these seed problems were further expanded into 1,095 variants.
- **Challenging Boundaries:** General365 covers 8 challenging categories, as detailed in Section 2.1 of the paper. Even state-of-the-art models barely achieve a "passing" level of performance on these challenging tasks.
- **Focus on Reasoning over Knowledge:** The knowledge required is strictly confined to the K-12 scope, ensuring the dataset measures a model’s reasoning capabilities rather than knowledge retrieval.
- **Rigorous Quality Control:** All instances have undergone manual review to ensure the highest standards of quality.
- **Accurate Scoring:** We implemented a hybrid scoring algorithm combining rule-based and model-based approaches, achieving a manually verified scoring accuracy of 99.6%.
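
To make the idea of hybrid scoring concrete, here is a minimal sketch of one way rule-based and model-based checks could be combined: cheap rules decide the clear cases, and a judge callable handles the rest. This is not the algorithm in `grading.py`. The names `normalize`, `rule_based_match`, and `hybrid_score` are hypothetical, the `judge` argument stands in for a call to a grading model, and the `decimals` parameter is only loosely inspired by the `float_round` column, whose exact semantics this card does not document.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.strip().lower().split())


def rule_based_match(prediction, reference, decimals=None):
    """Return True or False when a rule can decide, or None when it cannot."""
    try:
        pred_num, ref_num = float(prediction), float(reference)
    except ValueError:
        # Non-numeric answers: an exact normalized match is decisive,
        # anything else is left for the model-based judge.
        return True if normalize(prediction) == normalize(reference) else None
    if decimals is not None:
        # Assumed behavior: compare after rounding both values.
        return round(pred_num, decimals) == round(ref_num, decimals)
    return pred_num == ref_num


def hybrid_score(prediction, reference, judge, decimals=None):
    """Try the rules first; fall back to judge(prediction, reference)."""
    verdict = rule_based_match(prediction, reference, decimals)
    if verdict is None:
        verdict = judge(prediction, reference)
    return verdict


if __name__ == "__main__":
    # Stand-in judge for demonstration; a real one would query a grading model.
    def always_reject(prediction, reference):
        return False

    print(hybrid_score("3.14159", "3.14", always_reject, decimals=2))
    print(hybrid_score("The answer is Paris", "Paris", always_reject))
```

Keeping the judge as a plain callable means the rule layer can be tested without any API access.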
## 🏆 Leaderboard
<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/general365_leaderboard.png" width="800">
</p>
## 📊 Main Results
<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/main_result.png" width="800">
</p>
## 🛠️ Quick Start
### Installation
Clone the repository:
```bash
git clone https://github.com/meituan-longcat/General365.git
cd General365
```
Install dependencies:
```bash
pip install -r requirements.txt
```
### Running evaluations
#### Step 1: Prepare the Model Response File
After obtaining model responses, format them as follows (one JSON object per line):
```data
{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...
```
Save this file in the `./model_responses/` directory.
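
As a rough illustration, a response file in this format could be written with a helper like the one below. The function `write_responses` and the file name `my_model.jsonl` are hypothetical. Whether `question_id` should carry the dataset's `id` column is an assumption worth checking against `example_responses.jsonl` in the repository.

```python
import json
from pathlib import Path


def write_responses(responses, path):
    """Write (question_id, model_response) pairs as one JSON object per line."""
    path = Path(path)
    # Create the target directory (e.g. ./model_responses/) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for question_id, model_response in responses:
            record = {"question_id": question_id, "model_response": model_response}
            # ensure_ascii=False keeps non-ASCII model output readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Placeholder responses; real ones would come from the model under test.
    demo = [(1, "The answer is 42."), (2, "Option B.")]
    print(write_responses(demo, "./model_responses/my_model.jsonl"))
```

The questions themselves can presumably be loaded with the Hugging Face `datasets` library, for example `load_dataset("meituan-longcat/General365_Public", split="test")`, before generating responses.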
#### Step 2: Grading Responses
Set your API key and URL in lines 10-11 of `grading.py`.
Then run:
```bash
python grading.py --response_file example_responses.jsonl
```
Evaluation results will be saved under the `./grading_results/` directory.
## 🔎 Citation
If you find our work helpful or relevant to your research, please kindly cite our paper:
```
@misc{general365benchmark,
title={General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks},
author={Junlin Liu and Shengnan An and Shuang Zhou and Dan Ma and Shixiong Luo and Ying Xie and Yuan Zhang and Wenling Yuan and Yifan Zhou and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={2604.11778},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.11778},
}
```
## 🤗 Acknowledgement
The evaluation script utilizes [Math-Verify](https://github.com/huggingface/Math-Verify) to parse and verify model outputs.
We greatly appreciate the contributors' efforts in providing this valuable tool.
## 📜 License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
## 📪 Support
For questions and support, please open an issue on GitHub or contact the maintainers.
Provider:
meituan-longcat



