
meituan-longcat/General365_Public

Hugging Face | Updated 2026-04-14 | Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/meituan-longcat/General365_Public
Description:
---
license: mit
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: answer_type
    dtype: string
  - name: float_round
    dtype: string
  splits:
  - name: test
    num_bytes: 625873
    num_examples: 720
  download_size: 261438
  dataset_size: 625873
task_categories:
- question-answering
language:
- en
tags:
- reasoning
- evaluation
size_categories:
- n<1K
---

<div align=center><h1>
🧩 General365: Benchmarking General Reasoning in LLMs Across Diverse and Challenging Tasks
</h1></div>

<p align="center">
📃 <a href="https://arxiv.org/abs/2604.11778" target="_blank">Paper</a> • 🌐 <a href="https://general365.github.io/" target="_blank">Project Page</a> • 🏆 <a href="https://general365.github.io/#Leaderboard" target="_blank">Leaderboard</a> • 💻 <a href="https://github.com/meituan-longcat/General365" target="_blank">GitHub</a>
<br>
</p>

## 📖 Introduction

We present **General365**, a highly challenging and diverse benchmark for evaluating the general reasoning capabilities of LLMs.

"General Reasoning" refers to reasoning tasks that depend exclusively on general knowledge. We define general knowledge as knowledge within the K-12 scope (such as common sense, fundamental linguistics, and basic subject matter), excluding university-level academic knowledge. Compared to domain-specific reasoning (e.g., Math Reasoning), general reasoning evaluation better decouples a model’s reasoning capability from its knowledge dependence. This enables a more precise assessment of reasoning skills rather than rote memorization, while testing how well a model's reasoning abilities generalize across broader scenarios.

Current benchmarks for general reasoning face several challenges: a lack of difficulty, insufficient diversity, or overly synthetic characteristics.
Consequently, we introduce **General365**, a manually curated benchmark characterized by high challenge and high diversity, aiming to facilitate more effective evaluation of reasoning capabilities in frontier models.

> To ensure the impartiality of the evaluation, we have released only half of the total questions. The remaining questions are maintained as a held-out test set to track potential data contamination within the open-source part.

## 🌟 Key Features

<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/pipeline.png" width="800">
</p>

- **High Diversity:** It contains 365 manually crafted, highly diverse seed problems, specifically designed to cover a wide range of reasoning challenges and avoid repetitive features or patterns. By altering surface semantics or constraints while preserving core reasoning skills, these seed problems were further expanded into 1,095 variants.
- **Challenging Boundaries:** General365 covers 8 challenging categories, as detailed in Section 2.1 of the paper. Even state-of-the-art models barely achieve a "passing" level of performance on these challenging tasks.
- **Focus on Reasoning over Knowledge:** The knowledge required is strictly confined to the K-12 scope, ensuring the dataset measures a model’s reasoning capabilities rather than knowledge retrieval.
- **Rigorous Quality Control:** All instances have undergone manual review to ensure the highest standards of quality.
- **Accurate Scoring:** We implemented a hybrid scoring algorithm combining rule-based and model-based approaches, achieving a manually verified scoring accuracy of 99.6%.
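
As a minimal sketch (not part of the official repository), the released questions can be loaded with the Hugging Face `datasets` library and checked against the fields declared in the dataset card metadata (`id`, `question`, `answer`, `answer_type`, `float_round`, all strings). The helper names `missing_fields` and `load_test_split` are hypothetical, and the download step assumes network access to the Hub.

```python
from typing import Iterable, Mapping

# Field names taken from the dataset card metadata; all are declared as strings.
EXPECTED_FIELDS = ("id", "question", "answer", "answer_type", "float_round")


def missing_fields(record: Mapping[str, object],
                   expected: Iterable[str] = EXPECTED_FIELDS) -> list:
    """Return the expected field names that are absent from `record`."""
    return [name for name in expected if name not in record]


def load_test_split():
    """Download the public `test` split (requires `pip install datasets`)."""
    # Imported lazily so that `missing_fields` has no third-party dependency.
    from datasets import load_dataset
    return load_dataset("meituan-longcat/General365_Public", split="test")


# Usage (needs network access):
#   ds = load_test_split()
#   print(len(ds))                # the dataset card declares 720 examples
#   print(missing_fields(ds[0]))  # an empty list means the record matches the schema
```

Keeping the schema check separate from the download makes it easy to run the same check on locally cached or hand-built records.
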
## 🏆 Leaderboard

<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/general365_leaderboard.png" width="800">
</p>

## 📊 Main Results

<p align="center">
<img src="https://github.com/meituan-longcat/General365/raw/main/figures/main_result.png" width="800">
</p>

## 🛠️ Quick Start

### Installation

Clone the repository:

```bash
git clone https://github.com/meituan-longcat/General365.git
cd General365
```

Install dependencies:

```bash
pip install -r requirements.txt
```

### Running evaluations

#### Step 1: Prepare the Model Response File

After obtaining model responses, format them as follows (one JSON object per line):

```data
{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...
```

Save this file in the `./model_responses/` directory.

#### Step 2: Grading Responses

Set your API key and URL in lines 10-11 of `grading.py`. Then run:

```bash
python grading.py --response_file example_responses.jsonl
```

Evaluation results will be saved under the `./grading_results/` directory.

## 🔎 Citation

If you find our work helpful or relevant to your research, please kindly cite our paper:

```
@misc{general365benchmark,
  title={General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks},
  author={Junlin Liu and Shengnan An and Shuang Zhou and Dan Ma and Shixiong Luo and Ying Xie and Yuan Zhang and Wenling Yuan and Yifan Zhou and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
  year={2026},
  eprint={2604.11778},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.11778},
}
```

## 🤗 Acknowledgement

The evaluation script utilizes [Math-Verify](https://github.com/huggingface/Math-Verify) to parse and verify model outputs. We greatly appreciate the contributors' efforts in providing this valuable tool.

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
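
The response file described in Step 1 of the Quick Start (one JSON object per line, with `question_id` and `model_response` keys) could be produced along the following lines. This is a sketch under assumptions: the helpers `write_responses` and `read_responses` and the output file name are hypothetical, and the value used for `question_id` must be whatever identifier `grading.py` expects (the example in Step 1 uses integers).

```python
import json
from pathlib import Path


def write_responses(responses: dict, out_path) -> Path:
    """Write one JSON object per line with `question_id` and `model_response` keys."""
    out_path = Path(out_path)
    # Create the target directory (e.g. ./model_responses/) if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for question_id, model_response in responses.items():
            row = {"question_id": question_id, "model_response": model_response}
            # ensure_ascii=False keeps non-ASCII model output readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out_path


def read_responses(path) -> list:
    """Read the file back, skipping blank lines, as a sanity check before grading."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_responses({1: "...", 2: "..."}, "./model_responses/my_model.jsonl")` writes two lines in the format shown in Step 1; the resulting file name would then be passed to `grading.py` via `--response_file` as in Step 2.
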
## 📪 Support

For questions and support, please open an issue on GitHub or contact the maintainers.
Provider: meituan-longcat