CHC-Bench
ModelScope community · Updated 2025-12-05 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/m-a-p/CHC-Bench
# Dataset Card for "CHC-Bench"
[**🌐 Homepage**](https://chinese-tiny-llm.github.io) | [**🤗 MAP-CC**](https://huggingface.co/datasets/m-a-p/MAP-CC) | [**🤗 CHC-Bench**](https://huggingface.co/datasets/m-a-p/CHC-Bench) | [**🤗 CT-LLM**](https://huggingface.co/collections/m-a-p/chinese-tiny-llm-660d0133dff6856f94ce0fc6) | [**📖 arXiv**](https://arxiv.org/abs/2404.04167) | [**GitHub**](https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM)
## Introduction
We propose a carefully curated, multidisciplinary Chinese Hard Case Benchmark ([CHC-Bench](https://huggingface.co/datasets/m-a-p/CHC-Bench/)). We collect problems from various sources, e.g. [ziya](https://huggingface.co/datasets/IDEA-CCNL/Ziya-Writing-Eval-Chinese), [gaokao](https://huggingface.co/datasets/dmayhem93/agieval-gaokao-chinese), and [CIF-Bench](https://arxiv.org/html/2402.13109v1), to form a hard-case benchmark for evaluating Chinese instruction understanding and following (CHC-Bench for short). The problem categories in CHC-Bench include writing, humanities and history, science, math, reading comprehension, role-playing, and hard cases of Chinese understanding (e.g. Chinese word pronunciation and ancient Chinese language understanding).
## Evaluation Method
Because 2-billion-parameter models have limited capability, our evaluation criteria go beyond whether a response is simply correct. We also consider the usefulness, relevance, accuracy, depth, creativity, and level of detail of the model's answers, which allows a fine-grained assessment of response quality. Specifically, we use [GPT-4](https://arxiv.org/abs/2303.08774) to score the responses of the tested LLMs in the context of each problem. The score-assignment prompt template is translated from [MT-Bench](https://arxiv.org/pdf/2306.05685.pdf).
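The scoring step above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the English `JUDGE_TEMPLATE` is a hypothetical stand-in for the translated template, the helper names are invented for this example, and the `[[rating]]` marker is assumed to follow the MT-Bench single-answer grading convention. The call to the judge model itself is left out.

```python
import re
from typing import Optional

# Illustrative template only (an assumption): it paraphrases the criteria
# listed above and asks for an MT-Bench-style "[[rating]]" marker.
JUDGE_TEMPLATE = (
    "Evaluate the assistant's answer to the question below. Consider "
    "usefulness, relevance, accuracy, depth, creativity, and level of detail. "
    "Explain briefly, then rate the answer from 1 to 10 in the form "
    "\"Rating: [[5]]\".\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant's Answer]\n{answer}\n"
)

# Matches the first "[[number]]" marker in the judge's reply.
_RATING = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the (hypothetical) template with one question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_rating(judgment: str) -> Optional[float]:
    """Return the rating found in the judge's reply, or None if absent or out of range."""
    match = _RATING.search(judgment)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 1.0 <= score <= 10.0 else None
```

Returning `None` for a missing or out-of-range rating lets the caller decide whether to retry the judge or drop the sample, rather than silently counting it as a zero.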
## Results

Table 6 of the [paper](https://arxiv.org/abs/2404.04167) compares our model's performance on CHC-Bench with other models of the same scale, and Appendix E.3 gives comparisons with larger-scale models. CHC-Bench can also be used to assess a model's expertise in specific domains. For instance, Deepseek-coder-1.3b-instruct, which is designed for coding tasks, demonstrates its skill with high scores. The benchmarking results affirm that CHC-Bench accurately reflects models' true capabilities. Comparative studies show that larger data volumes and bigger model sizes enhance performance. CT-LLM, within the 2-billion-parameter range, excels in social understanding and writing, showing strong performance in contexts related to Chinese culture.
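One way to read domain-level expertise from judge scores is to average them per problem category. The sketch below assumes each scored sample is available as a `(category, score)` pair. That record layout and the function name are assumptions made for illustration, and the sample scores are made up rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_category(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (category, score) pairs by category and average each group."""
    buckets = defaultdict(list)
    for category, score in records:
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


# Made-up scores, for illustration only.
sample = [("writing", 8.0), ("writing", 6.0), ("math", 3.0)]
per_category = mean_score_by_category(sample)
```

Reporting per-category means alongside the overall mean keeps a model's strength in one domain from being hidden by weaker scores elsewhere.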
## Disclaimer
This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of model applications prevent us from ensuring absolute accuracy or appropriateness of the model outputs in every scenario.
It is essential to highlight that our model and its associated training data are intended solely for scholarly research. We explicitly disclaim any liability for problems that may arise from improper use, interpretation errors, unlawful activities, the dissemination of false information, or any data security issues related to the utilization of our model or its training data.
We strongly encourage users to report any concerns related to data misuse, security breaches, or potential infringement issues directly to us for immediate investigation and resolution.
#### Contact: `ge.zhang@uwaterloo.ca`; `duxinrun2000@gmail.com`
Our commitment to responsible data sharing and the security of our academic tools is paramount. We thank you for your cooperation in maintaining the ethical use of this technology.
Provider:
maas
Created:
2024-04-13



