TCEval-v2
Source: ModelScope community · Updated 2026-01-02 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/MediaTek-Research/TCEval-v2
Resource description:
# TCEval v2
TCEval-v2 is a Traditional Chinese evaluation suite for foundation models, derived from TCEval-v1. It covers 5 capabilities: contextual QA, knowledge, classification, table understanding, and chat and instruction following.
## Benchmark
- **Contextual QA**
  - **drcd**: DRCD is a Traditional Chinese machine reading comprehension dataset containing 10,014 paragraphs from 2,108 Wikipedia articles and over 30,000 questions.
- **Knowledge**
  - **tmmluplus** (provided by MediaTek Research and iKala): Taiwan Massive Multitask Language Understanding + (TMMLU+) is curated from examinations in Taiwan. It consists of 67 subjects spanning multiple disciplines, from vocational to academic fields, and covers elementary to professional proficiency levels. It is designed to identify a model’s knowledge and problem-solving blind spots, similar to human evaluations. For a higher-level overview of model capabilities, the subjects are grouped into STEM, humanities, social sciences, and other (similar to MMLU).
- **Table Understanding**
  - **penguin_table** (translated from a subset of [BIG-Bench](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/penguins_in_a_table)): The “penguins in a table” task in BIG-bench asks a language model to answer questions about the animals listed in one or more tables described in the context.
- **Chat and instruction following**
  - **mt_bench_tw** (translated from [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench)): MT-Bench-TW is a Traditional Chinese version of MT-Bench, a series of open-ended questions that evaluate a chatbot’s multi-turn conversational and instruction-following ability. MT-Bench-TW inherits the categorization of MT-Bench, which covers a wide variety of core capabilities, such as reasoning and writing.
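
The four-way grouping described for tmmluplus (STEM, humanities, social sciences, other) suggests reporting accuracy per group as well as overall. The following is a minimal sketch of that bookkeeping, not the benchmark's official scoring code. The record fields `category`, `answer`, and `prediction` are hypothetical names chosen for illustration and may not match the dataset's actual schema, and the macro average across groups is an assumption about how one might summarize the results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return ({category: accuracy}, macro_average) for multiple-choice records.

    Each record is a dict with hypothetical keys:
      "category"   - one of the high-level groups, e.g. "STEM"
      "answer"     - the gold option label, e.g. "A"
      "prediction" - the model's chosen option label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["category"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["category"]] += 1

    per_category = {c: correct[c] / total[c] for c in total}
    # Macro average: every category counts equally, regardless of its size.
    macro = sum(per_category.values()) / len(per_category) if per_category else 0.0
    return per_category, macro


# Toy data, made up for illustration only (not taken from the dataset).
toy_records = [
    {"category": "STEM", "answer": "A", "prediction": "A"},
    {"category": "STEM", "answer": "B", "prediction": "C"},
    {"category": "humanities", "answer": "D", "prediction": "D"},
    {"category": "social sciences", "answer": "A", "prediction": "B"},
    {"category": "other", "answer": "C", "prediction": "C"},
]

per_category, macro = accuracy_by_category(toy_records)
for name, acc in per_category.items():
    print(f"{name}: {acc:.2f}")
print(f"macro average: {macro:.3f}")
```

A macro average keeps a large group from dominating the summary; a micro average (total correct over total questions) would be the alternative if every question should count equally.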
If you find the dataset useful in your work, please cite:
```
@misc{hsu2023advancing,
title={Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite},
author={Chan-Jan Hsu and Chang-Le Liu and Feng-Ting Liao and Po-Chun Hsu and Yi-Chang Chen and Da-shan Shiu},
year={2023},
eprint={2309.08448},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
maas
Created:
2025-02-19



