
ChouBun

ModelScope community | Updated 2025-12-04 | Indexed 2025-01-18
Download link:
https://modelscope.cn/datasets/SakanaAI/ChouBun
Resource description:
# ChouBun

## Dataset Description

**ChouBun** is a benchmark for assessing the performance of LLMs on long-context tasks in Japanese. It was created for and introduced in the paper [An Evolved Universal Transformer Memory](https://arxiv.org/abs/2410.13166). The benchmark includes documents from multiple websites and synthetic question-answer pairs generated by GPT-4 variants and Claude-3.5-Sonnet. The current version of ChouBun contains 2 task categories (extractive QA and abstractive summarization) and 4 tasks, listed below.

- `wiki_qa` is an extractive QA task about 20 randomly sampled articles from the 20240429 dump of [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch/). Each article corresponds to 10 QA pairs, for 200 QA pairs in total.
- `edinet_qa` is an extractive QA task based on 20 securities reports from [EDINET](https://disclosure2.edinet-fsa.go.jp/). The EDINET securities reports are in CSV format. There are 390 QA pairs in total.
- `corp_sec_qa` is another extractive QA task, based on 30 securities reports downloaded from three corporate websites ([MUFG](https://www.mufg.jp/ir/report/security_report/), [NTT](https://group.ntt/jp/ir/library/results/), and [Toyota](https://global.toyota/jp/ir/library/securities-report/)). We extracted the text from the original PDF files. There are 150 QA pairs in total.
- `corp_sec_sum` is an abstractive summarization task based on the same data as `corp_sec_qa`. Each document corresponds to one data point, and we collected 5 reference summaries for each data point.

## Usage

```python
from datasets import load_dataset

# Names of the four ChouBun tasks.
datasets = ["wiki_qa", "edinet_qa", "corp_sec_qa", "corp_sec_sum"]

for dataset in datasets:
    # Load the test split of each task.
    data = load_dataset("SakanaAI/ChouBun", dataset, split="test")
```

## Data Format

**ChouBun** adopts the same data format as [THUDM/LongBench](https://huggingface.co/datasets/THUDM/LongBench/), and each example has the following fields.

```json
{
    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc",
    "context": "The long context required for the task, such as documents, cross-file code, few-shot examples in Few-shot tasks",
    "answers": "A List of all true answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset to which this piece of data belongs",
    "language": "The language of this piece of data",
    "all_classes": "All categories in classification tasks, null for non-classification tasks",
    "_id": "Random id for each piece of data"
}
```

## Benchmark

| Model (*max. input length*)          | wiki_qa | edinet_qa | corp_sec_qa | corp_sec_sum | Overall |
|:-------------------------------------|--------:|----------:|------------:|-------------:|--------:|
| mistralai/Mistral-7B-v0.1 (*32768*)  |    8.68 |      8.34 |       16.25 |        10.50 |   10.94 |
| rinna/llama-3-youko-8b (*8192*)      |   16.68 |     12.23 |       17.03 |        22.27 |   17.05 |
| meta-llama/Meta-Llama-3-8B (*8192*)  |   14.58 |     14.77 |       16.86 |        22.84 |   17.27 |
| meta-llama/Llama-2-7b-hf (*2048*)    |   16.77 |      9.92 |       20.86 |        21.97 |   17.38 |
| 01-ai/yi-6b-200k (*200000*)          |   30.36 |     23.64 |       38.09 |        21.11 |   28.30 |
| elyza/Llama-3-ELYZA-JP-8B (*8192*)   |   20.77 |     21.45 |       35.59 |        40.21 |   29.50 |

## Citation

~~~
@article{sakana2024memory,
  title={An Evolved Universal Transformer Memory},
  author={Edoardo Cetin and Qi Sun and Tianyu Zhao and Yujin Tang},
  year={2024},
  eprint={2410.13166},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.13166},
}
~~~
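As a companion to the Data Format section, the following is a minimal sketch of a sanity check that a loaded example carries the eight listed fields. The field names come from the field list above; the expected Python types are assumptions inferred from the field descriptions, and `check_example` and `sample` are hypothetical names introduced here for illustration (the sample record is made up, not real ChouBun data).

```python
from typing import Any

# Field names are taken from the Data Format section. The Python types are
# ASSUMPTIONS inferred from the field descriptions, not stated by the source.
EXPECTED_FIELDS: dict[str, tuple[type, ...]] = {
    "input": (str,),
    "context": (str,),
    "answers": (list,),
    "length": (int,),
    "dataset": (str,),
    "language": (str,),
    "all_classes": (list, type(None)),  # null for non-classification tasks
    "_id": (str,),
}


def check_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the record looks OK."""
    problems = []
    for name, types in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], types):
            actual = type(example[name]).__name__
            problems.append(f"unexpected type for {name}: {actual}")
    return problems


# Hypothetical record for illustration only (not real ChouBun data).
sample = {
    "input": "example question",
    "context": "example long document text",
    "answers": ["example answer"],
    "length": 42,
    "dataset": "wiki_qa",
    "language": "ja",
    "all_classes": None,
    "_id": "0000",
}

print(check_example(sample))
```

In practice the same function could be applied to each row returned by `load_dataset(...)` in the Usage section, since rows of a loaded split are accessed as plain dictionaries.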
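The Benchmark table does not say how the Overall column is computed. The sketch below checks one plausible reading: that Overall is the unweighted mean of the four per-task scores. This is an observation from the published numbers, not a statement by the authors; the scores are copied from the table, and `mean_of_tasks` is a hypothetical helper name.

```python
# Scores copied from the Benchmark table, in the order
# (wiki_qa, edinet_qa, corp_sec_qa, corp_sec_sum, reported Overall).
SCORES = {
    "mistralai/Mistral-7B-v0.1": (8.68, 8.34, 16.25, 10.50, 10.94),
    "rinna/llama-3-youko-8b": (16.68, 12.23, 17.03, 22.27, 17.05),
    "meta-llama/Meta-Llama-3-8B": (14.58, 14.77, 16.86, 22.84, 17.27),
    "meta-llama/Llama-2-7b-hf": (16.77, 9.92, 20.86, 21.97, 17.38),
    "01-ai/yi-6b-200k": (30.36, 23.64, 38.09, 21.11, 28.30),
    "elyza/Llama-3-ELYZA-JP-8B": (20.77, 21.45, 35.59, 40.21, 29.50),
}


def mean_of_tasks(row: tuple[float, ...]) -> float:
    """Unweighted mean of the four per-task scores (ASSUMED definition of Overall)."""
    return sum(row[:4]) / 4


for model, row in SCORES.items():
    mean = mean_of_tasks(row)
    diff = abs(mean - row[4])
    print(f"{model}: mean={mean:.4f} reported={row[4]:.2f} diff={diff:.4f}")
```

Under this assumption every row agrees with the reported Overall value to within about 0.01, which is what one would expect if the average was taken over unrounded per-task scores.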

Provider:
maas
Created:
2025-01-17