ChouBun
Source: ModelScope community (魔搭社区). Updated 2025-12-04; indexed 2025-01-18.
Download link:
https://modelscope.cn/datasets/SakanaAI/ChouBun
# ChouBun
## Dataset Description
**ChouBun** is a benchmark for assessing the performance of LLMs on long-context tasks in Japanese.
It was created and introduced in the paper [An Evolved Universal Transformer Memory](https://arxiv.org/abs/2410.13166).
The benchmark includes documents from multiple websites and synthetic question-answer pairs generated by GPT-4 variants and Claude-3.5-Sonnet.
The current version of ChouBun contains 2 task categories (extractive QA and abstractive summarization) and the 4 tasks listed below.
- `wiki_qa` is an extractive QA task based on 20 articles randomly sampled from the 20240429 dump of [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch/). Each article has 10 QA pairs, for 200 QA pairs in total.
- `edinet_qa` is an extractive QA task based on 20 securities reports from [EDINET](https://disclosure2.edinet-fsa.go.jp/). The EDINET securities reports are in CSV format. There are 390 QA pairs in total.
- `corp_sec_qa` is another extractive QA task, based on 30 securities reports downloaded from three corporate websites ([MUFG](https://www.mufg.jp/ir/report/security_report/), [NTT](https://group.ntt/jp/ir/library/results/), and [Toyota](https://global.toyota/jp/ir/library/securities-report/)). The text was extracted from the original PDF files. There are 150 QA pairs in total.
- `corp_sec_sum` is an abstractive summarization task based on the same data as `corp_sec_qa`. Each document corresponds to one data point, and 5 reference summaries were collected for each data point.
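
As a quick sanity check on the sizes stated above, the following sketch tallies the examples per task. The name `TASK_SIZES` is illustrative, and the figure of 30 examples for `corp_sec_sum` is an inference from "each document corresponds to one data point" applied to the 30 reports; the description does not state it directly.

```python
# Task sizes as stated in the task descriptions above.
# Assumption: corp_sec_sum has 30 examples (one per report); inferred, not stated.
TASK_SIZES = {
    "wiki_qa": 20 * 10,  # 20 articles x 10 QA pairs each
    "edinet_qa": 390,
    "corp_sec_qa": 150,
    "corp_sec_sum": 30,
}

# Total QA pairs across the three extractive QA tasks.
qa_pairs = sum(n for task, n in TASK_SIZES.items() if task.endswith("_qa"))

# Each summarization data point has 5 reference summaries.
reference_summaries = TASK_SIZES["corp_sec_sum"] * 5

print(qa_pairs)             # 740
print(reference_summaries)  # 150
```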
## Usage
```python
from datasets import load_dataset

# Each of the four ChouBun tasks is loaded by name from its "test" split.
datasets = ["wiki_qa", "edinet_qa", "corp_sec_qa", "corp_sec_sum"]
for dataset in datasets:
    data = load_dataset("SakanaAI/ChouBun", dataset, split="test")
```
## Data Format
**ChouBun** adopts the same data format as [THUDM/LongBench](https://huggingface.co/datasets/THUDM/LongBench/); each example has the following fields.
```json
{
"input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc",
"context": "The long context required for the task, such as documents, cross-file code, few-shot examples in Few-shot tasks",
"answers": "A List of all true answers",
"length": "Total length of the first three items (counted in characters for Chinese and words for English)",
"dataset": "The name of the dataset to which this piece of data belongs",
"language": "The language of this piece of data",
"all_classes": "All categories in classification tasks, null for non-classification tasks",
"_id": "Random id for each piece of data"
}
```
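
A loaded example can be checked against the field list above with a small helper. This is a minimal sketch: the field names come from the listing, but the expected Python types, the helper name `check_example`, and the sample record are assumptions made for illustration and are not specified by the dataset card.

```python
from typing import Any

# Field names come from the format listing above; the expected Python types
# are assumptions for illustration, not something the dataset card specifies.
EXPECTED_FIELDS = {
    "input": str,
    "context": str,
    "answers": list,
    "length": int,
    "dataset": str,
    "language": str,
    "_id": str,
}


def check_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one example (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # all_classes is null (None in Python) for non-classification tasks.
    if "all_classes" not in example:
        problems.append("missing field: all_classes")
    elif example["all_classes"] is not None and not isinstance(example["all_classes"], list):
        problems.append("all_classes should be a list or None")
    return problems


# Made-up record for demonstration only; it is not taken from ChouBun.
sample = {
    "input": "質問の例",
    "context": "文書の例",
    "answers": ["回答の例"],
    "length": 12,
    "dataset": "wiki_qa",
    "language": "ja",
    "all_classes": None,
    "_id": "example-0",
}

print(check_example(sample))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.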
## Benchmark
| Model (*max. input length*) | wiki_qa | edinet_qa | corp_sec_qa | corp_sec_sum | Overall |
|:-------------------------------------|---------:|-----------:|------------:|-------------:|----------:|
| mistralai/Mistral-7B-v0.1 (*32768*) | 8.68 | 8.34 | 16.25 | 10.50 | 10.94 |
| rinna/llama-3-youko-8b (*8192*) | 16.68 | 12.23 | 17.03 | 22.27 | 17.05 |
| meta-llama/Meta-Llama-3-8B (*8192*) | 14.58 | 14.77 | 16.86 | 22.84 | 17.27 |
| meta-llama/Llama-2-7b-hf (*2048*) | 16.77 | 9.92 | 20.86 | 21.97 | 17.38 |
| 01-ai/yi-6b-200k (*200000*) | 30.36 | 23.64 | 38.09 | 21.11 | 28.30 |
| elyza/Llama-3-ELYZA-JP-8B (*8192*) | 20.77 | 21.45 | 35.59 | 40.21 | 29.50 |
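
The card does not say how the Overall column is computed. The numbers are consistent with an unweighted mean of the four task scores, to within 0.01, and the small gaps presumably come from rounding the per-task scores. The sketch below checks that reading against the table; the names `SCORES` and `mean_gap` are illustrative.

```python
# Scores copied from the table above, in the order:
# (wiki_qa, edinet_qa, corp_sec_qa, corp_sec_sum, Overall).
SCORES = {
    "mistralai/Mistral-7B-v0.1": (8.68, 8.34, 16.25, 10.50, 10.94),
    "rinna/llama-3-youko-8b": (16.68, 12.23, 17.03, 22.27, 17.05),
    "meta-llama/Meta-Llama-3-8B": (14.58, 14.77, 16.86, 22.84, 17.27),
    "meta-llama/Llama-2-7b-hf": (16.77, 9.92, 20.86, 21.97, 17.38),
    "01-ai/yi-6b-200k": (30.36, 23.64, 38.09, 21.11, 28.30),
    "elyza/Llama-3-ELYZA-JP-8B": (20.77, 21.45, 35.59, 40.21, 29.50),
}


def mean_gap(row):
    """Absolute gap between the reported Overall and the mean of the task scores."""
    *task_scores, overall = row
    return abs(sum(task_scores) / len(task_scores) - overall)


# Assumption being tested: Overall is the unweighted mean of the four tasks.
max_gap = max(mean_gap(row) for row in SCORES.values())
print(max_gap <= 0.01)  # True
```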
## Citation
~~~
@article{sakana2024memory,
title={An Evolved Universal Transformer Memory},
author={Edoardo Cetin and Qi Sun and Tianyu Zhao and Yujin Tang},
year={2024},
eprint={2410.13166},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.13166},
}
~~~
Provider: maas
Created: 2025-01-17



