trueteacher

Name: trueteacher
Creator: maas
Published: 2025-12-05 12:14:07
License: No description provided

ModelScope community; updated 2025-12-05, indexed 2025-04-26

Download link:

https://modelscope.cn/datasets/google/trueteacher

Resource description:

# **TrueTeacher**

## Dataset Summary

This is a large-scale synthetic dataset for training **Factual Consistency Evaluation** models, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://aclanthology.org/2023.emnlp-main.127.pdf).

## Dataset Details

The dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf), which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).

Summaries were generated using summarization models of different capacities, which were created by fine-tuning **T5** [(Raffel et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) on the **XSum** dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf). We used the following 5 capacities: T5-11B, T5-3B, T5-large, T5-base and T5-small.

## Data format

The data contains JSON lines with the following keys:

- `"summarization_model"` - The summarization model used to generate the summary.
- `"cnndm_id"` - The original id from the CNN/DailyMail dataset. This id is needed to retrieve the corresponding article from CNN/DailyMail (which was used as the grounding document).
- `"summary"` - The model-generated summary.
- `"label"` - A binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).

Here is an example of a single data item:

```json
{
  "summarization_model": "T5-11B",
  "cnndm_id": "f72048a23154de8699c307e2f41157abbfcae261",
  "summary": "Children's brains are being damaged by prolonged internet access, a former children's television presenter has warned.",
  "label": "1"
}
```

## Loading the dataset

To use the dataset, you need to fetch the relevant documents from the CNN/DailyMail dataset. The following code can be used for that purpose:

```python
from datasets import load_dataset
from tqdm import tqdm

# Load the TrueTeacher examples and the CNN/DailyMail train split.
trueteacher_data = load_dataset("google/trueteacher", split='train')
cnn_dailymail_data = load_dataset("cnn_dailymail", version="3.0.0", split='train')

# Index the CNN/DailyMail articles by their id for fast lookup.
cnn_dailymail_articles_by_id = {example['id']: example['article'] for example in cnn_dailymail_data}

# Attach the grounding document to every TrueTeacher example.
trueteacher_data_with_documents = []
for example in tqdm(trueteacher_data):
    example['document'] = cnn_dailymail_articles_by_id[example['cnndm_id']]
    trueteacher_data_with_documents.append(example)
```

## Intended Use

This dataset is intended for research use (**non-commercial**) in English.

The recommended use case is training factual consistency evaluation models for summarization.

## Out-of-scope use

Any use case that violates the **cc-by-nc-4.0** license. Usage in languages other than English.

## Citation

If you use this dataset for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.

```
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
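The record format and the join against CNN/DailyMail described in this card can also be exercised offline, without downloading either dataset. The following is a minimal sketch, not the authors' code: the helper names `parse_record` and `attach_documents`, the `REQUIRED_KEYS` set, the second record, the id `hypothetical-id-0002`, and the placeholder article text are all illustrative assumptions. It also assumes that labels arrive as the strings `"1"` and `"0"`, as in the example item, and converts them to integers.

```python
import json
from collections import Counter

# Keys listed in the "Data format" section of this card.
REQUIRED_KEYS = {"summarization_model", "cnndm_id", "summary", "label"}


def parse_record(line: str) -> dict:
    """Parse one JSON line and convert the label to an int (1 or 0)."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    label = int(record["label"])  # works for "1"/"0" as well as 1/0
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {record['label']!r}")
    record["label"] = label
    return record


def attach_documents(records, articles_by_id):
    """Attach the grounding article to each record.

    Returns the joined records and the ids that had no matching article,
    instead of raising KeyError on the first missing id.
    """
    joined, unmatched = [], []
    for record in records:
        article = articles_by_id.get(record["cnndm_id"])
        if article is None:
            unmatched.append(record["cnndm_id"])
            continue
        joined.append({**record, "document": article})
    return joined, unmatched


# First item: the example from this card. Second item: made up for illustration.
raw_lines = [
    json.dumps({
        "summarization_model": "T5-11B",
        "cnndm_id": "f72048a23154de8699c307e2f41157abbfcae261",
        "summary": "Children's brains are being damaged by prolonged internet "
                   "access, a former children's television presenter has warned.",
        "label": "1",
    }),
    json.dumps({
        "summarization_model": "T5-small",
        "cnndm_id": "hypothetical-id-0002",
        "summary": "A made-up summary used only for illustration.",
        "label": "0",
    }),
]

# Stand-in for the id -> article mapping built from CNN/DailyMail.
articles_by_id = {
    "f72048a23154de8699c307e2f41157abbfcae261":
        "Placeholder article text (not the real CNN/DailyMail article).",
}

records = [parse_record(line) for line in raw_lines]
joined, unmatched = attach_documents(records, articles_by_id)

# Count labels per summarization model, e.g. to inspect class balance.
label_counts = Counter((r["summarization_model"], r["label"]) for r in records)

print(len(joined), "joined;", len(unmatched), "unmatched")
```

Collecting unmatched ids rather than indexing the dictionary directly is a design choice for this sketch: it makes it easy to see whether any `cnndm_id` fails to resolve before training on the joined data.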


Provider:

maas

Created:

2025-04-21
