trueteacher

ModelScope community · Updated 2025-12-05 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/trueteacher
Resource description:
# **TrueTeacher**

## Dataset Summary

This is a large-scale synthetic dataset for training **Factual Consistency Evaluation** models, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://aclanthology.org/2023.emnlp-main.127.pdf).

## Dataset Details

The dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf), which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).

Summaries were generated using summarization models with different capacities, which were created by fine-tuning **T5** [(Raffel et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) on the **XSum** dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf). We used the following 5 capacities: T5-11B, T5-3B, T5-large, T5-base and T5-small.

## Data format

The data contains JSON lines with the following keys:

- `"summarization_model"` - The summarization model used to generate the summary.
- `"cnndm_id"` - The original ID from the CNN/DailyMail dataset. It is needed to retrieve the corresponding article from CNN/DailyMail (which was used as the grounding document).
- `"summary"` - The model-generated summary.
- `"label"` - A binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).

Here is an example of a single data item:

```json
{
  "summarization_model": "T5-11B",
  "cnndm_id": "f72048a23154de8699c307e2f41157abbfcae261",
  "summary": "Children's brains are being damaged by prolonged internet access, a former children's television presenter has warned.",
  "label": "1"
}
```

## Loading the dataset

To use the dataset, you need to fetch the relevant documents from the CNN/DailyMail dataset. The following code can be used for that purpose:

```python
from datasets import load_dataset
from tqdm import tqdm

trueteacher_data = load_dataset("google/trueteacher", split='train')
# "3.0.0" is passed as the CNN/DailyMail configuration name.
cnn_dailymail_data = load_dataset("cnn_dailymail", "3.0.0", split='train')

# Index the CNN/DailyMail articles by id for fast lookup.
cnn_dailymail_articles_by_id = {example['id']: example['article'] for example in cnn_dailymail_data}

# Attach the grounding document to every TrueTeacher example.
trueteacher_data_with_documents = []
for example in tqdm(trueteacher_data):
    example['document'] = cnn_dailymail_articles_by_id[example['cnndm_id']]
    trueteacher_data_with_documents.append(example)
```

## Intended Use

This dataset is intended for research use (**non-commercial**) in English.

The recommended use case is training factual consistency evaluation models for summarization.

## Out-of-scope use

Any use cases which violate the **cc-by-nc-4.0** license.

Usage in languages other than English.

## Citation

If you use this dataset for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.

```
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
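
As a complement to the data format and loading steps described in this card, here is a minimal sketch of how a single JSON line might be validated and paired with its grounding document. It uses only the standard library. The helper names `parse_record` and `attach_documents` are hypothetical, the article text is a placeholder rather than real CNN/DailyMail content, and the sketch assumes labels are stored as the strings `'1'` and `'0'`, as in the example item.

```python
import json

# Keys listed in the "Data format" section of this card.
REQUIRED_KEYS = ("summarization_model", "cnndm_id", "summary", "label")


def parse_record(line):
    """Parse one JSON line and return a dict with an integer label (hypothetical helper)."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"record is missing keys: {missing}")
    # Assumption: labels are the strings '1' / '0'; convert them to integers.
    label = int(record["label"])
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {record['label']!r}")
    return {**record, "label": label}


def attach_documents(records, articles_by_id):
    """Pair each record with its article and collect ids that have no article (hypothetical helper)."""
    paired, missing_ids = [], []
    for record in records:
        article = articles_by_id.get(record["cnndm_id"])
        if article is None:
            missing_ids.append(record["cnndm_id"])
            continue
        paired.append({**record, "document": article})
    return paired, missing_ids


# The example item from this card, serialized as one JSON line.
sample_line = json.dumps({
    "summarization_model": "T5-11B",
    "cnndm_id": "f72048a23154de8699c307e2f41157abbfcae261",
    "summary": "Children's brains are being damaged by prolonged internet access, "
               "a former children's television presenter has warned.",
    "label": "1",
})

# Toy stand-in for the id -> article mapping; the text is a placeholder.
toy_articles = {"f72048a23154de8699c307e2f41157abbfcae261": "placeholder article text"}

records = [parse_record(sample_line)]
paired, missing_ids = attach_documents(records, toy_articles)
print(paired[0]["label"], missing_ids)
```

Using a lookup with an explicit list of missing ids, rather than direct indexing, is one possible design choice: it avoids a `KeyError` if an id cannot be found in the article mapping and makes such gaps easy to inspect.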

Provider:
maas
Created:
2025-04-21