LCFO

Source: ModelScope community (updated 2025-11-27, indexed 2025-05-24)
Download link:
https://modelscope.cn/datasets/facebook/LCFO
# LCFO: Long Context and Long Form Output Dataset

This is a dataset for English long-form summarization and summary expansion.

## Dataset Details

### Dataset Description

The dataset consists of 251 long documents (5K words on average) from 10 different domains, together with expert-written summaries of 3 different lengths: 20%, 10%, and 5% of the source document length.

**NOTE: this is an early version of the dataset; it is going to be updated soon.**

**NOTE: the source documents for most domains are not provided; they should be reconstructed. The instructions are to be added soon.**

- **Curated by:** [More Information Needed]
- **Language(s) (NLP):** English
- **License:** CC-BY-NC 4.0 (but the `source` column for the Wikipedia split is licensed under CC-BY-SA 4.0)
- **Paper:** [LCFO: Long context and long form output dataset and benchmarking](https://arxiv.org/abs/2412.08268)

## Uses

### Direct Use

The dataset supports the following use cases:

- Summarization (including gradual summarization)
- Summary expansion (generating a longer document that preserves the essential elements of the summary)
- Reading comprehension with generative question answering
- Evaluation of automatic quality metrics for summarization and summary expansion

Being rather small, it is intended as a test dataset.

### Out-of-Scope Use

The LCFO dataset is not intended to be used as training data.

## Dataset Structure

The dataset consists of 3 tables:

1. `source_data`: description of the source documents, human-generated summaries, their alignment by paragraphs, and abstractive questions about the documents.
2. `summarization_eval`: the summaries of 3 different lengths (20%, 10%, 5%) generated for each document by humans and 3 models (GPT-4, Llama 3.1-70B, Llama 3.1-8B), and their human evaluation.
3. `summary_expansion_eval`: the documents re-generated by 3 models from the 20% summaries (for 4 domains), as well as their human evaluation.

The tables are joinable by two fields present in each of them: `subset` (one of the 10 source datasets from which the documents were drawn) and `item_id` (identifier of the document within a dataset).

The `source_data` table has the following columns:

- `subset (str)`: data source identifier
- `item_id (str)`: document identifier
- `source_text (str)`: the source document text (non-empty only for Wikipedia; needs to be reconstructed for other sources)
- `src_paragraph_bounds (List[List[int]])`: pairs of start and end characters for each "paragraph" in the source document
- `word_count (int)`: number of words in the source document
- `summary_20, summary_10, summary_5 (str)`: human-generated summaries of the corresponding lengths
- `summary_20_paragraphs, summary_10_paragraphs, summary_5_paragraphs (List[str])`: the same human summaries, split into paragraphs
- `summary_20_offsets, summary_10_offsets, summary_5_offsets (List[str])`: indices of the source document paragraphs from which the information in each summary paragraph has been derived

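As a rough illustration of how the `source_data` columns fit together, the sketch below slices a source text into paragraphs using `src_paragraph_bounds` and checks how long each summary is relative to `word_count`. It rests on assumptions the card does not confirm: the end character of each pair is treated as exclusive, length is measured in whitespace-separated words, and the row itself is a toy example invented here. The helper names `split_paragraphs` and `compression_ratios` are hypothetical.

```python
from typing import Dict, List


def split_paragraphs(source_text: str, bounds: List[List[int]]) -> List[str]:
    """Slice the source text into paragraphs from [start, end] character pairs.

    Assumption: `end` is exclusive, as in Python slicing.
    """
    return [source_text[start:end] for start, end in bounds]


def compression_ratios(row: Dict) -> Dict[str, float]:
    """Summary length divided by source length, both counted in words."""
    return {
        key: len(row[key].split()) / row["word_count"]
        for key in ("summary_20", "summary_10", "summary_5")
    }


# Toy row invented for illustration; it is not taken from the dataset.
para_a = " ".join(["alpha"] * 10)
para_b = " ".join(["beta"] * 10)
source_text = para_a + "\n" + para_b
row = {
    "source_text": source_text,
    "src_paragraph_bounds": [[0, len(para_a)], [len(para_a) + 1, len(source_text)]],
    "word_count": 20,
    "summary_20": "alpha alpha beta beta",
    "summary_10": "alpha beta",
    "summary_5": "alpha",
}

paragraphs = split_paragraphs(row["source_text"], row["src_paragraph_bounds"])
ratios = compression_ratios(row)
print(len(paragraphs), ratios)
```

For real rows outside the Wikipedia split, `source_text` is empty until the source documents have been reconstructed, so the paragraph slicing only applies once that text is available.
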
The `summarization_eval` table has the following columns:

- `subset (str)`: data source identifier
- `item_id (str)`: document identifier
- `model (str)`: summarization method identifier (including `human`)
- `summary_20, summary_10, summary_5 (str)`: human- or machine-generated summaries of the corresponding lengths
- `eval_20, eval_10, eval_5 (Dict)`: human evaluation of the corresponding summaries, including the following fields:
  - `s_2a, s_2b, s_2c, s_2d (List[int])`: evaluation of 4 quality aspects: attribution, coverage of the main ideas, conciseness, and readability (on a scale from 0 to 4)
  - `s_3 (List[int])`: evaluation of the overall summarization quality (on a scale from 0 to 10)
  - `qa_1, qa_2, ... (List[str])`: whether the summary answers the corresponding question from `source_data` (`Yes` or `No`)

Each summary is evaluated by several annotators (usually 3); each field represents a list of their responses.

The `summary_expansion_eval` table has the following columns:

- `subset (str)`: data source identifier
- `item_id (str)`: document identifier
- `model (str)`: summarization method identifier
- `inverted_summ_20 (str)`: machine-generated expansion of the 20% summary
- `eval_20, eval_10, eval_5 (Dict)`: human evaluation of the expanded summaries, including the following fields:
  - `r1 (str)`: whether the expanded summary is understandable
  - `r2a_lf, r2b_lf, r2c_lf, r2d_lf, r2e_lf, r2f_lf (int)`: evaluation of 6 quality aspects: coverage of main core ideas, cohesion, richness in details, creativity, non-repetitiveness, and interest (on a scale from 0 to 4)
  - `s_3 (int)`: evaluation of the overall text quality (on a scale from 0 to 10)
  - `qa_1, qa_2, ... (str)`: whether the expanded summary answers the corresponding question from `source_data` (`YES` or `NO`)

Each expansion is evaluated by several annotators (usually 3); each field represents a list of their responses.

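The join on `subset` and `item_id` and the per-annotator lists in the `eval_*` dictionaries can be combined as in the following sketch. It is a minimal illustration under stated assumptions, not the authors' evaluation code: rows are plain Python dictionaries, every score list is non-empty and contains only integers, and the toy rows (including the subset name `toy_subset`) are invented here. The helper names `aggregate_eval` and `join_with_source` are hypothetical.

```python
from statistics import mean
from typing import Dict, List

SCORE_FIELDS = ("s_2a", "s_2b", "s_2c", "s_2d", "s_3")


def aggregate_eval(eval_dict: Dict) -> Dict[str, float]:
    """Average each annotator list and compute the share of 'Yes' answers."""
    scores = {name: mean(eval_dict[name]) for name in SCORE_FIELDS}
    answers = [
        answer
        for key, values in eval_dict.items()
        if key.startswith("qa_")
        for answer in values
    ]
    if answers:
        # Casing is normalized because the card lists both `Yes` and `YES`.
        yes = sum(a.strip().lower() == "yes" for a in answers)
        scores["qa_yes_rate"] = yes / len(answers)
    return scores


def join_with_source(
    source_rows: List[Dict], eval_rows: List[Dict], eval_key: str = "eval_20"
) -> List[Dict]:
    """Attach source metadata to each evaluated summary via (subset, item_id)."""
    by_key = {(r["subset"], r["item_id"]): r for r in source_rows}
    joined = []
    for ev in eval_rows:
        src = by_key.get((ev["subset"], ev["item_id"]))
        if src is None:
            continue  # no matching source document; skip the row
        joined.append(
            {
                "subset": ev["subset"],
                "item_id": ev["item_id"],
                "model": ev["model"],
                "word_count": src["word_count"],
                **aggregate_eval(ev[eval_key]),
            }
        )
    return joined


# Toy rows invented for illustration; they are not taken from the dataset.
source_rows = [{"subset": "toy_subset", "item_id": "1", "word_count": 20}]
toy_eval = {
    "s_2a": [4, 3, 2], "s_2b": [4, 4, 4], "s_2c": [2, 2, 2], "s_2d": [3, 3, 3],
    "s_3": [8, 7, 9],
    "qa_1": ["Yes", "No", "Yes"], "qa_2": ["Yes", "Yes", "Yes"],
}
eval_rows = [
    {"subset": "toy_subset", "item_id": "1", "model": "human", "eval_20": toy_eval},
    {"subset": "toy_subset", "item_id": "2", "model": "human", "eval_20": toy_eval},
]

joined = join_with_source(source_rows, eval_rows)
print(joined)
```

Averaging is only one way to collapse the annotator lists; a median or a majority vote would fit the same structure. The sketch targets the `summarization_eval` layout, where each field is a list; the `summary_expansion_eval` fields use different names and would need their own field list.
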
## Dataset Creation

Please read [the accompanying paper](https://arxiv.org/abs/2412.08268) about the source documents and the data annotation details.

## Reconstructing the source documents

The instructions for reconstructing the source documents will be added soon.

## Citation

```
@article{lcfo,
  author = {Marta R. Costa-jussà and Pierre Andrews and Mariano Coria Megliogli and Joy Chen and Joe Chuang and David Dale and Christophe Ropers and Alex Mourachko and Eduardo Sánchez and Holger Schwenk and Tuan Tran and Arina Turkatenko and Carleigh Wood},
  journal = {ArXiv},
  title = {{LCFO}: Long Context and Long Form Output Dataset and Benchmarking},
  year = {2024},
}
```

Provider: maas
Created: 2025-05-20