jordiclive/scored_summarization_datasets

Name: jordiclive/scored_summarization_datasets
Creator: jordiclive
Published: 2023-02-05 16:14:10
License: No description available

Hugging Face. Updated 2023-02-05, indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jordiclive/scored_summarization_datasets

Resource description:

# Dataset Card for "Scored-Summarization-datasets"

A collection of text summarization datasets geared towards training a multi-purpose text summarizer. Each dataset is a parquet file with the following features.

#### default

- `text`: a `string` feature. The source document.
- `summary`: a `string` feature. The summary of the document.
- `provenance`: a `string` feature. Information about the sub-dataset.
- `t5_text_token_count`: an `int64` feature. The number of tokens the text is encoded in.
- `t5_summary_token_count`: an `int64` feature. The number of tokens the summary is encoded in.
- `contriever_cos`: a `float64` feature. The cosine similarity of the Contriever text embedding and the Contriever summary embedding.

### Sub-datasets

- billsum
- cnn_dailymail/3.0.0
- multixscience
- newsroom
- samsum
- scitldr/AIC
- tldr-challenge
- wikihow
- xsum

Information about the Contriever model can be found here: https://github.com/facebookresearch/contriever.
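The scored columns make it possible to select a subset of examples before training. The following is a minimal sketch of such a filter, not part of the dataset card: the function name `filter_rows`, the threshold values, and the sample rows are all hypothetical and chosen only for illustration. Only the column names come from the card. Rows are assumed to be available as plain dictionaries, for example after loading a parquet file with a library of your choice.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def filter_rows(
    rows: Iterable[Row],
    max_text_tokens: int = 512,      # hypothetical limit, not from the card
    max_summary_tokens: int = 128,   # hypothetical limit, not from the card
    min_cos: float = 0.5,            # hypothetical threshold, not from the card
) -> List[Row]:
    """Keep rows that fit the token budgets and meet the similarity threshold."""
    kept = []
    for row in rows:
        if row["t5_text_token_count"] > max_text_tokens:
            continue  # source document is too long for the assumed budget
        if row["t5_summary_token_count"] > max_summary_tokens:
            continue  # summary is too long for the assumed budget
        if row["contriever_cos"] < min_cos:
            continue  # summary embedding is too far from the text embedding
        kept.append(row)
    return kept


# Made-up rows using the documented column names; the values are illustrative
# and the exact format of the `provenance` string is an assumption.
sample_rows = [
    {"text": "...", "summary": "...", "provenance": "xsum",
     "t5_text_token_count": 300, "t5_summary_token_count": 30,
     "contriever_cos": 0.82},
    {"text": "...", "summary": "...", "provenance": "billsum",
     "t5_text_token_count": 2400, "t5_summary_token_count": 200,
     "contriever_cos": 0.90},
    {"text": "...", "summary": "...", "provenance": "samsum",
     "t5_text_token_count": 120, "t5_summary_token_count": 20,
     "contriever_cos": 0.31},
]

selected = filter_rows(sample_rows)
print([row["provenance"] for row in selected])
```

Suitable thresholds depend on the model's context length and on how strictly summaries should track their source, so the defaults above should be treated as placeholders.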

Provider:

jordiclive

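Summary of original information

Dataset overview

Dataset name

Scored-Summarization-datasets

Dataset purpose

Training a multi-purpose text summarization model.

Dataset structure

Each dataset is stored as a Parquet file with the following features:

Default features

text: string; the source document.
summary: string; the summary of the document.
provenance: string; information about the sub-dataset.
t5_text_token_count: integer; the number of tokens the text is encoded in.
t5_summary_token_count: integer; the number of tokens the summary is encoded in.
contriever_cos: float; the cosine similarity between the Contriever text embedding and the Contriever summary embedding.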
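
For reference, the cosine similarity behind a score such as contriever_cos can be sketched as below. This is a generic illustration of the formula on plain lists of floats, not the code used to build the dataset; the function name `cosine_similarity` is hypothetical, and producing the actual Contriever embeddings is out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return dot(u, v) / (|u| * |v|) for two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for a text embedding and a summary embedding.
text_embedding = [1.0, 0.0]
summary_embedding = [0.0, 1.0]
print(cosine_similarity(text_embedding, summary_embedding))
```

Values close to 1 indicate that the two embeddings point in nearly the same direction, and values near 0 indicate that they are nearly orthogonal.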

Sub-datasets

billsum
cnn_dailymail/3.0.0
multixscience
newsroom
samsum
scitldr/AIC
tldr-challenge
wikihow
xsum
