sent_comp
Source: ModelScope community. Updated 2025-07-11; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/sent_comp
Resource description:
# Dataset Card for Google Sentence Compression
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/google-research-datasets/sentence-compression](https://github.com/google-research-datasets/sentence-compression)
- **Repository:** [https://github.com/google-research-datasets/sentence-compression](https://github.com/google-research-datasets/sentence-compression)
- **Paper:** [https://www.aclweb.org/anthology/D13-1155/](https://www.aclweb.org/anthology/D13-1155/)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.
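The subtree property described in the summary can be spot-checked on a single instance. The following is a minimal sketch, assuming each edge is a dict with `parent_id` and `child_id` keys as listed under Data Fields. The toy instance and the helper names `edge_set` and `is_edge_subset` are made up for illustration and are not part of the corpus or its tooling. Edge containment is only a necessary condition for the subtree relation, and this sketch does not verify that it holds for every record.

```python
def edge_set(edges):
    """Return the set of (parent_id, child_id) pairs from a list of edge dicts."""
    return {(e["parent_id"], e["child_id"]) for e in edges}


def is_edge_subset(instance):
    """Check that every compression edge also appears in the transformed graph."""
    compression_edges = edge_set(instance["compression"]["edge"])
    graph_edges = edge_set(instance["graph"]["edge"])
    return compression_edges <= graph_edges


# Hypothetical toy instance: only the keys needed for the check are present.
toy_instance = {
    "graph": {
        "edge": [
            {"parent_id": 1, "child_id": 0, "label": "nsubj"},
            {"parent_id": 1, "child_id": 2, "label": "dobj"},
            {"parent_id": 1, "child_id": 3, "label": "tmod"},
        ]
    },
    "compression": {
        "edge": [
            {"parent_id": 1, "child_id": 0},
            {"parent_id": 1, "child_id": 2},
        ]
    },
}

print(is_edge_subset(toy_instance))
```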
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English
## Dataset Structure
### Data Instances
Each data instance contains the original sentence in `instance["graph"]["sentence"]` and the compressed sentence in `instance["compression"]["text"]`. Because the dataset was created by pruning dependency connections, the authors also include the dependency tree and the transformed graph of both the original and the compressed sentence.
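As a minimal sketch of the two access paths named above, the snippet below reads the sentence pair from a plain Python dict. The toy instance and the helper name `sentence_pair` are assumptions for illustration; a real record carries many more fields.

```python
def sentence_pair(instance):
    """Return (original sentence, compressed sentence) for one instance."""
    original = instance["graph"]["sentence"]
    compressed = instance["compression"]["text"]
    return original, compressed


# Hypothetical toy instance with only the two fields used here.
toy_instance = {
    "graph": {"sentence": "The quick brown fox jumped over the lazy dog on Monday."},
    "compression": {"text": "The fox jumped over the dog."},
}

original, compressed = sentence_pair(toy_instance)
print(original)
print(compressed)
```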
### Data Fields
Each instance contains the following fields:
- `graph` (`Dict`): the transformation graph/tree used for extracting the compression (a modified version of a dependency tree).
  - This has features similar to a dependency tree (listed below).
- `compression` (`Dict`)
  - `text` (`str`)
  - `edge` (`List`)
- `headline` (`str`): the headline of the original news page.
- `compression_ratio` (`float`): the ratio of the compressed sentence to the original sentence.
- `doc_id` (`str`): URL of the original news page.
- `source_tree` (`Dict`): the original dependency tree (features listed below).
- `compression_untransformed` (`Dict`)
  - `text` (`str`)
  - `edge` (`List`)
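The card does not say in which unit `compression_ratio` is measured. The sketch below shows one plausible reading, a character-length ratio, purely as an assumption; the function name `char_ratio` is hypothetical and the result may not match the stored field.

```python
def char_ratio(original: str, compressed: str) -> float:
    """Character-length ratio of compressed to original text.

    Assumption: the ratio is length-based and measured in characters.
    The dataset card does not confirm this.
    """
    if not original:
        raise ValueError("original sentence must be non-empty")
    return len(compressed) / len(original)


ratio = char_ratio(
    "The quick brown fox jumped over the lazy dog on Monday.",
    "The fox jumped over the dog.",
)
print(f"{ratio:.2f}")
```

Comparing such a value with the stored `compression_ratio` on a few records is a quick way to find out which unit the field actually uses.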
Dependency tree features:
- `id` (`str`)
- `sentence` (`str`)
- `node` (`List`): list of nodes; each node represents a word or word phrase in the tree.
  - `form` (`str`)
  - `type` (`str`): the entity type of a node. Defaults to `""` if it is not an entity.
  - `mid` (`str`)
  - `word` (`List`): list of words the node contains.
    - `id` (`int`)
    - `form` (`str`): the word from the sentence.
    - `stem` (`str`): the stemmed/lemmatized version of the word.
    - `tag` (`str`): dependency tag of the word.
  - `gender` (`int`)
  - `head_word_index` (`int`)
- `edge` (`List`): list of the dependency connections between words.
  - `parent_id` (`int`)
  - `child_id` (`int`)
  - `label` (`str`)
- `entity_mention` (`List`): list of the entities in the sentence.
  - `start` (`int`)
  - `end` (`int`)
  - `head` (`str`)
  - `name` (`str`)
  - `type` (`str`)
  - `mid` (`str`)
  - `is_proper_name_entity` (`bool`)
  - `gender` (`int`)
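To make the tree layout above concrete, here is a minimal sketch that turns the `edge` list into a parent-to-children map and collects word forms from the `node` list. It assumes that `parent_id` and `child_id` refer to word `id` values, which the card does not state. The toy tree, its field values, and the helper names `children_by_parent` and `word_forms` are invented for illustration.

```python
from collections import defaultdict


def children_by_parent(tree):
    """Group child ids under their parent id, preserving edge order."""
    children = defaultdict(list)
    for edge in tree["edge"]:
        children[edge["parent_id"]].append(edge["child_id"])
    return dict(children)


def word_forms(tree):
    """Map each word id to its surface form, across all nodes."""
    return {w["id"]: w["form"] for node in tree["node"] for w in node["word"]}


def make_node(word_id, form, stem, tag):
    """Build a single-word node with placeholder entity fields (toy values)."""
    return {
        "form": form,
        "type": "",
        "mid": "",
        "word": [{"id": word_id, "form": form, "stem": stem, "tag": tag}],
        "gender": -1,
        "head_word_index": 0,
    }


# Hypothetical toy tree for the sentence "Dogs chase cats".
toy_tree = {
    "id": "toy-0",
    "sentence": "Dogs chase cats",
    "node": [
        make_node(0, "Dogs", "dog", "NNS"),
        make_node(1, "chase", "chase", "VBP"),
        make_node(2, "cats", "cat", "NNS"),
    ],
    "edge": [
        {"parent_id": 1, "child_id": 0, "label": "nsubj"},
        {"parent_id": 1, "child_id": 2, "label": "dobj"},
    ],
    "entity_mention": [],
}

forms = word_forms(toy_tree)
for parent, kids in children_by_parent(toy_tree).items():
    print(forms[parent], "->", [forms[k] for k in kids])
```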
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
Thanks to [@mattbui](https://github.com/mattbui) for adding this dataset.
Provider: maas
Created: 2025-07-07



