coarse_discourse
Source: ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/google-research-datasets/coarse_discourse
# Dataset Card for "coarse_discourse"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:** https://github.com/google-research-datasets/coarse-discourse
- **Paper:** [Characterizing Online Discussion Using Coarse Discourse Sequences](https://research.google/pubs/pub46055/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 4.63 MB
- **Size of the generated dataset:** 45.45 MB
- **Total amount of disk used:** 50.08 MB
### Dataset Summary
A large corpus of discourse annotations and relations on ~10K forum threads.
The corpus contains over 9,000 threads, randomly sampled from Reddit, comprising over 100,000 comments that were manually annotated with discourse acts via paid crowdsourcing.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 4.63 MB
- **Size of the generated dataset:** 45.45 MB
- **Total amount of disk used:** 50.08 MB
An example from the 'train' split looks as follows.
```
{
    "annotations": {
        "annotator": ["fc96a15ab87f02dd1998ff55a64f6478", "e9e4b3ab355135fa954badcc06bfccc6", "31ac59c1734c1547d4d0723ff254c247"],
        "link_to_post": ["", "", ""],
        "main_type": ["elaboration", "elaboration", "elaboration"]
    },
    "id_post": "t1_c9b30i1",
    "in_reply_to": "t1_c9b2nyd",
    "is_first_post": false,
    "is_self_post": true,
    "majority_link": "t1_c9b2nyd",
    "majority_type": "elaboration",
    "post_depth": 2,
    "subreddit": "100movies365days",
    "title": "DTX120: #87 - Nashville",
    "url": "https://www.reddit.com/r/100movies365days/comments/1bx6qw/dtx120_87_nashville/"
}
```
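The record above carries one `main_type` label per annotator alongside a single `majority_type`. The card does not document how `majority_type` is computed, so the following is only a minimal sketch under the assumption that it is a strict plurality vote over `annotations.main_type`. The helper name `vote_main_type` is hypothetical and not part of the dataset or any library.

```python
import json
from collections import Counter

# The sample record from this card, abbreviated to the fields used here.
record = json.loads("""
{
  "annotations": {
    "annotator": ["fc96a15ab87f02dd1998ff55a64f6478",
                  "e9e4b3ab355135fa954badcc06bfccc6",
                  "31ac59c1734c1547d4d0723ff254c247"],
    "link_to_post": ["", "", ""],
    "main_type": ["elaboration", "elaboration", "elaboration"]
  },
  "id_post": "t1_c9b30i1",
  "majority_type": "elaboration"
}
""")


def vote_main_type(rec):
    """Hypothetical helper: return the strictly most common label.

    Returns None when there are no labels or the top two labels tie.
    Assumption: plurality voting; the card does not confirm this rule.
    """
    counts = Counter(rec["annotations"]["main_type"]).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


voted = vote_main_type(record)
agrees = voted == record["majority_type"]
```

Running the same check over every row would be one way to see how often the assumed rule reproduces the published `majority_type` values.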
### Data Fields
The data fields are the same among all splits.
#### default
- `title`: a `string` feature.
- `is_self_post`: a `bool` feature.
- `subreddit`: a `string` feature.
- `url`: a `string` feature.
- `majority_link`: a `string` feature.
- `is_first_post`: a `bool` feature.
- `majority_type`: a `string` feature.
- `id_post`: a `string` feature.
- `post_depth`: an `int32` feature.
- `in_reply_to`: a `string` feature.
- `annotations`: a dictionary feature containing:
  - `annotator`: a `list` of `string` features.
  - `link_to_post`: a `list` of `string` features.
  - `main_type`: a `list` of `string` features.
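Since each row names its own ID in `id_post` and its parent in `in_reply_to`, the reply structure of a thread can be rebuilt from these two fields alone. The sketch below is illustrative only: `build_reply_index` is a hypothetical helper, the toy rows other than the last are invented, and treating an empty or missing `in_reply_to` as "no parent" is an assumption the card does not confirm.

```python
from collections import defaultdict


def build_reply_index(rows):
    """Map each parent ID to the IDs of the posts that reply to it."""
    children = defaultdict(list)
    for row in rows:
        parent = row.get("in_reply_to")
        if parent:  # assumption: empty or missing means the row has no parent
            children[parent].append(row["id_post"])
    return dict(children)


# Toy rows. Only the last one matches the sample record in this card;
# the first two (including the "t3_example" ID) are invented for illustration.
rows = [
    {"id_post": "t3_example", "in_reply_to": ""},
    {"id_post": "t1_c9b2nyd", "in_reply_to": "t3_example"},
    {"id_post": "t1_c9b30i1", "in_reply_to": "t1_c9b2nyd"},
]

reply_index = build_reply_index(rows)
```

In practice the rows would first be grouped by `url` so that each index covers a single thread.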
### Data Splits
| name |train |
|-------|-----:|
|default|116357|
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@inproceedings{coarsediscourse,
  title={Characterizing Online Discussion Using Coarse Discourse Sequences},
  author={Zhang, Amy X. and Culbertson, Bryan and Paritosh, Praveen},
  booktitle={Proceedings of the 11th International AAAI Conference on Weblogs and Social Media},
  series={ICWSM '17},
  year={2017},
  location={Montreal, Canada}
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@jplu](https://github.com/jplu) for adding this dataset.
Provider: maas
Created: 2025-07-07



