google-research-datasets/coarse_discourse

Name: google-research-datasets/coarse_discourse
Creator: google-research-datasets
Published: 2024-01-18 15:32:32
License: No description provided

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/google-research-datasets/coarse_discourse


Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
paperswithcode_id: coarse-discourse
pretty_name: Coarse Discourse
dataset_info:
  features:
  - name: title
    dtype: string
  - name: is_self_post
    dtype: bool
  - name: subreddit
    dtype: string
  - name: url
    dtype: string
  - name: majority_link
    dtype: string
  - name: is_first_post
    dtype: bool
  - name: majority_type
    dtype: string
  - name: id_post
    dtype: string
  - name: post_depth
    dtype: int32
  - name: in_reply_to
    dtype: string
  - name: annotations
    sequence:
    - name: annotator
      dtype: string
    - name: link_to_post
      dtype: string
    - name: main_type
      dtype: string
  splits:
  - name: train
    num_bytes: 45097556
    num_examples: 116357
  download_size: 4256575
  dataset_size: 45097556
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "coarse_discourse"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** https://github.com/google-research-datasets/coarse-discourse
- **Paper:** [Characterizing Online Discussion Using Coarse Discourse Sequences](https://research.google/pubs/pub46055/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 4.63 MB
- **Size of the generated dataset:** 45.45 MB
- **Total amount of disk used:** 50.08 MB

### Dataset Summary

A large corpus of discourse annotations and relations on ~10K forum threads.

We collect and release a corpus of over 9,000 threads comprising over 100,000 comments manually annotated via paid crowdsourcing with discourse acts and randomly sampled from the site Reddit.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 4.63 MB
- **Size of the generated dataset:** 45.45 MB
- **Total amount of disk used:** 50.08 MB

An example of 'train' looks as follows.

```
{
  "annotations": {
    "annotator": ["fc96a15ab87f02dd1998ff55a64f6478", "e9e4b3ab355135fa954badcc06bfccc6", "31ac59c1734c1547d4d0723ff254c247"],
    "link_to_post": ["", "", ""],
    "main_type": ["elaboration", "elaboration", "elaboration"]
  },
  "id_post": "t1_c9b30i1",
  "in_reply_to": "t1_c9b2nyd",
  "is_first_post": false,
  "is_self_post": true,
  "majority_link": "t1_c9b2nyd",
  "majority_type": "elaboration",
  "post_depth": 2,
  "subreddit": "100movies365days",
  "title": "DTX120: #87 - Nashville",
  "url": "https://www.reddit.com/r/100movies365days/comments/1bx6qw/dtx120_87_nashville/"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `title`: a `string` feature.
- `is_self_post`: a `bool` feature.
- `subreddit`: a `string` feature.
- `url`: a `string` feature.
- `majority_link`: a `string` feature.
- `is_first_post`: a `bool` feature.
- `majority_type`: a `string` feature.
- `id_post`: a `string` feature.
- `post_depth`: an `int32` feature.
- `in_reply_to`: a `string` feature.
- `annotations`: a dictionary feature containing:
  - `annotator`: a `string` feature.
  - `link_to_post`: a `string` feature.
  - `main_type`: a `string` feature.

### Data Splits

| name  | train|
|-------|-----:|
|default|116357|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@inproceedings{coarsediscourse,
  title={Characterizing Online Discussion Using Coarse Discourse Sequences},
  author={Zhang, Amy X. and Culbertson, Bryan and Paritosh, Praveen},
  booktitle={Proceedings of the 11th International AAAI Conference on Weblogs and Social Media},
  series={ICWSM '17},
  year={2017},
  location = {Montreal, Canada}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@jplu](https://github.com/jplu) for adding this dataset.

Provided by:

google-research-datasets

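Summary of original information

Dataset overview

Dataset summary

Coarse Discourse is a large corpus of roughly 10,000 forum threads comprising more than 100,000 comments. The comments were randomly sampled from Reddit and manually annotated with discourse acts through paid crowdsourcing.

Supported tasks and leaderboards

The dataset supports text classification, specifically multi-class classification.

Languages

The dataset is in English.

Dataset structure

Data instances

An example from the train split looks as follows: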

{
  "annotations": {
    "annotator": ["fc96a15ab87f02dd1998ff55a64f6478", "e9e4b3ab355135fa954badcc06bfccc6", "31ac59c1734c1547d4d0723ff254c247"],
    "link_to_post": ["", "", ""],
    "main_type": ["elaboration", "elaboration", "elaboration"]
  },
  "id_post": "t1_c9b30i1",
  "in_reply_to": "t1_c9b2nyd",
  "is_first_post": false,
  "is_self_post": true,
  "majority_link": "t1_c9b2nyd",
  "majority_type": "elaboration",
  "post_depth": 2,
  "subreddit": "100movies365days",
  "title": "DTX120: #87 - Nashville",
  "url": "https://www.reddit.com/r/100movies365days/comments/1bx6qw/dtx120_87_nashville/"
}

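Data fields

The dataset contains the following fields:

title: string; the title of the thread.
is_self_post: boolean; whether the thread is a self post.
subreddit: string; the name of the subreddit.
url: string; the URL of the thread.
majority_link: string; the majority link.
is_first_post: boolean; whether this is the first post of the thread.
majority_type: string; the majority type.
id_post: string; the post ID.
post_depth: integer; the depth of the post within the thread.
in_reply_to: string; the ID of the post being replied to.
annotations: dictionary with the following sub-fields:
- annotator: string; the annotator.
- link_to_post: string; the link to a post.
- main_type: string; the main type.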
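The `annotations` field is stored column-wise: one list per sub-field, aligned by position. A minimal sketch of how it could be regrouped into one row per annotator, using a subset of the example record from the dataset card, is given here. The helper names `annotation_rows` and `top_type` are hypothetical, and the source does not confirm that the published `majority_type` is a simple plurality vote, so the comparison at the end is only an illustration.

```python
from collections import Counter

# Subset of the example record shown in the dataset card (fields copied verbatim).
record = {
    "annotations": {
        "annotator": [
            "fc96a15ab87f02dd1998ff55a64f6478",
            "e9e4b3ab355135fa954badcc06bfccc6",
            "31ac59c1734c1547d4d0723ff254c247",
        ],
        "link_to_post": ["", "", ""],
        "main_type": ["elaboration", "elaboration", "elaboration"],
    },
    "id_post": "t1_c9b30i1",
    "in_reply_to": "t1_c9b2nyd",
    "majority_link": "t1_c9b2nyd",
    "majority_type": "elaboration",
    "post_depth": 2,
}


def annotation_rows(rec):
    """Regroup the column-wise `annotations` dict into one dict per annotator."""
    ann = rec["annotations"]
    keys = list(ann)
    return [dict(zip(keys, values)) for values in zip(*(ann[k] for k in keys))]


def top_type(rec):
    """Return the most frequent `main_type` and its vote count (hypothetical helper)."""
    return Counter(rec["annotations"]["main_type"]).most_common(1)[0]


rows = annotation_rows(record)
label, votes = top_type(record)
print(len(rows), label, votes)  # prints: 3 elaboration 3
```

For this record the plurality label matches the stored `majority_type`; whether that holds for every record would need to be checked against the full data.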

Data splits

The dataset has a single train split containing 116,357 examples.
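As a rough sketch, the train split could be loaded with the Hugging Face `datasets` library and summarized by label. The repository id is taken from the download URL on this page. The helper names `load_train`, `majority_type_counts`, and `summarize_train` are hypothetical, and the loading step assumes that `datasets` is installed and the Hub is reachable.

```python
from collections import Counter


def load_train():
    """Load the train split from the Hugging Face Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the counting helper works offline

    return load_dataset("google-research-datasets/coarse_discourse", split="train")


def majority_type_counts(rows):
    """Count how often each `majority_type` label occurs in an iterable of records."""
    return Counter(row["majority_type"] for row in rows)


def summarize_train():
    """Load the split and return its size plus the label counts (not called here)."""
    train = load_train()
    return len(train), majority_type_counts(train)


# Offline demonstration on the single label from the card's example record.
print(majority_type_counts([{"majority_type": "elaboration"}]))
```

Calling `summarize_train()` should return a size that agrees with the 116,357 examples listed above if the hosted data matches the card; the label distribution itself is not stated in the source.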

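Dataset creation

Source data

The source data is original.

Annotation process

The annotations were produced through crowdsourcing.

Licensing information

The dataset is licensed under CC-BY-4.0.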

Citation information

@inproceedings{coarsediscourse,
  title={Characterizing Online Discussion Using Coarse Discourse Sequences},
  author={Zhang, Amy X. and Culbertson, Bryan and Paritosh, Praveen},
  booktitle={Proceedings of the 11th International AAAI Conference on Weblogs and Social Media},
  series={ICWSM '17},
  year={2017},
  location = {Montreal, Canada}
}

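Collected summary

Dataset introduction

Background and challenges

Background overview

Coarse Discourse is a large corpus of about 10,000 Reddit threads and more than 100,000 comments, annotated with discourse acts through paid crowdsourcing. It is suited to multi-class text classification and is mainly used to study the discourse structure of online discussions.

The content above was collected and summarized by 遇见数据集.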