A Novel Corpus of Discourse Structure in Humans and Computers

Name: A Novel Corpus of Discourse Structure in Humans and Computers
Creator: Brown University
Published: 2021-11-11 04:56:08
License: Not specified

Source: arXiv. Updated 2021-11-11; indexed 2024-06-21.

Download link:

https://github.com/sfeucht/annotation_evaluation

Description:

This dataset, titled *A Novel Corpus of Discourse Structure in Humans and Computers*, was created by Brown University. It comprises 445 human-written and computer-generated documents totaling approximately 27,000 sentences, and is intended for comparing semantic sentence types and coherence relations between artificial and natural discourse. The corpus covers formal and informal discourse on cannabis legalization from 2008 to 2019; the computer-generated documents were produced with fine-tuned GPT-2 and GPT-3 models. During its creation, researchers selected relevant content from Common Crawl and Reddit forums through extensive regular expression filtering and manual inspection. The dataset is primarily used for discourse analysis in text generation, with the aim of addressing the quality gap between computer-generated and human-written text.

Provided by:

Brown University

Created:

2021-11-11
