symanto/autextification2023
Source: Hugging Face. Updated 2024-06-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/symanto/autextification2023
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-classification
language:
- en
- es
pretty_name: AuTexTification 2023
size_categories:
- 10K<n<100K
source_datasets:
- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua
---
# Dataset Card for AuTexTification 2023
## Dataset Description
- **Homepage:** https://sites.google.com/view/autextification
- **Repository:** https://github.com/autextification/AuTexTification-Overview
- **Paper:** https://arxiv.org/abs/2309.11285
### Dataset Summary
AuTexTification 2023 @IberLEF2023 is a shared task focusing on Machine-Generated Text Detection and Model Attribution in English and Spanish.
The dataset includes human and generated text in 5 domains: tweets, reviews, how-to articles, news, and legal documents.
The generations are obtained using six language models: BLOOM-1B7, BLOOM-3B, BLOOM-7B1, Babbage, Curie, and text-davinci-003.
For more information, please refer to our overview paper: https://arxiv.org/abs/2309.11285
### Supported Tasks and Leaderboards
- Machine-Generated Text Detection
- Model Attribution
### Languages
English and Spanish
## Dataset Structure
### Data Instances
163k instances of labeled text in total.
### Data Fields
For MGT Detection:
- id
- prompt
- text
- label
- model
- domain
For Model Attribution:
- id
- prompt
- text
- label
- domain
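The two field lists above differ only in the `model` field. A minimal sketch of checking a record against either list follows. The field names come from this card; the helper `missing_fields`, the example record, and all of its values are hypothetical and are not taken from the dataset.

```python
from typing import Any, List, Mapping, Sequence

# Field names as listed in this card; value types are not specified here.
DETECTION_FIELDS = ("id", "prompt", "text", "label", "model", "domain")
ATTRIBUTION_FIELDS = ("id", "prompt", "text", "label", "domain")


def missing_fields(record: Mapping[str, Any], expected: Sequence[str]) -> List[str]:
    """Return the expected field names that are absent from the record."""
    return [name for name in expected if name not in record]


# Hypothetical record, invented purely for illustration.
example = {
    "id": 0,
    "prompt": "placeholder prompt",
    "text": "placeholder text",
    "label": "placeholder label",
    "domain": "tweets",
}

print(missing_fields(example, ATTRIBUTION_FIELDS))
print(missing_fields(example, DETECTION_FIELDS))
```

A check like this can help catch a record from one task being passed to code written for the other.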
### Data Splits
- MGT Detection Data:

| Language | Split | Human | Generated | Total |
| -------- | ----- | ------ | --------- | ------ |
| English | Train | 17,046 | 16,799 | 33,845 |
| | Test | 10,642 | 11,190 | 21,832 |
| | Total | 27,688 | 27,989 | 55,677 |
| Spanish | Train | 15,787 | 16,275 | 32,062 |
| | Test | 11,209 | 8,920 | 20,129 |
| | Total | 26,996 | 25,195 | 52,191 |

- Model Attribution Data (instances per generating model; the first three columns are BLOOM models, the next three are GPT models):

| Language | Split | BLOOM-1B7 | BLOOM-3B | BLOOM-7B1 | babbage | curie | text-davinci-003 | Total |
| -------- | ----- | --------- | -------- | --------- | ------- | ----- | ---------------- | ------ |
| English | Train | 3,562 | 3,648 | 3,687 | 3,870 | 3,822 | 3,827 | 22,416 |
| | Test | 887 | 875 | 952 | 924 | 979 | 988 | 5,605 |
| | Total | 4,449 | 4,523 | 4,639 | 4,794 | 4,801 | 4,815 | 28,021 |
| Spanish | Train | 3,422 | 3,514 | 3,575 | 3,788 | 3,770 | 3,866 | 21,935 |
| | Test | 870 | 867 | 878 | 946 | 1,004 | 917 | 5,482 |
| | Total | 4,292 | 4,381 | 4,453 | 4,734 | 4,774 | 4,783 | 27,417 |
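The totals in the tables above follow from the per-split counts, and a short check makes the arithmetic explicit. This is a sketch only: the train and test counts are copied from the tables, while the dictionary layout and the helper `language_totals` are illustrative choices, not part of the dataset.

```python
# Train/test counts copied from the MGT Detection table.
DETECTION_COUNTS = {
    "English": {
        "train": {"human": 17046, "generated": 16799},
        "test": {"human": 10642, "generated": 11190},
    },
    "Spanish": {
        "train": {"human": 15787, "generated": 16275},
        "test": {"human": 11209, "generated": 8920},
    },
}

# Train/test row totals copied from the Model Attribution table.
ATTRIBUTION_COUNTS = {
    "English": {"train": 22416, "test": 5605},
    "Spanish": {"train": 21935, "test": 5482},
}


def language_totals(splits):
    """Sum human and generated counts over all splits of one language."""
    human = sum(split["human"] for split in splits.values())
    generated = sum(split["generated"] for split in splits.values())
    return {"human": human, "generated": generated, "total": human + generated}


detection_totals = {lang: language_totals(s) for lang, s in DETECTION_COUNTS.items()}
attribution_totals = {lang: sum(s.values()) for lang, s in ATTRIBUTION_COUNTS.items()}

# Overall number of labeled instances across both tasks and both languages.
grand_total = sum(t["total"] for t in detection_totals.values()) + sum(
    attribution_totals.values()
)

for lang, totals in detection_totals.items():
    print(lang, totals)
print(attribution_totals)
print(grand_total)
```

The grand total comes to roughly 163k, matching the figure given under Data Instances.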
## Dataset Creation
### Curation Rationale
Human data was gathered and used to prompt language models, obtaining generated data.
Specific decisions were made to keep the data gathering process unbiased, so that both the human and the generated texts are probable continuations of a given prefix.
For more detailed information, please refer to the overview paper: https://arxiv.org/abs/2309.11285
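One way to picture "probable continuations of a given prefix" is to cut a human text into a prefix, which serves as the prompt, and the remainder, which serves as the human continuation. The sketch below only illustrates that idea. The function `split_prefix`, the whitespace tokenization, and the cut point are assumptions; the procedure actually used is described in the overview paper.

```python
from typing import Tuple


def split_prefix(text: str, prefix_words: int) -> Tuple[str, str]:
    """Split text into a prompt prefix and its continuation.

    Hypothetical helper: uses whitespace tokenization and a fixed
    word count, neither of which is confirmed by the dataset card.
    """
    words = text.split()
    prefix = " ".join(words[:prefix_words])
    continuation = " ".join(words[prefix_words:])
    return prefix, continuation


# Invented example sentence, not taken from the dataset.
prompt, human_continuation = split_prefix(
    "The committee approved the proposal after a short debate", 4
)
print(prompt)
print(human_continuation)
```

Under this reading, a language model given the same `prompt` would produce the generated counterpart of `human_continuation`.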
### Source Data
The following datasets were used as human text:
- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua
These datasets were only used as sources of human text. The labels of the datasets were not employed in any manner.
### Licensing Information
CC-BY-NC-SA-4.0
### Citation Information
```
@inproceedings{autextification2023,
title = "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains",
author = "Sarvazyan, Areg Mikael and
Gonz{\'a}lez, Jos{\'e} {\'A}ngel and
Franco-Salvador, Marc and
Rangel, Francisco and
Chulvi, Berta and
Rosso, Paolo",
month = sep,
year = "2023",
address = "Jaén, Spain",
booktitle = "Procesamiento del Lenguaje Natural",
}
```
Provider:
symanto
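Summary of original information
Dataset overview
- Name: AuTexTification 2023
- Task categories:
  - Text classification
- Languages:
  - English
  - Spanish
- Size: 10K<n<100K
- License: CC-BY-NC-SA-4.0
Dataset details
- Summary: AuTexTification 2023 is a shared task on machine-generated text detection and model attribution in English and Spanish. The dataset contains text from five domains: tweets, reviews, how-to articles, news, and legal documents.
- Supported tasks:
  - Machine-generated text detection
  - Model attribution
- Data structure:
  - Data instances: about 163,000 labeled text instances in total.
  - Data fields:
    - Machine-generated text detection: id, prompt, text, label, model, domain
    - Model attribution: id, prompt, text, label, domain
  - Data splits:
    - Machine-generated text detection:
      - English: 33,845 training instances, 21,832 test instances
      - Spanish: 32,062 training instances, 20,129 test instances
    - Model attribution:
      - English: 22,416 training instances, 5,605 test instances
      - Spanish: 21,935 training instances, 5,482 test instances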
Dataset creation
- Source data:
  - Datasets used as sources of human text: multi_eurlex, xsum, csebuetnlp/xlsum, mlsum, amazon_polarity, https://sinai.ujaen.es/investigacion/recursos/coah, https://sinai.ujaen.es/investigacion/recursos/coar, carblacac/twitter-sentiment-analysis, cardiffnlp/tweet_sentiment_multilingual, https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa, wiki_lingua
- Licensing information: CC-BY-NC-SA-4.0
- Citation information: Sarvazyan et al., "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains", Procesamiento del Lenguaje Natural, September 2023. The full BibTeX entry appears under Citation Information above.



