symanto/autextification2023

Name: symanto/autextification2023
Creator: symanto
Published: 2024-06-14 13:16:52
License: CC-BY-NC-SA-4.0

Hugging Face · Updated 2024-06-14 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/symanto/autextification2023


Resource description:

---
license: cc-by-nc-sa-4.0
task_categories:
- text-classification
language:
- en
- es
pretty_name: AuTexTification 2023
size_categories:
- 10K<n<100K
source_datasets:
- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua
---

# Dataset Card for AuTexTification 2023

## Dataset Description

- **Homepage:** https://sites.google.com/view/autextification
- **Repository:** https://github.com/autextification/AuTexTification-Overview
- **Paper:** https://arxiv.org/abs/2309.11285

### Dataset Summary

AuTexTification 2023 @IberLEF2023 is a shared task focusing on Machine-Generated Text (MGT) Detection and Model Attribution in English and Spanish. The dataset includes human and generated text in 5 domains: tweets, reviews, how-to articles, news, and legal documents. The generations are obtained using six language models: BLOOM-1B7, BLOOM-3B, BLOOM-7B1, babbage, curie, and text-davinci-003. For more information, please refer to our overview paper: https://arxiv.org/abs/2309.11285

### Supported Tasks and Leaderboards

- Machine-Generated Text Detection
- Model Attribution

### Languages

English and Spanish

## Dataset Structure

### Data Instances

163k instances of labeled text in total.

### Data Fields

For MGT Detection:

- id
- prompt
- text
- label
- model
- domain

For Model Attribution:

- id
- prompt
- text
- label
- domain

### Data Splits

- MGT Detection Data:

| Language | Split | Human  | Generated | Total  |
| -------- | ----- | ------ | --------- | ------ |
| English  | Train | 17,046 | 16,799    | 33,845 |
| English  | Test  | 10,642 | 11,190    | 21,832 |
| English  | Total | 27,688 | 27,989    | 55,677 |
| Spanish  | Train | 15,787 | 16,275    | 32,062 |
| Spanish  | Test  | 11,209 | 8,920     | 20,129 |
| Spanish  | Total | 26,996 | 25,195    | 52,191 |

- Model Attribution Data:

| Language | Split | BLOOM 1B7 | BLOOM 3B | BLOOM 7B | GPT babbage | GPT curie | GPT text-davinci-003 | Total  |
| -------- | ----- | --------- | -------- | -------- | ----------- | --------- | -------------------- | ------ |
| English  | Train | 3,562     | 3,648    | 3,687    | 3,870       | 3,822     | 3,827                | 22,416 |
| English  | Test  | 887       | 875      | 952      | 924         | 979       | 988                  | 5,605  |
| English  | Total | 4,449     | 4,523    | 4,639    | 4,794       | 4,801     | 4,815                | 28,021 |
| Spanish  | Train | 3,422     | 3,514    | 3,575    | 3,788       | 3,770     | 3,866                | 21,935 |
| Spanish  | Test  | 870       | 867      | 878      | 946         | 1,004     | 917                  | 5,482  |
| Spanish  | Total | 4,292     | 4,381    | 4,453    | 4,734       | 4,774     | 4,783                | 27,417 |

## Dataset Creation

### Curation Rationale

Human data was gathered and used to prompt language models, obtaining generated data. Specific decisions were made to ensure the data gathering process was carried out in an unbiased manner, making the final human and generated texts probable continuations of a given prefix.
For more detailed information, please refer to the overview paper: https://arxiv.org/abs/2309.11285

### Source Data

The following datasets were used as human text:

- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua

These datasets were only used as sources of human text. The labels of the datasets were not employed in any manner.

### Licensing Information

CC-BY-NC-SA-4.0

### Citation Information

```
@inproceedings{autextification2023,
    title = "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains",
    author = "Sarvazyan, Areg Mikael and
      Gonz{\'a}lez, Jos{\'e} {\'A}ngel and
      Franco-Salvador, Marc and
      Rangel, Francisco and
      Chulvi, Berta and
      Rosso, Paolo",
    month = sep,
    year = "2023",
    address = "Jaén, Spain",
    booktitle = "Procesamiento del Lenguaje Natural",
}
```
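
### Loading Sketch

A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository exposes one configuration per task and language named `<task>_<lang>` (for example `detection_en`). The card does not state the configuration names, so treat them, along with the helper names `config_name` and `load_split`, as assumptions to check against the repository before relying on them.

```python
REPO_ID = "symanto/autextification2023"  # repository id taken from the card

# ASSUMPTION: configurations are named "<task>_<lang>"; the card does not say so.
TASKS = ("detection", "attribution")
LANGUAGES = ("en", "es")


def config_name(task: str, lang: str) -> str:
    """Build the assumed configuration name for a task/language pair."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGUAGES}")
    return f"{task}_{lang}"


def load_split(task: str, lang: str, split: str = "train"):
    """Download one split from the Hub (needs network access)."""
    # Imported lazily so that config_name() stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(task, lang), split=split)
```

If the assumed names are right, `load_split("detection", "en")` would return the English MGT Detection training split, and `split="test"` would select the test split listed in the tables above.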

Provider:

symanto

Summary of original information

Dataset overview

Name: AuTexTification 2023
Task categories:
- Text classification
Languages:
- English
- Spanish
Size: 10K<n<100K
License: CC-BY-NC-SA-4.0

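Dataset details

Summary: AuTexTification 2023 is a shared task focused on machine-generated text detection and model attribution in English and Spanish. The dataset contains text from 5 domains: tweets, reviews, how-to articles, news, and legal documents.
Supported tasks:
- Machine-generated text detection
- Model attribution
Data structure:
- Data instances: about 163,000 labeled text instances in total.
- Data fields:
  - Machine-generated text detection: id, prompt, text, label, model, domain
  - Model attribution: id, prompt, text, label, domain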
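
The two field lists differ only in the `model` field, so a quick schema check can tell which task a record belongs to. The sketch below is illustrative: the field names come from the dataset card, while the helper `field_problems` and the placeholder record are hypothetical.

```python
# Field names copied from the dataset card.
DETECTION_FIELDS = frozenset({"id", "prompt", "text", "label", "model", "domain"})
ATTRIBUTION_FIELDS = frozenset({"id", "prompt", "text", "label", "domain"})

EXPECTED_FIELDS = {
    "detection": DETECTION_FIELDS,
    "attribution": ATTRIBUTION_FIELDS,
}


def field_problems(record: dict, task: str) -> dict:
    """Report fields that are missing from, or unexpected in, a record."""
    expected = EXPECTED_FIELDS[task]
    present = set(record)
    return {
        "missing": sorted(expected - present),
        "unexpected": sorted(present - expected),
    }


# Hypothetical record: every value is a placeholder, not real data.
example = {
    "id": 0,
    "prompt": "<prompt>",
    "text": "<text>",
    "label": "<label>",
    "domain": "<domain>",
}

print(field_problems(example, "attribution"))
print(field_problems(example, "detection"))
```

The placeholder record satisfies the model attribution field list and is reported as missing `model` when checked against the detection list.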
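Data splits:
- Machine-generated text detection:
  - English: 33,845 training instances, 21,832 test instances
  - Spanish: 32,062 training instances, 20,129 test instances
- Model attribution:
  - English: 22,416 training instances, 5,605 test instances
  - Spanish: 21,935 training instances, 5,482 test instances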
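
The split totals can be re-derived from the per-class and per-model counts in the Data Splits tables of the dataset card. The sketch below does that arithmetic; the counts are copied from those tables, and the variable and function names are illustrative.

```python
# Counts copied from the Data Splits tables of the dataset card.
DETECTION = {
    ("en", "train"): {"human": 17046, "generated": 16799},
    ("en", "test"): {"human": 10642, "generated": 11190},
    ("es", "train"): {"human": 15787, "generated": 16275},
    ("es", "test"): {"human": 11209, "generated": 8920},
}

# Per-model counts in table order:
# BLOOM 1B7, BLOOM 3B, BLOOM 7B, babbage, curie, text-davinci-003.
ATTRIBUTION = {
    ("en", "train"): [3562, 3648, 3687, 3870, 3822, 3827],
    ("en", "test"): [887, 875, 952, 924, 979, 988],
    ("es", "train"): [3422, 3514, 3575, 3788, 3770, 3866],
    ("es", "test"): [870, 867, 878, 946, 1004, 917],
}

detection_totals = {key: sum(counts.values()) for key, counts in DETECTION.items()}
attribution_totals = {key: sum(counts) for key, counts in ATTRIBUTION.items()}
grand_total = sum(detection_totals.values()) + sum(attribution_totals.values())


def generated_share(lang: str, split: str) -> float:
    """Fraction of machine-generated texts in one detection split."""
    counts = DETECTION[(lang, split)]
    return counts["generated"] / (counts["human"] + counts["generated"])


print(detection_totals)
print(attribution_totals)
print(grand_total)
```

Summing all eight splits gives 163,306 instances, which matches the card's "163k instances" figure. `generated_share` also shows that the detection splits are close to balanced, with the Spanish test split leaning furthest toward human text.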

Dataset creation

Source data:
- Datasets used as sources of human text: multi_eurlex, xsum, csebuetnlp/xlsum, mlsum, amazon_polarity, https://sinai.ujaen.es/investigacion/recursos/coah, https://sinai.ujaen.es/investigacion/recursos/coar, carblacac/twitter-sentiment-analysis, cardiffnlp/tweet_sentiment_multilingual, https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa, wiki_lingua
License information: CC-BY-NC-SA-4.0
Citation information:

@inproceedings{autextification2023, title = "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains", author = "Sarvazyan, Areg Mikael and Gonz{\'a}lez, Jos{\'e} {\'A}ngel and Franco-Salvador, Marc and Rangel, Francisco and Chulvi, Berta and Rosso, Paolo", month = sep, year = "2023", address = "Jaén, Spain", booktitle = "Procesamiento del Lenguaje Natural", }
