
symanto/autextification2023

Source: Hugging Face. Updated 2024-06-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/symanto/autextification2023
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-classification
language:
- en
- es
pretty_name: AuTexTification 2023
size_categories:
- 10K<n<100K
source_datasets:
- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua
---

# Dataset Card for AuTexTification 2023

## Dataset Description

- **Homepage:** https://sites.google.com/view/autextification
- **Repository:** https://github.com/autextification/AuTexTification-Overview
- **Paper:** https://arxiv.org/abs/2309.11285

### Dataset Summary

AuTexTification 2023 @IberLEF2023 is a shared task focusing on Machine-Generated Text Detection and Model Attribution in English and Spanish. The dataset includes human and generated text in 5 domains: tweets, reviews, how-to articles, news, and legal documents. The generations are obtained using six language models: BLOOM-1B7, BLOOM-3B, BLOOM-7B1, Babbage, Curie, and text-davinci-003. For more information, please refer to our overview paper: https://arxiv.org/abs/2309.11285

### Supported Tasks and Leaderboards

- Machine-Generated Text Detection
- Model Attribution

### Languages

English and Spanish

## Dataset Structure

### Data Instances

163k instances of labeled text in total.
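As a starting point for working with the instances, the sketch below loads one subset with the Hugging Face `datasets` library. The repository id `symanto/autextification2023` is the one this page lists. The configuration naming scheme (`detection_en`, `attribution_es`, and so on) and the helper names `config_name` and `load_subset` are assumptions made for illustration and should be checked against the dataset page. Depending on the `datasets` version and how the repository is packaged, the call may need adjusting.

```python
TASKS = ("detection", "attribution")
LANGUAGES = ("en", "es")


def config_name(task: str, lang: str) -> str:
    """Build a configuration name such as 'detection_en' (assumed naming scheme)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language: {lang!r}")
    return f"{task}_{lang}"


def load_subset(task: str, lang: str, split: str = "train"):
    """Download one split; needs `pip install datasets` and network access."""
    # Imported lazily so config_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "symanto/autextification2023",  # repository id from this page
        config_name(task, lang),        # assumed configuration name
        split=split,
    )
```

Calling `load_subset("detection", "en")` would then return the English detection training split, if the assumed configuration name matches the repository.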
### Data Fields

For MGT Detection:

- id
- prompt
- text
- label
- model
- domain

For Model Attribution:

- id
- prompt
- text
- label
- domain

### Data Splits

- MGT Detection Data:

| Language | Split | Human  | Generated | Total  |
| -------- | ----- | ------ | --------- | ------ |
| English  | Train | 17,046 | 16,799    | 33,845 |
|          | Test  | 10,642 | 11,190    | 21,832 |
|          | Total | 27,688 | 27,989    | 55,677 |
| Spanish  | Train | 15,787 | 16,275    | 32,062 |
|          | Test  | 11,209 | 8,920     | 20,129 |
|          | Total | 26,996 | 25,195    | 52,191 |

- Model Attribution Data:

| Language | Split | BLOOM 1B7 | BLOOM 3B | BLOOM 7B1 | GPT babbage | GPT curie | GPT text-davinci-003 | Total  |
| -------- | ----- | --------- | -------- | --------- | ----------- | --------- | -------------------- | ------ |
| English  | Train | 3,562     | 3,648    | 3,687     | 3,870       | 3,822     | 3,827                | 22,416 |
|          | Test  | 887       | 875      | 952       | 924         | 979       | 988                  | 5,605  |
|          | Total | 4,449     | 4,523    | 4,639     | 4,794       | 4,801     | 4,815                | 28,021 |
| Spanish  | Train | 3,422     | 3,514    | 3,575     | 3,788       | 3,770     | 3,866                | 21,935 |
|          | Test  | 870       | 867      | 878       | 946         | 1,004     | 917                  | 5,482  |
|          | Total | 4,292     | 4,381    | 4,453     | 4,734       | 4,774     | 4,783                | 27,417 |

## Dataset Creation

### Curation Rationale

Human data was gathered and used to prompt language models, obtaining generated data. Specific decisions were made to ensure the data gathering process was carried out in an unbiased manner, making the final human and generated texts probable continuations of a given prefix.
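The prefix-and-continuation idea in the rationale can be illustrated with a small sketch. It is not the authors' pipeline: the whitespace tokenization, the `prefix_words` cut-off, and the function name `split_prefix` are assumptions made here for illustration. The human continuation is whatever followed the prefix in the source text; a language model prompted with the same prefix would supply the generated counterpart.

```python
def split_prefix(text: str, prefix_words: int = 10) -> tuple[str, str]:
    """Split text into (prefix, continuation) on whitespace tokens."""
    if prefix_words < 1:
        raise ValueError("prefix_words must be at least 1")
    words = text.split()
    prefix = " ".join(words[:prefix_words])
    continuation = " ".join(words[prefix_words:])
    return prefix, continuation


sample = "The committee approved the new regulation after a long debate on Tuesday"
prompt, human_continuation = split_prefix(sample, prefix_words=5)

# `prompt` is what a model would be asked to continue;
# `human_continuation` is the human-written counterpart.
print(prompt)              # The committee approved the new
print(human_continuation)  # regulation after a long debate on Tuesday
```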
For more detailed information, please refer to the overview paper: https://arxiv.org/abs/2309.11285

### Source Data

The following datasets were used as human text:

- multi_eurlex
- xsum
- csebuetnlp/xlsum
- mlsum
- amazon_polarity
- https://sinai.ujaen.es/investigacion/recursos/coah
- https://sinai.ujaen.es/investigacion/recursos/coar
- carblacac/twitter-sentiment-analysis
- cardiffnlp/tweet_sentiment_multilingual
- https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa
- wiki_lingua

These datasets were only used as sources of human text. The labels of the datasets were not employed in any manner.

### Licensing Information

CC-BY-NC-SA-4.0

### Citation Information

```
@inproceedings{autextification2023,
    title = "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains",
    author = "Sarvazyan, Areg Mikael and Gonz{\'a}lez, Jos{\'e} {\'A}ngel and Franco-Salvador, Marc and Rangel, Francisco and Chulvi, Berta and Rosso, Paolo",
    month = sep,
    year = "2023",
    address = "Jaén, Spain",
    booktitle = "Procesamiento del Lenguaje Natural",
}
```
Provider:
symanto
Summary of original information

Dataset overview

  • Name: AuTexTification 2023
  • Task categories:
    • Text classification
  • Languages:
    • English
    • Spanish
  • Size: 10K<n<100K
  • License: CC-BY-NC-SA-4.0

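Dataset details

  • Summary: AuTexTification 2023 is a shared task focused on machine-generated text detection and model attribution in English and Spanish. The dataset contains text from 5 domains: tweets, reviews, how-to articles, news, and legal documents.
  • Supported tasks:
    • Machine-generated text detection
    • Model attribution
  • Data structure:
    • Data instances: 163k labeled text instances in total.
    • Data fields:
      • Machine-generated text detection: id, prompt, text, label, model, domain
      • Model attribution: id, prompt, text, label, domain
  • Data splits:
    • Machine-generated text detection:
      • English: 33,845 training instances, 21,832 test instances
      • Spanish: 32,062 training instances, 20,129 test instances
    • Model attribution:
      • English: 22,416 training instances, 5,605 test instances
      • Spanish: 21,935 training instances, 5,482 test instances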
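
The split sizes can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the train and test counts from the Data Splits tables of the dataset card and sums them; the dictionary layout and the names `SPLITS`, `subset_total`, and `grand_total` are illustrative choices, not part of the dataset.

```python
# Train/test sizes copied from the Data Splits tables of the dataset card.
SPLITS = {
    ("detection", "en"): {"train": 33_845, "test": 21_832},
    ("detection", "es"): {"train": 32_062, "test": 20_129},
    ("attribution", "en"): {"train": 22_416, "test": 5_605},
    ("attribution", "es"): {"train": 21_935, "test": 5_482},
}


def subset_total(task: str, lang: str) -> int:
    """Number of instances in one task/language subset (train + test)."""
    return sum(SPLITS[(task, lang)].values())


def grand_total() -> int:
    """Number of instances across all tasks and languages."""
    return sum(subset_total(task, lang) for task, lang in SPLITS)


for task, lang in SPLITS:
    print(f"{task:<12}{lang}  {subset_total(task, lang):>7,}")
print(f"all subsets   {grand_total():>7,}")
```

The four subsets sum to 163,306 instances, which is consistent with the "163k instances" figure given for the dataset as a whole.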

Dataset creation

  • Source data:

    • Datasets used as sources of human text: multi_eurlex, xsum, csebuetnlp/xlsum, mlsum, amazon_polarity, https://sinai.ujaen.es/investigacion/recursos/coah, https://sinai.ujaen.es/investigacion/recursos/coar, carblacac/twitter-sentiment-analysis, cardiffnlp/tweet_sentiment_multilingual, https://www.kaggle.com/datasets/ricardomoya/tweets-poltica-espaa, wiki_lingua
  • License information: CC-BY-NC-SA-4.0

  • Citation information:

    @inproceedings{autextification2023, title = "Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains", author = "Sarvazyan, Areg Mikael and Gonz{\'a}lez, Jos{\'e} {\'A}ngel and Franco-Salvador, Marc and Rangel, Francisco and Chulvi, Berta and Rosso, Paolo", month = sep, year = "2023", address = "Jaén, Spain", booktitle = "Procesamiento del Lenguaje Natural", }
