
almanach/hc3_french_ood

Source: Hugging Face | Updated: 2023-06-05 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/almanach/hc3_french_ood
Resource description:
---
task_categories:
- text-classification
- question-answering
- sentence-similarity
- zero-shot-classification
language:
- en
- fr
size_categories:
- 10K<n<100K
tags:
- ChatGPT
- Bing
- LM Detection
- Detection
- OOD
license: cc-by-sa-4.0
---

Dataset card for the dataset used in:

## Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect?

Paper: https://gitlab.inria.fr/wantoun/robust-chatgpt-detection/-/raw/main/towards_chatgpt_detection.pdf

Source Code: https://gitlab.inria.fr/wantoun/robust-chatgpt-detection

## Dataset Summary

#### Overview:

This dataset is made of two parts:
- First, an extension of the [Human ChatGPT Comparison Corpus (HC3) dataset](https://huggingface.co/datasets/Hello-SimpleAI/HC3) with French data automatically translated from the English source.
- Second, out-of-domain and adversarial French datasets that have been gathered (Human adversarial, BingGPT, Native French ChatGPT responses).

#### Details:

- We first format the data into three subsets: `sentence`, `question` and `full`, following the original paper.
- We then extend the data by translating the English questions and answers to French.
- We provide native French ChatGPT responses to a sample of the translated questions.
- We added a subset with QA pairs from BingGPT.
- We included an adversarial subset with human-written answers in the style of conversational LLMs like Bing/ChatGPT.

## Available Subsets

### Out-of-domain:

- `hc3_fr_qa_chatgpt`: Translated French questions paired with native French ChatGPT answers from HC3. This is the `ChatGPT-Native` subset from the paper.
  - Features: `id`, `question`, `answer`, `chatgpt_answer`, `label`, `source`
  - Size:
    - test: `113` examples, `25592` words
- `qa_fr_binggpt`: French questions paired with BingGPT answers. This is the `BingGPT` subset from the paper.
  - Features: `id`, `question`, `answer`, `label`, `deleted_clues`, `deleted_sources`, `remarks`
  - Size:
    - test: `106` examples, `26291` words
- `qa_fr_binglikehuman`: French questions paired with human-written, BingGPT-like answers. This is the `Adversarial` subset from the paper.
  - Features: `id`, `question`, `answer`, `label`, `source`
  - Size:
    - test: `61` examples, `17328` words
- `faq_fr_gouv`: French FAQ question-answer pairs from domains ending with `.gouv`, taken from the MQA dataset (subset 'fr-faq-page'). https://huggingface.co/datasets/clips/mqa. This is the `FAQ-Gouv` subset from the paper.
  - Features: `id`, `page_id`, `question_id`, `answer_id`, `bucket`, `domain`, `question`, `answer`, `label`
  - Size:
    - test: `235` examples, `22336` words
- `faq_fr_random`: French FAQ question-answer pairs from random domains, taken from the MQA dataset (subset 'fr-faq-page'). https://huggingface.co/datasets/clips/mqa. This is the `FAQ-Rand` subset from the paper.
  - Features: `id`, `page_id`, `question_id`, `answer_id`, `bucket`, `domain`, `question`, `answer`, `label`
  - Size:
    - test: `4454` examples, `271823` words

### In-domain:

- `hc3_en_qa`: English question-answer pairs from HC3.
  - Features: `id`, `question`, `answer`, `label`, `source`
  - Size:
    - train: `68335` examples, `12306363` words
    - validation: `17114` examples, `3089634` words
    - test: `710` examples, `117001` words
- `hc3_en_sentence`: English answers from HC3, split into sentences.
  - Features: `id`, `text`, `label`, `source`
  - Size:
    - train: `455320` examples, `9983784` words
    - validation: `113830` examples, `2510290` words
    - test: `4366` examples, `99965` words
- `hc3_en_full`: English question-answer pairs from HC3, concatenated.
  - Features: `id`, `text`, `label`, `source`
  - Size:
    - train: `68335` examples, `9982863` words
    - validation: `17114` examples, `2510058` words
    - test: `710` examples, `99926` words
- `hc3_fr_qa`: Translated French question-answer pairs from HC3.
  - Features: `id`, `question`, `answer`, `label`, `source`
  - Size:
    - train: `68283` examples, `12660717` words
    - validation: `17107` examples, `3179128` words
    - test: `710` examples, `127193` words
- `hc3_fr_sentence`: Translated French answers from HC3, split into sentences.
  - Features: `id`, `text`, `label`, `source`
  - Size:
    - train: `464885` examples, `10189606` words
    - validation: `116524` examples, `2563258` words
    - test: `4366` examples, `108374` words
- `hc3_fr_full`: Translated French question-answer pairs from HC3, concatenated.
  - Features: `id`, `text`, `label`, `source`
  - Size:
    - train: `68283` examples, `10188669` words
    - validation: `17107` examples, `2563037` words
    - test: `710` examples, `108352` words

## How to load

```python
from datasets import load_dataset

# Repository id as given in this page's title and download link;
# the second argument selects the subset (configuration) to load.
dataset = load_dataset("almanach/hc3_french_ood", "hc3_fr_qa")
```

## Dataset Copyright

If a source dataset used in this corpus has a specific license that is stricter than CC-BY-SA, our products follow the same license. If not, they follow the CC-BY-SA license.

| English Split | Source | Source License | Note |
|----------|-------------|--------|-------------|
| reddit_eli5 | [ELI5](https://github.com/facebookresearch/ELI5) | BSD License | |
| open_qa | [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419) | [PWC Custom](https://paperswithcode.com/datasets/license) | |
| wiki_csai | Wikipedia | CC-BY-SA | [Wiki FAQ](https://en.wikipedia.org/wiki/Wikipedia:FAQ/Copyright) |
| medicine | [Medical Dialog](https://github.com/UCSD-AI4H/Medical-Dialogue-System) | Unknown | [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| finance | [FiQA](https://paperswithcode.com/dataset/fiqa-1) | Unknown | Asking by 📧 |
| FAQ | [MQA](https://huggingface.co/datasets/clips/mqa) | CC0 1.0 | |
| ChatGPT/BingGPT | | Unknown | This is ChatGPT/BingGPT generated data. |
| Human | | CC-BY-SA | |

## Citation

```bibtex
@proceedings{towards-a-robust-2023-antoun,
    title = "Towards a Robust Detection of Language Model-Generated Text: Is ChatGPT that easy to detect?",
    editor = "Antoun, Wissam and Mouilleron, Virginie and Sagot, Benoit and Seddah, Djam{\'e}",
    month = "6",
    year = "2023",
    address = "Paris, France",
    publisher = "ATALA",
    url = "https://gitlab.inria.fr/wantoun/robust-chatgpt-detection/-/raw/main/towards_chatgpt_detection.pdf",
}
```

```bibtex
@article{guo-etal-2023-hc3,
    title = "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection",
    author = "Guo, Biyang and Zhang, Xin and Wang, Ziyuan and Jiang, Minqi and Nie, Jinran and Ding, Yuxuan and Yue, Jianwei and Wu, Yupeng",
    journal = {arXiv preprint arxiv:2301.07597},
    year = "2023",
    url = "https://arxiv.org/abs/2301.07597"
}
```
Provided by:
almanach
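Summary of original information

Dataset overview

Task categories

  • Text classification
  • Question answering
  • Sentence similarity
  • Zero-shot classification

Languages

  • English
  • French

Size category

  • 10K<n<100K

Tags

  • ChatGPT
  • Bing
  • LM Detection
  • Detection
  • OOD

License

  • cc-by-sa-4.0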

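Dataset composition

Overview

Details

  • The data is formatted into three subsets: sentence, question, and full.
  • The data is extended by translating the English questions and answers into French.
  • Native French ChatGPT responses are provided for a sample of the translated questions.
  • A subset of QA pairs from BingGPT is added.
  • An adversarial subset of human-written answers imitating the Bing/ChatGPT style is included.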
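
The formatting step in the list above can be pictured with a small sketch. The exact procedure is not given on this page, so everything below is an assumption made for illustration: the helper names `to_full` and `to_sentences`, the space-joined concatenation, the naive regex sentence splitter, the derived `id` scheme, and the sample record (including its `label` value) are all hypothetical and may differ from what the authors actually did.

```python
import re


def to_full(record):
    """Concatenate question and answer into one `text` field (assumed layout)."""
    return {
        "id": record["id"],
        "text": f'{record["question"]} {record["answer"]}',
        "label": record["label"],
        "source": record["source"],
    }


def to_sentences(record):
    """Split the answer into sentences with a naive punctuation rule (assumption)."""
    sentences = re.split(r"(?<=[.!?])\s+", record["answer"].strip())
    return [
        {
            "id": f'{record["id"]}-{i}',  # hypothetical id scheme
            "text": sentence,
            "label": record["label"],
            "source": record["source"],
        }
        for i, sentence in enumerate(sentences)
        if sentence
    ]


# Made-up record using the `qa` feature names listed for the HC3 subsets.
example = {
    "id": "0",
    "question": "What is HC3?",
    "answer": "It is a corpus. It compares humans and ChatGPT.",
    "label": 0,
    "source": "demo",
}
```

Calling `to_full(example)` yields a single record with a `text` field, while `to_sentences(example)` yields one record per sentence, which mirrors the `id`, `text`, `label`, `source` features listed for the `full` and `sentence` subsets.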

Available subsets

Out-of-domain subsets

  • hc3_fr_qa_chatgpt: French questions paired with native French ChatGPT answers.
    • Features: id, question, answer, chatgpt_answer, label, source
    • Size: test - 113 examples, 25592 words
  • qa_fr_binggpt: French questions paired with BingGPT answers.
    • Features: id, question, answer, label, deleted_clues, deleted_sources, remarks
    • Size: test - 106 examples, 26291 words
  • qa_fr_binglikehuman: French questions paired with human-written answers in the BingGPT style.
    • Features: id, question, answer, label, source
    • Size: test - 61 examples, 17328 words
  • faq_fr_gouv: French FAQ question-answer pairs from .gouv domains.
    • Features: id, page_id, question_id, answer_id, bucket, domain, question, answer, label
    • Size: test - 235 examples, 22336 words
  • faq_fr_random: French FAQ question-answer pairs from random domains.
    • Features: id, page_id, question_id, answer_id, bucket, domain, question, answer, label
    • Size: test - 4454 examples, 271823 words
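
Since the out-of-domain subsets listed above only come with a test split, loading them might look like the sketch below. It assumes the `datasets` library is installed and that the repository id matches this page's title (`almanach/hc3_french_ood`); the helper name `load_ood_test_sets` is hypothetical.

```python
# Out-of-domain subset (configuration) names as listed on this page.
OOD_SUBSETS = (
    "hc3_fr_qa_chatgpt",
    "qa_fr_binggpt",
    "qa_fr_binglikehuman",
    "faq_fr_gouv",
    "faq_fr_random",
)


def load_ood_test_sets(repo_id="almanach/hc3_french_ood"):
    """Return {subset name: test split} for every out-of-domain subset.

    The repository id is an assumption taken from this page's title.
    """
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    return {
        name: load_dataset(repo_id, name, split="test")
        for name in OOD_SUBSETS
    }


# Usage (requires network access):
#   for name, ds in load_ood_test_sets().items():
#       print(name, ds.num_rows, ds.column_names)
```

Comparing the printed `num_rows` and `column_names` with the sizes and features listed above is a quick way to confirm that the expected subset was loaded.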

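In-domain subsets

  • hc3_en_qa: English question-answer pairs.
    • Features: id, question, answer, label, source
    • Size: train - 68335 examples, 12306363 words; validation - 17114 examples, 3089634 words; test - 710 examples, 117001 words
  • hc3_en_sentence: English answers split into sentences.
    • Features: id, text, label, source
    • Size: train - 455320 examples, 9983784 words; validation - 113830 examples, 2510290 words; test - 4366 examples, 99965 words
  • hc3_en_full: English question-answer pairs, concatenated.
    • Features: id, text, label, source
    • Size: train - 68335 examples, 9982863 words; validation - 17114 examples, 2510058 words; test - 710 examples, 99926 words
  • hc3_fr_qa: French question-answer pairs.
    • Features: id, question, answer, label, source
    • Size: train - 68283 examples, 12660717 words; validation - 17107 examples, 3179128 words; test - 710 examples, 127193 words
  • hc3_fr_sentence: French answers split into sentences.
    • Features: id, text, label, source
    • Size: train - 464885 examples, 10189606 words; validation - 116524 examples, 2563258 words; test - 4366 examples, 108374 words
  • hc3_fr_full: French question-answer pairs, concatenated.
    • Features: id, text, label, source
    • Size: train - 68283 examples, 10188669 words; validation - 17107 examples, 2563037 words; test - 710 examples, 108352 words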
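
The sizes reported above can be turned into average lengths, which helps when choosing a maximum sequence length for a detector. The snippet below is a small sketch that only reuses the train-split numbers listed on this page; the names `IN_DOMAIN_TRAIN` and `words_per_example` are hypothetical.

```python
# (examples, words) for the train split of each in-domain subset, as listed above.
IN_DOMAIN_TRAIN = {
    "hc3_en_qa": (68335, 12306363),
    "hc3_en_sentence": (455320, 9983784),
    "hc3_en_full": (68335, 9982863),
    "hc3_fr_qa": (68283, 12660717),
    "hc3_fr_sentence": (464885, 10189606),
    "hc3_fr_full": (68283, 10188669),
}


def words_per_example(sizes):
    """Map each subset name to its average number of words per example."""
    return {
        name: words / examples
        for name, (examples, words) in sizes.items()
    }


for name, average in words_per_example(IN_DOMAIN_TRAIN).items():
    print(f"{name}: {average:.1f} words per example")
```

The same function works for the validation and test figures if they are entered in the same `(examples, words)` form.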