cplane/toxigen-data

Name: cplane/toxigen-data
Creator: cplane
Published: 2026-04-21 07:31:13
License: No description available

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/cplane/toxigen-data


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- machine-generated
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
pretty_name: ToxiGen
dataset_info:
- config_name: annotated
  features:
  - name: text
    dtype: string
  - name: target_group
    dtype: string
  - name: factual?
    dtype: string
  - name: ingroup_effect
    dtype: string
  - name: lewd
    dtype: string
  - name: framing
    dtype: string
  - name: predicted_group
    dtype: string
  - name: stereotyping
    dtype: string
  - name: intent
    dtype: float64
  - name: toxicity_ai
    dtype: float64
  - name: toxicity_human
    dtype: float64
  - name: predicted_author
    dtype: string
  - name: actual_method
    dtype: string
  splits:
  - name: test
    num_bytes: 364518
    num_examples: 940
  - name: train
    num_bytes: 3238381
    num_examples: 8960
  download_size: 768996
  dataset_size: 3602899
- config_name: annotations
  features:
  - name: Input.prompt
    dtype: string
  - name: Input.text
    dtype: string
  - name: Input.time
    dtype: string
  - name: Input.generation_method
    dtype: string
  - name: Input.prompt_label
    dtype: string
  - name: Input.target_group
    dtype: string
  - name: Input.binary_prompt_label
    dtype: int64
  - name: Answer.annotatorAge
    dtype: string
  - name: Answer.annotatorGender
    dtype: string
  - name: Answer.annotatorMinority
    dtype: string
  - name: Answer.annotatorPolitics.1
    dtype: bool
  - name: Answer.annotatorPolitics.2
    dtype: bool
  - name: Answer.annotatorPolitics.3
    dtype: bool
  - name: Answer.annotatorPolitics.4
    dtype: bool
  - name: Answer.annotatorPolitics.5
    dtype: bool
  - name: Answer.annotatorRace
    dtype: string
  - name: Answer.factSelect
    dtype: string
  - name: Answer.framingQ
    dtype: string
  - name: Answer.inGroup.on
    dtype: bool
  - name: Answer.ingroup.1
    dtype: bool
  - name: Answer.ingroup.2
    dtype: bool
  - name: Answer.ingroup.3
    dtype: bool
  - name: Answer.intent.1
    dtype: bool
  - name: Answer.intent.2
    dtype: bool
  - name: Answer.intent.3
    dtype: bool
  - name: Answer.intent.4
    dtype: bool
  - name: Answer.intent.5
    dtype: bool
  - name: Answer.lewd.1
    dtype: bool
  - name: Answer.lewd.2
    dtype: bool
  - name: Answer.lewd.3
    dtype: bool
  - name: Answer.refTarget
    dtype: string
  - name: Answer.stateFrame
    dtype: string
  - name: Answer.stateGroup
    dtype: string
  - name: Answer.stereo.1
    dtype: bool
  - name: Answer.stereo.2
    dtype: bool
  - name: Answer.stereo.3
    dtype: bool
  - name: Answer.toAI.1
    dtype: bool
  - name: Answer.toAI.2
    dtype: bool
  - name: Answer.toAI.3
    dtype: bool
  - name: Answer.toAI.4
    dtype: bool
  - name: Answer.toAI.5
    dtype: bool
  - name: Answer.toPER.1
    dtype: bool
  - name: Answer.toPER.2
    dtype: bool
  - name: Answer.toPER.3
    dtype: bool
  - name: Answer.toPER.4
    dtype: bool
  - name: Answer.toPER.5
    dtype: bool
  - name: Answer.writer.1
    dtype: bool
  - name: Answer.writer.2
    dtype: bool
  - name: HashedWorkerId
    dtype: int64
  splits:
  - name: train
    num_bytes: 21933185
    num_examples: 27450
  download_size: 3350653
  dataset_size: 21933185
- config_name: prompts
  features:
  - name: text
    dtype: string
  splits:
  - name: hate_trans_1k
    num_bytes: 585554
    num_examples: 1000
  - name: neutral_black_1k
    num_bytes: 857769
    num_examples: 1000
  - name: hate_native_american_1k
    num_bytes: 480000
    num_examples: 1000
  - name: neutral_immigrant_1k
    num_bytes: 342243
    num_examples: 1000
  - name: hate_middle_east_1k
    num_bytes: 426551
    num_examples: 1000
  - name: neutral_lgbtq_1k
    num_bytes: 914319
    num_examples: 1000
  - name: neutral_women_1k
    num_bytes: 394963
    num_examples: 1000
  - name: neutral_chinese_1k
    num_bytes: 412062
    num_examples: 1000
  - name: hate_latino_1k
    num_bytes: 708000
    num_examples: 1000
  - name: hate_bisexual_1k
    num_bytes: 447794
    num_examples: 1000
  - name: hate_mexican_1k
    num_bytes: 675444
    num_examples: 1000
  - name: hate_asian_1k
    num_bytes: 503093
    num_examples: 1000
  - name: neutral_mental_disability_1k
    num_bytes: 556905
    num_examples: 1000
  - name: neutral_mexican_1k
    num_bytes: 483603
    num_examples: 1000
  - name: hate_mental_disability_1k
    num_bytes: 480620
    num_examples: 1000
  - name: neutral_bisexual_1k
    num_bytes: 915612
    num_examples: 1000
  - name: neutral_latino_1k
    num_bytes: 470000
    num_examples: 1000
  - name: hate_chinese_1k
    num_bytes: 384934
    num_examples: 1000
  - name: neutral_jewish_1k
    num_bytes: 649674
    num_examples: 1000
  - name: hate_muslim_1k
    num_bytes: 425760
    num_examples: 1000
  - name: neutral_asian_1k
    num_bytes: 615895
    num_examples: 1000
  - name: hate_physical_disability_1k
    num_bytes: 413643
    num_examples: 1000
  - name: hate_jewish_1k
    num_bytes: 573538
    num_examples: 1000
  - name: neutral_muslim_1k
    num_bytes: 491659
    num_examples: 1000
  - name: hate_immigrant_1k
    num_bytes: 285309
    num_examples: 1000
  - name: hate_black_1k
    num_bytes: 745295
    num_examples: 1000
  - name: hate_lgbtq_1k
    num_bytes: 577075
    num_examples: 1000
  - name: hate_women_1k
    num_bytes: 389583
    num_examples: 1000
  - name: neutral_middle_east_1k
    num_bytes: 415319
    num_examples: 1000
  - name: neutral_native_american_1k
    num_bytes: 586993
    num_examples: 1000
  - name: neutral_physical_disability_1k
    num_bytes: 458497
    num_examples: 1000
  download_size: 1698170
  dataset_size: 16667706
- config_name: train
  features:
  - name: prompt
    dtype: string
  - name: generation
    dtype: string
  - name: generation_method
    dtype: string
  - name: group
    dtype: string
  - name: prompt_label
    dtype: int64
  - name: roberta_prediction
    dtype: float64
  splits:
  - name: train
    num_bytes: 169400442
    num_examples: 250951
  download_size: 18784380
  dataset_size: 169400442
configs:
- config_name: annotated
  default: true
  data_files:
  - split: test
    path: annotated/test-*
  - split: train
    path: annotated/train-*
- config_name: annotations
  data_files:
  - split: train
    path: annotations/train-*
- config_name: prompts
  data_files:
  - split: hate_trans_1k
    path: prompts/hate_trans_1k-*
  - split: neutral_black_1k
    path: prompts/neutral_black_1k-*
  - split: hate_native_american_1k
    path: prompts/hate_native_american_1k-*
  - split: neutral_immigrant_1k
    path: prompts/neutral_immigrant_1k-*
  - split: hate_middle_east_1k
    path: prompts/hate_middle_east_1k-*
  - split: neutral_lgbtq_1k
    path: prompts/neutral_lgbtq_1k-*
  - split: neutral_women_1k
    path: prompts/neutral_women_1k-*
  - split: neutral_chinese_1k
    path: prompts/neutral_chinese_1k-*
  - split: hate_latino_1k
    path: prompts/hate_latino_1k-*
  - split: hate_bisexual_1k
    path: prompts/hate_bisexual_1k-*
  - split: hate_mexican_1k
    path: prompts/hate_mexican_1k-*
  - split: hate_asian_1k
    path: prompts/hate_asian_1k-*
  - split: neutral_mental_disability_1k
    path: prompts/neutral_mental_disability_1k-*
  - split: neutral_mexican_1k
    path: prompts/neutral_mexican_1k-*
  - split: hate_mental_disability_1k
    path: prompts/hate_mental_disability_1k-*
  - split: neutral_bisexual_1k
    path: prompts/neutral_bisexual_1k-*
  - split: neutral_latino_1k
    path: prompts/neutral_latino_1k-*
  - split: hate_chinese_1k
    path: prompts/hate_chinese_1k-*
  - split: neutral_jewish_1k
    path: prompts/neutral_jewish_1k-*
  - split: hate_muslim_1k
    path: prompts/hate_muslim_1k-*
  - split: neutral_asian_1k
    path: prompts/neutral_asian_1k-*
  - split: hate_physical_disability_1k
    path: prompts/hate_physical_disability_1k-*
  - split: hate_jewish_1k
    path: prompts/hate_jewish_1k-*
  - split: neutral_muslim_1k
    path: prompts/neutral_muslim_1k-*
  - split: hate_immigrant_1k
    path: prompts/hate_immigrant_1k-*
  - split: hate_black_1k
    path: prompts/hate_black_1k-*
  - split: hate_lgbtq_1k
    path: prompts/hate_lgbtq_1k-*
  - split: hate_women_1k
    path: prompts/hate_women_1k-*
  - split: neutral_middle_east_1k
    path: prompts/neutral_middle_east_1k-*
  - split: neutral_native_american_1k
    path: prompts/neutral_native_american_1k-*
  - split: neutral_physical_disability_1k
    path: prompts/neutral_physical_disability_1k-*
- config_name: train
  data_files:
  - split: train
    path: train/train-*
---

# Dataset Card for ToxiGen

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
- [Additional Information](#additional-information)
  - [Citation Information](#citation-information)

## Sign up for Data Access

To access ToxiGen, first fill out [this form](https://forms.office.com/r/r6VXX8f8vh).

## Dataset Description

- **Repository:** https://github.com/microsoft/toxigen
- **Paper:** https://arxiv.org/abs/2203.09509
- **Point of Contact #1:** [Tom Hartvigsen](mailto:tomh@mit.edu)
- **Point of Contact #2:** [Saadia Gabriel](mailto:skgabrie@cs.washington.edu)

### Dataset Summary

This dataset is for implicit hate speech detection. All instances were generated using GPT-3 and the methods described in [our paper](https://arxiv.org/abs/2203.09509).

### Languages

All text is written in English.

## Dataset Structure

### Data Fields

We release ToxiGen as a dataframe with the following fields:

- **prompt** is the prompt used for **generation**.
- **generation** is the ToxiGen-generated text.
- **generation_method** denotes whether ALICE was used to produce the corresponding generation. If this value is ALICE, ALICE was used; if it is TopK, ALICE was not used.
- **prompt_label** is a binary value indicating whether the prompt is toxic (1 is toxic, 0 is benign).
- **group** indicates the target group of the prompt.
- **roberta_prediction** is the probability predicted by our corresponding RoBERTa model for each instance.

### Citation Information

```bibtex
@inproceedings{hartvigsen2022toxigen,
  title={ToxiGen: A Large-Scale Machine-Generated Dataset for Implicit and Adversarial Hate Speech Detection},
  author={Hartvigsen, Thomas and Gabriel, Saadia and Palangi, Hamid and Sap, Maarten and Ray, Dipankar and Kamar, Ece},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  year={2022}
}
```
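
To make the Data Fields description above more concrete, here is a minimal Python sketch of working with rows shaped like the `train` configuration. The sample rows are invented placeholders rather than real ToxiGen records, and the `summarize` helper is hypothetical. It also assumes that `roberta_prediction` can be read as a toxicity probability and that 0.5 is a sensible cutoff; the card states neither.

```python
from collections import Counter

# Hypothetical rows that mimic the `train` config schema.
# Text and group values are placeholders, not real ToxiGen records.
rows = [
    {"prompt": "placeholder prompt A", "generation": "placeholder text A",
     "generation_method": "ALICE", "group": "women",
     "prompt_label": 1, "roberta_prediction": 0.91},
    {"prompt": "placeholder prompt B", "generation": "placeholder text B",
     "generation_method": "ALICE", "group": "black",
     "prompt_label": 0, "roberta_prediction": 0.08},
    {"prompt": "placeholder prompt C", "generation": "placeholder text C",
     "generation_method": "TopK", "group": "women",
     "prompt_label": 1, "roberta_prediction": 0.35},
    {"prompt": "placeholder prompt D", "generation": "placeholder text D",
     "generation_method": "TopK", "group": "black",
     "prompt_label": 0, "roberta_prediction": 0.12},
]


def summarize(rows, threshold=0.5):
    """Count rows per generation method and measure how often the
    thresholded RoBERTa score matches the prompt label.

    Assumption: roberta_prediction >= threshold is treated as "toxic".
    """
    by_method = Counter(r["generation_method"] for r in rows)
    if not rows:
        return {"by_method": {}, "agreement": 0.0}
    agree = sum(
        int(r["roberta_prediction"] >= threshold) == r["prompt_label"]
        for r in rows
    )
    return {"by_method": dict(by_method), "agreement": agree / len(rows)}


if __name__ == "__main__":
    print(summarize(rows))
```

With the Hugging Face `datasets` library installed and access granted, real rows could presumably be obtained with `load_dataset("cplane/toxigen-data", "train", split="train")`. That call has not been verified against this repository.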

Provider:

cplane
