facebook/md_gender_bias

Name: facebook/md_gender_bias
Creator: facebook
Published: 2024-01-18 11:08:47
License: 暂无描述

Hugging Face2024-01-18 更新2024-06-15 收录

下载链接：

https://hf-mirror.com/datasets/facebook/md_gender_bias

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - crowdsourced - found - machine-generated language_creators: - crowdsourced - found language: - en license: - mit multilinguality: - monolingual size_categories: - 100K<n<1M - 10K<n<100K - 1K<n<10K - 1M<n<10M - n<1K source_datasets: - extended|other-convai2 - extended|other-light - extended|other-opensubtitles - extended|other-yelp - original task_categories: - text-classification task_ids: [] paperswithcode_id: md-gender pretty_name: Multi-Dimensional Gender Bias Classification tags: - gender-bias dataset_info: - config_name: gendered_words features: - name: word_masculine dtype: string - name: word_feminine dtype: string splits: - name: train num_bytes: 4988 num_examples: 222 download_size: 232629010 dataset_size: 4988 - config_name: name_genders features: - name: name dtype: string - name: assigned_gender dtype: class_label: names: '0': M '1': F - name: count dtype: int32 splits: - name: yob1880 num_bytes: 43404 num_examples: 2000 - name: yob1881 num_bytes: 41944 num_examples: 1935 - name: yob1882 num_bytes: 46211 num_examples: 2127 - name: yob1883 num_bytes: 45221 num_examples: 2084 - name: yob1884 num_bytes: 49886 num_examples: 2297 - name: yob1885 num_bytes: 49810 num_examples: 2294 - name: yob1886 num_bytes: 51935 num_examples: 2392 - name: yob1887 num_bytes: 51458 num_examples: 2373 - name: yob1888 num_bytes: 57531 num_examples: 2651 - name: yob1889 num_bytes: 56177 num_examples: 2590 - name: yob1890 num_bytes: 58509 num_examples: 2695 - name: yob1891 num_bytes: 57767 num_examples: 2660 - name: yob1892 num_bytes: 63493 num_examples: 2921 - name: yob1893 num_bytes: 61525 num_examples: 2831 - name: yob1894 num_bytes: 63927 num_examples: 2941 - name: yob1895 num_bytes: 66346 num_examples: 3049 - name: yob1896 num_bytes: 67224 num_examples: 3091 - name: yob1897 num_bytes: 65886 num_examples: 3028 - name: yob1898 num_bytes: 71088 num_examples: 3264 - name: yob1899 num_bytes: 66225 num_examples: 3042 - name: yob1900 num_bytes: 81305 num_examples: 3730 - name: yob1901 num_bytes: 68723 num_examples: 3153 - name: yob1902 num_bytes: 73321 num_examples: 3362 - name: yob1903 num_bytes: 74019 num_examples: 3389 - name: yob1904 num_bytes: 77751 num_examples: 3560 - name: yob1905 num_bytes: 79802 num_examples: 3655 - name: yob1906 num_bytes: 79392 num_examples: 3633 - name: yob1907 num_bytes: 86342 num_examples: 3948 - name: yob1908 num_bytes: 87965 num_examples: 4018 - name: yob1909 num_bytes: 92591 num_examples: 4227 - name: yob1910 num_bytes: 101491 num_examples: 4629 - name: yob1911 num_bytes: 106787 num_examples: 4867 - name: yob1912 num_bytes: 139448 num_examples: 6351 - name: yob1913 num_bytes: 153110 num_examples: 6968 - name: yob1914 num_bytes: 175167 num_examples: 7965 - name: yob1915 num_bytes: 205921 num_examples: 9357 - name: yob1916 num_bytes: 213468 num_examples: 9696 - name: yob1917 num_bytes: 218446 num_examples: 9913 - name: yob1918 num_bytes: 229209 num_examples: 10398 - name: yob1919 num_bytes: 228656 num_examples: 10369 - name: yob1920 num_bytes: 237286 num_examples: 10756 - name: yob1921 num_bytes: 239616 num_examples: 10857 - name: yob1922 num_bytes: 237569 num_examples: 10756 - name: yob1923 num_bytes: 235046 num_examples: 10643 - name: yob1924 num_bytes: 240113 num_examples: 10869 - name: yob1925 num_bytes: 235098 num_examples: 10638 - name: yob1926 num_bytes: 230970 num_examples: 10458 - name: yob1927 num_bytes: 230004 num_examples: 10406 - name: yob1928 num_bytes: 224583 num_examples: 10159 - name: yob1929 num_bytes: 217057 num_examples: 9820 - name: yob1930 num_bytes: 216352 num_examples: 9791 - name: yob1931 num_bytes: 205361 num_examples: 9298 - name: yob1932 num_bytes: 207268 num_examples: 9381 - name: yob1933 num_bytes: 199031 num_examples: 9013 - name: yob1934 num_bytes: 202758 num_examples: 9180 - name: yob1935 num_bytes: 199614 num_examples: 9037 - name: yob1936 num_bytes: 196379 num_examples: 8894 - name: yob1937 num_bytes: 197757 num_examples: 8946 - name: yob1938 num_bytes: 199603 num_examples: 9032 - name: yob1939 num_bytes: 196979 num_examples: 8918 - name: yob1940 num_bytes: 198141 num_examples: 8961 - name: yob1941 num_bytes: 200858 num_examples: 9085 - name: yob1942 num_bytes: 208363 num_examples: 9425 - name: yob1943 num_bytes: 207940 num_examples: 9408 - name: yob1944 num_bytes: 202227 num_examples: 9152 - name: yob1945 num_bytes: 199478 num_examples: 9025 - name: yob1946 num_bytes: 214614 num_examples: 9705 - name: yob1947 num_bytes: 229327 num_examples: 10371 - name: yob1948 num_bytes: 226615 num_examples: 10241 - name: yob1949 num_bytes: 227278 num_examples: 10269 - name: yob1950 num_bytes: 227946 num_examples: 10303 - name: yob1951 num_bytes: 231613 num_examples: 10462 - name: yob1952 num_bytes: 235483 num_examples: 10646 - name: yob1953 num_bytes: 239654 num_examples: 10837 - name: yob1954 num_bytes: 242389 num_examples: 10968 - name: yob1955 num_bytes: 245652 num_examples: 11115 - name: yob1956 num_bytes: 250674 num_examples: 11340 - name: yob1957 num_bytes: 255370 num_examples: 11564 - name: yob1958 num_bytes: 254520 num_examples: 11522 - name: yob1959 num_bytes: 260051 num_examples: 11767 - name: yob1960 num_bytes: 263474 num_examples: 11921 - name: yob1961 num_bytes: 269493 num_examples: 12182 - name: yob1962 num_bytes: 270244 num_examples: 12209 - name: yob1963 num_bytes: 271872 num_examples: 12282 - name: yob1964 num_bytes: 274590 num_examples: 12397 - name: yob1965 num_bytes: 264889 num_examples: 11952 - name: yob1966 num_bytes: 269321 num_examples: 12151 - name: yob1967 num_bytes: 274867 num_examples: 12397 - name: yob1968 num_bytes: 286774 num_examples: 12936 - name: yob1969 num_bytes: 304909 num_examples: 13749 - name: yob1970 num_bytes: 328047 num_examples: 14779 - name: yob1971 num_bytes: 339657 num_examples: 15295 - name: yob1972 num_bytes: 342321 num_examples: 15412 - name: yob1973 num_bytes: 348414 num_examples: 15682 - name: yob1974 num_bytes: 361188 num_examples: 16249 - name: yob1975 num_bytes: 376491 num_examples: 16944 - name: yob1976 num_bytes: 386565 num_examples: 17391 - name: yob1977 num_bytes: 403994 num_examples: 18175 - name: yob1978 num_bytes: 405430 num_examples: 18231 - name: yob1979 num_bytes: 423423 num_examples: 19039 - name: yob1980 num_bytes: 432317 num_examples: 19452 - name: yob1981 num_bytes: 432980 num_examples: 19475 - name: yob1982 num_bytes: 437986 num_examples: 19694 - name: yob1983 num_bytes: 431531 num_examples: 19407 - name: yob1984 num_bytes: 434085 num_examples: 19506 - name: yob1985 num_bytes: 447113 num_examples: 20085 - name: yob1986 num_bytes: 460315 num_examples: 20657 - name: yob1987 num_bytes: 477677 num_examples: 21406 - name: yob1988 num_bytes: 499347 num_examples: 22367 - name: yob1989 num_bytes: 531020 num_examples: 23775 - name: yob1990 num_bytes: 552114 num_examples: 24716 - name: yob1991 num_bytes: 560932 num_examples: 25109 - name: yob1992 num_bytes: 568151 num_examples: 25427 - name: yob1993 num_bytes: 579778 num_examples: 25966 - name: yob1994 num_bytes: 580223 num_examples: 25997 - name: yob1995 num_bytes: 581949 num_examples: 26080 - name: yob1996 num_bytes: 589131 num_examples: 26423 - name: yob1997 num_bytes: 601284 num_examples: 26970 - name: yob1998 num_bytes: 621587 num_examples: 27902 - name: yob1999 num_bytes: 635355 num_examples: 28552 - name: yob2000 num_bytes: 662398 num_examples: 29772 - name: yob2001 num_bytes: 673111 num_examples: 30274 - name: yob2002 num_bytes: 679392 num_examples: 30564 - name: yob2003 num_bytes: 692931 num_examples: 31185 - name: yob2004 num_bytes: 711776 num_examples: 32048 - name: yob2005 num_bytes: 723065 num_examples: 32549 - name: yob2006 num_bytes: 757620 num_examples: 34088 - name: yob2007 num_bytes: 776893 num_examples: 34961 - name: yob2008 num_bytes: 779403 num_examples: 35079 - name: yob2009 num_bytes: 771032 num_examples: 34709 - name: yob2010 num_bytes: 756717 num_examples: 34073 - name: yob2011 num_bytes: 752804 num_examples: 33908 - name: yob2012 num_bytes: 748915 num_examples: 33747 - name: yob2013 num_bytes: 738288 num_examples: 33282 - name: yob2014 num_bytes: 737219 num_examples: 33243 - name: yob2015 num_bytes: 734183 num_examples: 33121 - name: yob2016 num_bytes: 731291 num_examples: 33010 - name: yob2017 num_bytes: 721444 num_examples: 32590 - name: yob2018 num_bytes: 708657 num_examples: 32033 download_size: 232629010 dataset_size: 43393095 - config_name: new_data features: - name: text dtype: string - name: original dtype: string - name: labels list: class_label: names: '0': ABOUT:female '1': ABOUT:male '2': PARTNER:female '3': PARTNER:male '4': SELF:female '5': SELF:male - name: class_type dtype: class_label: names: '0': about '1': partner '2': self - name: turker_gender dtype: class_label: names: '0': man '1': woman '2': nonbinary '3': prefer not to say '4': no answer - name: episode_done dtype: bool_ - name: confidence dtype: string splits: - name: train num_bytes: 369753 num_examples: 2345 download_size: 232629010 dataset_size: 369753 - config_name: funpedia features: - name: text dtype: string - name: title dtype: string - name: persona dtype: string - name: gender dtype: class_label: names: '0': gender-neutral '1': female '2': male splits: - name: train num_bytes: 3225542 num_examples: 23897 - name: validation num_bytes: 402205 num_examples: 2984 - name: test num_bytes: 396417 num_examples: 2938 download_size: 232629010 dataset_size: 4024164 - config_name: image_chat features: - name: caption dtype: string - name: id dtype: string - name: male dtype: bool_ - name: female dtype: bool_ splits: - name: train num_bytes: 1061285 num_examples: 9997 - name: validation num_bytes: 35868670 num_examples: 338180 - name: test num_bytes: 530126 num_examples: 5000 download_size: 232629010 dataset_size: 37460081 - config_name: wizard features: - name: text dtype: string - name: chosen_topic dtype: string - name: gender dtype: class_label: names: '0': gender-neutral '1': female '2': male splits: - name: train num_bytes: 1158785 num_examples: 10449 - name: validation num_bytes: 57824 num_examples: 537 - name: test num_bytes: 53126 num_examples: 470 download_size: 232629010 dataset_size: 1269735 - config_name: convai2_inferred features: - name: text dtype: string - name: binary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male - name: binary_score dtype: float32 - name: ternary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male '2': ABOUT:gender-neutral - name: ternary_score dtype: float32 splits: - name: train num_bytes: 9853669 num_examples: 131438 - name: validation num_bytes: 608046 num_examples: 7801 - name: test num_bytes: 608046 num_examples: 7801 download_size: 232629010 dataset_size: 11069761 - config_name: light_inferred features: - name: text dtype: string - name: binary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male - name: binary_score dtype: float32 - name: ternary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male '2': ABOUT:gender-neutral - name: ternary_score dtype: float32 splits: - name: train num_bytes: 10931355 num_examples: 106122 - name: validation num_bytes: 679692 num_examples: 6362 - name: test num_bytes: 1375745 num_examples: 12765 download_size: 232629010 dataset_size: 12986792 - config_name: opensubtitles_inferred features: - name: text dtype: string - name: binary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male - name: binary_score dtype: float32 - name: ternary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male '2': ABOUT:gender-neutral - name: ternary_score dtype: float32 splits: - name: train num_bytes: 27966476 num_examples: 351036 - name: validation num_bytes: 3363802 num_examples: 41957 - name: test num_bytes: 3830528 num_examples: 49108 download_size: 232629010 dataset_size: 35160806 - config_name: yelp_inferred features: - name: text dtype: string - name: binary_label dtype: class_label: names: '0': ABOUT:female '1': ABOUT:male - name: binary_score dtype: float32 splits: - name: train num_bytes: 260582945 num_examples: 2577862 - name: validation num_bytes: 324349 num_examples: 4492 - name: test num_bytes: 53887700 num_examples: 534460 download_size: 232629010 dataset_size: 314794994 config_names: - convai2_inferred - funpedia - gendered_words - image_chat - light_inferred - name_genders - new_data - opensubtitles_inferred - wizard - yelp_inferred --- # Dataset Card for Multi-Dimensional Gender Bias Classification ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** [ParlAI MD Gender Project Page](https://parl.ai/projects/md_gender/) - **Repository:** [ParlAI Github MD Gender Repository](https://github.com/facebookresearch/ParlAI/tree/master/projects/md_gender) - **Paper:** [Multi-Dimensional Gender Bias Classification](https://www.aclweb.org/anthology/2020.emnlp-main.23.pdf) - **Leaderboard:** [Needs More Information] - **Point of Contact:** edinan@fb.com ### Dataset Summary The Multi-Dimensional Gender Bias Classification dataset is based on a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. It contains seven large scale datasets automatically annotated for gender information (there are eight in the original project but the Wikipedia set is not included in the HuggingFace distribution), one crowdsourced evaluation benchmark of utterance-level gender rewrites, a list of gendered names, and a list of gendered words in English. ### Supported Tasks and Leaderboards - `text-classification-other-gender-bias`: The dataset can be used to train a model for classification of various kinds of gender bias. The model performance is evaluated based on the accuracy of the predicted labels as compared to the given labels in the dataset. Dinan et al's (2020) Transformer model achieved an average of 67.13% accuracy in binary gender prediction across the ABOUT, TO, and AS tasks. See the paper for more results. ### Languages The data is in English as spoken on the various sites where the data was collected. The associated BCP-47 code `en`. ## Dataset Structure ### Data Instances The following are examples of data instances from the various configs in the dataset. See the [MD Gender Bias dataset viewer](https://huggingface.co/datasets/viewer/?dataset=md_gender_bias) to explore more examples. An example from the `new_data` config: ``` {'class_type': 0, 'confidence': 'certain', 'episode_done': True, 'labels': [1], 'original': 'She designed monumental Loviisa war cemetery in 1920', 'text': 'He designed monumental Lovissa War Cemetery in 1920.', 'turker_gender': 4} ``` An example from the `funpedia` config: ``` {'gender': 2, 'persona': 'Humorous', 'text': 'Max Landis is a comic book writer who wrote Chronicle, American Ultra, and Victor Frankestein.', 'title': 'Max Landis'} ``` An example from the `image_chat` config: ``` {'caption': '<start> a young girl is holding a pink umbrella in her hand <eos>', 'female': True, 'id': '2923e28b6f588aff2d469ab2cccfac57', 'male': False} ``` An example from the `wizard` config: ``` {'chosen_topic': 'Krav Maga', 'gender': 2, 'text': 'Hello. I hope you might enjoy or know something about Krav Maga?'} ``` An example from the `convai2_inferred` config (the other `_inferred` configs have the same fields, with the exception of `yelp_inferred`, which does not have the `ternary_label` or `ternary_score` fields): ``` {'binary_label': 1, 'binary_score': 0.6521999835968018, 'ternary_label': 2, 'ternary_score': 0.4496000111103058, 'text': "hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape ."} ``` An example from the `gendered_words` config: ``` {'word_feminine': 'countrywoman', 'word_masculine': 'countryman'} ``` An example from the `name_genders` config: ``` {'assigned_gender': 1, 'count': 7065, 'name': 'Mary'} ``` ### Data Fields The following are the features for each of the configs. For the `new_data` config: - `text`: the text to be classified - `original`: the text before reformulation - `labels`: a `list` of classification labels, with possible values including `ABOUT:female`, `ABOUT:male`, `PARTNER:female`, `PARTNER:male`, `SELF:female`. - `class_type`: a classification label, with possible values including `about` (0), `partner` (1), `self` (2). - `turker_gender`: a classification label, with possible values including `man` (0), `woman` (1), `nonbinary` (2), `prefer not to say` (3), `no answer` (4). - `episode_done`: a boolean indicating whether the conversation was completed. - `confidence`: a string indicating the confidence of the annotator in response to the instance label being ABOUT/TO/AS a man or woman. Possible values are `certain`, `pretty sure`, and `unsure`. For the `funpedia` config: - `text`: the text to be classified. - `gender`: a classification label, with possible values including `gender-neutral` (0), `female` (1), `male` (2), indicating the gender of the person being talked about. - `persona`: a string describing the persona assigned to the user when talking about the entity. - `title`: a string naming the entity the text is about. For the `image_chat` config: - `caption`: a string description of the contents of the original image. - `female`: a boolean indicating whether the gender of the person being talked about is female, if the image contains a person. - `id`: a string indicating the id of the image. - `male`: a boolean indicating whether the gender of the person being talked about is male, if the image contains a person. For the `wizard` config: - `text`: the text to be classified. - `chosen_topic`: a string indicating the topic of the text. - `gender`: a classification label, with possible values including `gender-neutral` (0), `female` (1), `male` (2), indicating the gender of the person being talked about. For the `_inferred` configurations (again, except the `yelp_inferred` split, which does not have the `ternary_label` or `ternary_score` fields): - `text`: the text to be classified. - `binary_label`: a classification label, with possible values including `ABOUT:female`, `ABOUT:male`. - `binary_score`: a float indicating a score between 0 and 1. - `ternary_label`: a classification label, with possible values including `ABOUT:female`, `ABOUT:male`, `ABOUT:gender-neutral`. - `ternary_score`: a float indicating a score between 0 and 1. For the word list: - `word_masculine`: a string indicating the masculine version of the word. - `word_feminine`: a string indicating the feminine version of the word. For the gendered name list: - `assigned_gender`: an integer, 1 for female, 0 for male. - `count`: an integer. - `name`: a string of the name. ### Data Splits The different parts of the data can be accessed through the different configurations: - `gendered_words`: A list of common nouns with a masculine and feminine variant. - `new_data`: Sentences reformulated and annotated along all three axes. - `funpedia`, `wizard`: Sentences from Funpedia and Wizards of Wikipedia annotated with ABOUT gender with entity gender information. - `image_chat`: sentences about images annotated with ABOUT gender based on gender information from the entities in the image - `convai2_inferred`, `light_inferred`, `opensubtitles_inferred`, `yelp_inferred`: Data from several source datasets with ABOUT annotations inferred by a trined classifier. | Split | M | F | N | U | Dimension | | ---------- | ---- | --- | ---- | ---- | --------- | | Image Chat | 39K | 15K | 154K | - | ABOUT | | Funpedia | 19K | 3K | 1K | - | ABOUT | | Wizard | 6K | 1K | 1K | - | ABOUT | | Yelp | 1M | 1M | - | - | AS | | ConvAI2 | 22K | 22K | - | 86K | AS | | ConvAI2 | 22K | 22K | - | 86K | TO | | OpenSub | 149K | 69K | - | 131K | AS | | OpenSub | 95K | 45K | - | 209K | TO | | LIGHT | 13K | 8K | - | 83K | AS | | LIGHT | 13K | 8K | - | 83K | TO | | ---------- | ---- | --- | ---- | ---- | --------- | | MDGender | 384 | 401 | - | - | ABOUT | | MDGender | 396 | 371 | - | - | AS | | MDGender | 411 | 382 | - | - | TO | ## Dataset Creation ### Curation Rationale The curators chose to annotate the existing corpora to make their classifiers reliable on all dimensions (ABOUT/TO/AS) and across multiple domains. However, none of the existing datasets cover all three dimensions at the same time, and many of the gender labels are noisy. To enable reliable evaluation, the curators collected a specialized corpus, found in the `new_data` config, which acts as a gold-labeled dataset for the masculine and feminine classes. ### Source Data #### Initial Data Collection and Normalization For the `new_data` config, the curators collected conversations between two speakers. Each speaker was provided with a persona description containing gender information, then tasked with adopting that persona and having a conversation. They were also provided with small sections of a biography from Wikipedia as the conversation topic in order to encourage crowdworkers to discuss ABOUT/TO/AS gender information. To ensure there is ABOUT/TO/AS gender information contained in each utterance, the curators asked a second set of annotators to rewrite each utterance to make it very clear that they are speaking ABOUT a man or a woman, speaking AS a man or a woman, and speaking TO a man or a woman. #### Who are the source language producers? This dataset was collected from crowdworkers from Amazon’s Mechanical Turk. All workers are English-speaking and located in the United States. | Reported Gender | Percent of Total | | ----------------- | ---------------- | | Man | 67.38 | | Woman | 18.34 | | Non-binary | 0.21 | | Prefer not to say | 14.07 | ### Annotations #### Annotation process For the `new_data` config, annotators were asked to label how confident they are that someone else could predict the given gender label, allowing for flexibility between explicit genderedness (like the use of "he" or "she") and statistical genderedness. Many of the annotated datasets contain cases where the ABOUT, AS, TO labels are not provided (i.e. unknown). In such instances, the curators apply one of two strategies. They apply the imputation strategy for data for which the ABOUT label is unknown using a classifier trained only on other Wikipedia data for which this label is provided. Data without a TO or AS label was assigned one at random, choosing between masculine and feminine with equal probability. Details of how each of the eight training datasets was annotated are as follows: 1. Wikipedia- to annotate ABOUT, the curators used a Wikipedia dump and extract biography pages using named entity recognition. They labeled pages with a gender based on the number of gendered pronouns (he vs. she vs. they) and labeled each paragraph in the page with this label for the ABOUT dimension. 2. Funpedia- Funpedia ([Miller et al., 2017](https://www.aclweb.org/anthology/D17-2014/)) contains rephrased Wikipedia sentences in a more conversational way. The curators retained only biography related sentences and annotate similar to Wikipedia, to give ABOUT labels. 3. Wizard of Wikipedia- [Wizard of Wikipedia](https://parl.ai/projects/wizard_of_wikipedia/) contains two people discussing a topic in Wikipedia. The curators retain only the conversations on Wikipedia biographies and annotate to create ABOUT labels. 4. ImageChat- [ImageChat](https://klshuster.github.io/image_chat/) contains conversations discussing the contents of an image. The curators used the [Xu et al. image captioning system](https://github.com/AaronCCWong/Show-Attend-and-Tell) to identify the contents of an image and select gendered examples. 5. Yelp- The curators used the Yelp reviewer gender predictor developed by ([Subramanian et al., 2018](https://arxiv.org/pdf/1811.00552.pdf)) and retain reviews for which the classifier is very confident – this creates labels for the content creator of the review (AS). They impute ABOUT labels on this dataset using a classifier trained on the datasets 1-4. 6. ConvAI2- [ConvAI2](https://parl.ai/projects/convai2/) contains persona-based conversations. Many personas contain sentences such as 'I am a old woman' or 'My name is Bob' which allows annotators to annotate the gender of the speaker (AS) and addressee (TO) with some confidence. Many of the personas have unknown gender. The curators impute ABOUT labels on this dataset using a classifier trained on the datasets 1-4. 7. OpenSubtitles- [OpenSubtitles](http://www.opensubtitles.org/) contains subtitles for movies in different languages. The curators retained English subtitles that contain a character name or identity. They annotated the character’s gender using gender kinship terms such as daughter and gender probability distribution calculated by counting the masculine and feminine names of baby names in the United States. Using the character’s gender, they produced labels for the AS dimension. They produced labels for the TO dimension by taking the gender of the next character to speak if there is another utterance in the conversation; otherwise, they take the gender of the last character to speak. They impute ABOUT labels on this dataset using a classifier trained on the datasets 1-4. 8. LIGHT- [LIGHT](https://parl.ai/projects/light/) contains persona-based conversation. Similarly to ConvAI2, annotators labeled the gender of each persona, giving labels for the speaker (AS) and speaking partner (TO). The curators impute ABOUT labels on this dataset using a classifier trained on the datasets 1-4. #### Who are the annotators? This dataset was annotated by crowdworkers from Amazon’s Mechanical Turk. All workers are English-speaking and located in the United States. ### Personal and Sensitive Information For privacy reasons the curators did not associate the self-reported gender of the annotator with the labeled examples in the dataset and only report these statistics in aggregate. ## Considerations for Using the Data ### Social Impact of Dataset This dataset is intended for applications such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and classifying text as offensive based on its genderedness. ### Discussion of Biases Over two thirds of annotators identified as men, which may introduce biases into the dataset. Wikipedia is also well known to have gender bias in equity of biographical coverage and lexical bias in noun references to women (see the paper's appendix for citations). ### Other Known Limitations The limitations of the Multi-Dimensional Gender Bias Classification dataset have not yet been investigated, but the curators acknowledge that more work is required to address the intersectionality of gender identities, i.e., when gender non-additively interacts with other identity characteristics. The curators point out that negative gender stereotyping is known to be alternatively weakened or reinforced by the presence of social attributes like dialect, class and race and that these differences have been found to affect gender classification in images and sentences encoders. See the paper for references. ## Additional Information ### Dataset Curators Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, and Adina Williams at Facebook AI Research. Angela Fan is also affiliated with Laboratoire Lorrain d’Informatique et Applications (LORIA). ### Licensing Information The Multi-Dimensional Gender Bias Classification dataset is licensed under the [MIT License](https://opensource.org/licenses/MIT). ### Citation Information ``` @inproceedings{dinan-etal-2020-multi, title = "Multi-Dimensional Gender Bias Classification", author = "Dinan, Emily and Fan, Angela and Wu, Ledell and Weston, Jason and Kiela, Douwe and Williams, Adina", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.23", doi = "10.18653/v1/2020.emnlp-main.23", pages = "314--331", abstract = "Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a novel, general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a new, crowdsourced evaluation benchmark. Distinguishing between gender bias along multiple dimensions enables us to train better and more fine-grained gender bias classifiers. We show our classifiers are valuable for a variety of applications, like controlling for gender bias in generative models, detecting gender bias in arbitrary text, and classifying text as offensive based on its genderedness.", } ``` ### Contributions Thanks to [@yjernite](https://github.com/yjernite) and [@mcmillanmajora](https://github.com/mcmillanmajora)for adding this dataset.

提供机构：

facebook

原始信息汇总

数据集概述

数据集摘要

名称： Multi-Dimensional Gender Bias Classification

语言： 英语（en）

许可证： MIT

多语言性： 单语种（monolingual）

大小分类：

n<1K
1K<n<10K
10K<n<100K
100K<n<1M
1M<n<10M

源数据集：

extended|other-convai2
extended|other-light
extended|other-opensubtitles
extended|other-yelp
original

任务类别： 文本分类（text-classification）

标签： gender-bias

数据集结构

配置名称及特征

gendered_words
- word_masculine: 男性词汇，类型为字符串（string）
- word_feminine: 女性词汇，类型为字符串（string）
name_genders
- name: 名字，类型为字符串（string）
- assigned_gender: 分配的性别，类型为分类标签（class_label），可能值为 M 和 F
- count: 计数，类型为整数（int32）
new_data
- text: 待分类的文本，类型为字符串（string）
- original: 重构前的文本，类型为字符串（string）
- labels: 分类标签列表，可能值包括 ABOUT:female, ABOUT:male, PARTNER:female, PARTNER:male, SELF:female, SELF:male
- class_type: 分类标签，可能值包括 about, partner, self
- turker_gender: 标注者的性别，类型为分类标签（class_label），可能值包括 man, woman, nonbinary, prefer not to say, no answer
- episode_done: 是否完成，类型为布尔值（bool_）
- confidence: 置信度，类型为字符串（string）
funpedia
- text: 文本，类型为字符串（string）
- title: 标题，类型为字符串（string）
- persona: 角色，类型为字符串（string）
- gender: 性别，类型为分类标签（class_label），可能值包括 gender-neutral, female, male
image_chat
- caption: 描述，类型为字符串（string）
- id: 标识符，类型为字符串（string）
- male: 是否为男性，类型为布尔值（bool_）
- female: 是否为女性，类型为布尔值（bool_）
wizard
- text: 文本，类型为字符串（string）
- chosen_topic: 选定的话题，类型为字符串（string）
- gender: 性别，类型为分类标签（class_label），可能值包括 gender-neutral, female, male
convai2_inferred
- text: 文本，类型为字符串（string）
- binary_label: 二元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male
- binary_score: 二元分类得分，类型为浮点数（float32）
- ternary_label: 三元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male, ABOUT:gender-neutral
- ternary_score: 三元分类得分，类型为浮点数（float32）
light_inferred
- text: 文本，类型为字符串（string）
- binary_label: 二元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male
- binary_score: 二元分类得分，类型为浮点数（float32）
- ternary_label: 三元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male, ABOUT:gender-neutral
- ternary_score: 三元分类得分，类型为浮点数（float32）
opensubtitles_inferred
- text: 文本，类型为字符串（string）
- binary_label: 二元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male
- binary_score: 二元分类得分，类型为浮点数（float32）
- ternary_label: 三元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male, ABOUT:gender-neutral
- ternary_score: 三元分类得分，类型为浮点数（float32）
yelp_inferred
- text: 文本，类型为字符串（string）
- binary_label: 二元分类标签，类型为分类标签（class_label），可能值包括 ABOUT:female, ABOUT:male
- binary_score: 二元分类得分，类型为浮点数（float32）

数据分割

gendered_words

train
- 字节数：4988
- 样本数：222

name_genders

yob1880 至 yob2018
- 每年包含不同的字节数和样本数，具体数值详见数据集详情页面。

new_data

train
- 字节数：369753
- 样本数：2345

funpedia

train
- 字节数：3225542
- 样本数：23897
validation
- 字节数：402205
- 样本数：2984
test
- 字节数：396417
- 样本数：2938

image_chat

train
- 字节数：1061285
- 样本数：9997
validation
- 字节数：35868670
- 样本数：338180
test
- 字节数：530126
- 样本数：5000

wizard

train
- 字节数：1158785
- 样本数：10449
validation
- 字节数：57824
- 样本数：537
test
- 字节数：53126
- 样本数：470

convai2_inferred

train
- 字节数：9853669
- 样本数：131438
validation
- 字节数：608046
- 样本数：7801
test
- 字节数：608046
- 样本数：7801

light_inferred

train
- 字节数：10931355
- 样本数：106122
validation
- 字节数：679692
- 样本数：6362
test
- 字节数：1375745
- 样本数：12765

opensubtitles_inferred

train
- 字节数：27966476
- 样本数：351036
validation
- 字节数：3363802
- 样本数：41957
test
- 字节数：3830528
- 样本数：49108

yelp_inferred

train
- 字节数：260582945
- 样本数：2577862
validation
- 字节数：324349
- 样本数：4492
test
- 字节数：53887700
- 样本数：534460

搜集汇总

数据集介绍

构建方式

该数据集的构建基于一个通用框架，该框架将文本中的性别偏见分解为几个语用和语义维度：被谈论者的性别偏见、被对话者的性别偏见以及说话者的性别偏见。数据集包含七个大规模的自动性别信息标注数据集（原项目中有八个，但HuggingFace版本中不包括Wikipedia数据集），一个众包的语句级性别重写评估基准，一个英文中的性别化名字列表，以及一个性别化单词列表。

特点

数据集的特点在于其多维度的性别偏见分类，涵盖了被谈论者、被对话者和说话者三个主要维度。此外，数据集还包括了性别化名字和单词的列表，以及一个众包的性别重写评估基准，这些都为研究性别偏见提供了丰富的资源。

使用方法

数据集可用于训练和评估性别偏见分类模型。用户可以通过加载不同的配置（如new_data、funpedia、image_chat等）来访问不同类型的数据。每个配置包含特定的数据字段和标签，用户可以根据需要选择合适的配置进行模型训练和测试。

背景与挑战

背景概述

多维性别偏见分类数据集（Multi-Dimensional Gender Bias Classification Dataset）由Facebook研究团队创建，旨在通过分解文本中的性别偏见，涵盖多个语用和语义维度，包括被谈论者的性别偏见、被对话者的性别偏见以及说话者的性别偏见。该数据集包含七个大规模自动标注性别信息的语料库，一个众包评估基准，以及英语中的性别词汇和名字列表。其核心研究问题在于如何准确识别和分类文本中的多维性别偏见，这对于推动自然语言处理领域的公平性和包容性具有重要意义。

当前挑战

该数据集面临的挑战包括：1) 如何准确区分和标注不同语境下的性别偏见，尤其是在多语言和跨文化背景下；2) 数据集的构建过程中，如何确保标注的一致性和可靠性，特别是在众包和机器生成标注的情况下；3) 如何处理和减少数据中的潜在偏见，以确保模型的公平性和无偏性。此外，数据集的使用还需考虑其对社会的影响，特别是在性别平等和多样性方面的潜在影响。

常用场景

经典使用场景

该数据集的经典使用场景主要集中在性别偏见的多维度分类任务上。通过分析文本中的性别信息，研究者可以识别和量化文本中存在的性别偏见，包括说话者、被提及者和对话对象的性别偏见。这种多维度的分析有助于深入理解性别偏见在不同语境中的表现形式。

衍生相关工作

基于该数据集，研究者已经开发了多种性别偏见检测和纠正模型，这些模型在多个自然语言处理任务中表现出色。此外，该数据集还激发了关于如何更全面地理解和处理性别偏见的讨论，推动了相关领域的研究进展。

数据集最近研究