BigScienceBiasEval/bias-shades
Source: Hugging Face. Updated 2024-04-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BigScienceBiasEval/bias-shades
---
license: cc-by-sa-4.0
language:
- ar
- en
- fr
- de
- hi
- ru
- es
- ta
---
Possibly a placeholder dataset for the original here: https://huggingface.co/datasets/bigscience-catalogue-data/bias-shades
# Data Statement for SHADES
> **How to use this document:**
> Fill in each section according to the instructions. Give as much detail as you can, but there's no need to extrapolate. The goal is to help people understand your data when they approach it. This could be someone looking at it in ten years, or it could be you yourself looking back at the data in two years.
> For full details, the best source is the original Data Statements paper, here: https://www.aclweb.org/anthology/Q18-1041/ .
> Instruction fields are given as blockquotes; delete the instructions when you're done, and provide the file with your data, for example as "DATASTATEMENT.md". The lists in some blocks are designed to be filled in, but it's good to also leave a written description of what's happening, as well as the list. It's fine to skip some fields if the information isn't known.
> Only blockquoted content should be deleted; the final about statement should be left intact.
Data set name: Bias-Shades
Citation (if available): TODO.
Data set developer(s): This dataset was compiled by dozens of research scientists through the BigScience open science collaboration. Collaborators, representing numerous cultures and languages, joined the project of their own volition.
Data statement author(s): Shayne Longpre, Aurélie Névéol, Shanya Sharma [Add name here if you add/edit the data statement :)].
Others who contributed to this document: N/A
License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0).
## A. CURATION RATIONALE
> *Explanation.* Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.
The sentences in this dataset were hand-crafted by native speakers from each targeted culture. An initial set of sentences was inferred from stereotypes expressed in the CrowS-Pairs dataset (Nangia et al.). Native speakers first crafted templates for sentences expressing a stereotype. These templates are marked for the gender and plurality of the target nouns, so a template can be reused by substituting different targets. Next, each template-target noun combination was annotated for the veracity/reliability of the expressed stereotype. The resulting sentences express common and less common stereotypes in a variety of cultures and languages.
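The template-and-target procedure described above can be illustrated with a minimal sketch. The `Template` and `Target` classes, the `{target}` slot syntax, the gender and number codes, and the example strings are all hypothetical and chosen for illustration; they are not the dataset's actual schema or content.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Target:
    noun: str
    gender: str  # hypothetical codes: "f", "m", or "n" for unmarked
    number: str  # hypothetical codes: "sg" or "pl"


@dataclass(frozen=True)
class Template:
    text: str              # contains a single "{target}" slot
    gender: Optional[str]  # required gender of the target; None accepts any
    number: str            # required number of the target


def fill(template: Template, targets: List[Target]) -> List[str]:
    """Return one sentence per target that agrees with the template's marking."""
    sentences = []
    for target in targets:
        if target.number != template.number:
            continue  # number mismatch, e.g. a singular noun in a plural template
        if template.gender is not None and target.gender != template.gender:
            continue  # gender mismatch
        sentences.append(template.text.format(target=target.noun))
    return sentences


# Illustrative, deliberately harmless placeholder content.
template = Template(text="{target} are always late.", gender=None, number="pl")
targets = [
    Target("Engineers", "n", "pl"),
    Target("Teachers", "n", "pl"),
    Target("A pilot", "n", "sg"),
]
sentences = fill(template, targets)
print(sentences)
```

Only the two plural targets are substituted; the singular one is skipped because it does not agree with the template. In the procedure the data statement describes, each resulting template-target combination would then be annotated by a native speaker.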
## B. LANGUAGE VARIETY/VARIETIES
> *Explanation.* Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").
* BCP-47 language tags: en-US, fr-FR, hi-IN, es-DO, ar-LY, ru-RU, de-DE, nl-NL, ta-IN.
* Language variety description: English as spoken by native speakers from the United States, French by native speakers from metropolitan France, Hindi and Tamil by native speakers from India, Spanish by speakers from the Dominican Republic, Arabic by speakers from Libya, Russian by speakers from Russia, German by speakers from Germany, and Dutch by speakers from the Netherlands.
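As a quick consistency check, the tags listed above can be split into language and region subtags and compared with the `language:` list in the card's metadata. The sketch below assumes the simple two-part `language-REGION` form used in this list; it is not a general BCP-47 parser and would not handle tags with script subtags such as `yue-Hant-HK`. The function and variable names are illustrative.

```python
# Tags copied from the list above.
tags = ["en-US", "fr-FR", "hi-IN", "es-DO", "ar-LY", "ru-RU", "de-DE", "nl-NL", "ta-IN"]

# Language codes copied from the `language:` field of the card metadata.
metadata_languages = {"ar", "en", "fr", "de", "hi", "ru", "es", "ta"}


def split_tag(tag):
    """Split a two-part 'language-REGION' tag into (language, region)."""
    language, region = tag.split("-")
    return language.lower(), region.upper()


parsed = [split_tag(tag) for tag in tags]
tag_languages = {language for language, _ in parsed}

# Languages named in the data statement but absent from the metadata list.
missing_from_metadata = sorted(tag_languages - metadata_languages)
print(missing_from_metadata)  # → ['nl']
```

The check shows that Dutch (`nl-NL`) appears in the data statement but not in the metadata language list, which is worth keeping in mind when filtering by language.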
## C. CONTRIBUTOR DEMOGRAPHIC
> ## C. SPEAKER DEMOGRAPHIC
> *Explanation.* Sociolinguistics has found that variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker demographic characteristics (Labov, 1966), as speakers use linguistic variation to construct and project identities (Eckert and Rickford, 2001). Transfer from native languages (L1) can affect the language produced by non-native (L2) speakers (Ellis, 1994, Ch. 8). A further important type of variation is disordered speech (e.g., dysarthria). Specifications include:
Participants in the collection project were recruited through the HuggingFace BigScience project, specifically the Bias and Fairness Evaluation group. They are listed below.
Speakers:
* [ADD YOURSELF!]
* Shayne Longpre: English-speaking, male, 28 years old, culturally Canadian.
* Aurélie Névéol: French (native), English and Spanish speaking, female, 44 years old, culturally French (also familiar with American culture)
* Shanya Sharma: Hindi (native) and English speaking, female, 24 years old, culturally Indian.
* Margaret Mitchell: English, female, mid-30s, U.S.A.
* Maraim Masoud: Arabic and English speaking, female.
* Arjun Subramonian: English, Spanish, Tamil, non-binary, early-20s, USA, culturally Indian-American
## D. ANNOTATOR DEMOGRAPHIC
> *Explanation.* What are the demographic characteristics of the annotators and annotation guideline developers? Their own “social address” influences their experience with language and thus their perception of what they are annotating. Specifications include:
Participants in the collection project were recruited through the HuggingFace BigScience project, specifically the Bias and Fairness Evaluation group. Speaker and annotator contributors are listed in Section C.
## E. SPEECH SITUATION
N/A
## F. TEXT CHARACTERISTICS
> *Explanation.* Both genre and topic influence the vocabulary and structural characteristics of texts (Biber, 1995), and should be specified.
The collected data consists of offensive stereotyped statements in numerous languages and cultures. These statements may be upsetting and/or offensive.
Along with these stereotyped statements are annotation judgements of how prevalent/real the expressed stereotypes are in the real world. Some statements were created from templates with substituted target nouns, and therefore may express an uncommon or unlikely stereotype.
## G. RECORDING QUALITY
N/A
## H. OTHER
> *Explanation.* There may be other information of relevance as well. Please use this space to develop any further categories that are relevant for your dataset.
## I. PROVENANCE APPENDIX
This initiative is part of the BigScience Workshop: https://bigscience.huggingface.co/.
## About this document
A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.
Data Statements are from the University of Washington. Contact: [datastatements@uw.edu](mailto:datastatements@uw.edu). This document template is licensed as [CC0](https://creativecommons.org/share-your-work/public-domain/cc0/).
This version of the markdown Data Statement is from June 4th 2020. The Data Statement template is based on worksheets distributed at the [2020 LREC workshop on Data Statements](https://sites.google.com/uw.edu/data-statements-for-nlp/), by Emily M. Bender, Batya Friedman, and Angelina McMillan-Major. Adapted to community Markdown template by Leon Derczynski.
Provider:
BigScienceBiasEval
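Background and challenges
Background overview
This is a multilingual text dataset containing 35,462 rows of hand-crafted stereotype sentences spanning multiple languages and cultures, intended for bias and fairness research. The data is in CSV format and includes annotations and evaluations of the stereotypes.
The above summary was collected and generated by 遇见数据集.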



