2024-mcm-everitt-ryan/possible-bias-classified
Source: Hugging Face. Updated 2024-05-28; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/2024-mcm-everitt-ryan/possible-bias-classified
Resource description:
---
dataset_info:
  features:
  - name: document_id
    dtype: string
  - name: phrase_index
    dtype: int64
  - name: country
    dtype: string
  - name: company
    dtype: string
  - name: position
    dtype: string
  - name: terms_age
    dtype: string
  - name: terms_age_count
    dtype: int64
  - name: terms_disability
    dtype: string
  - name: terms_disability_count
    dtype: int64
  - name: terms_general
    dtype: string
  - name: terms_general_count
    dtype: int64
  - name: terms_masculine
    dtype: string
  - name: terms_masculine_count
    dtype: int64
  - name: terms_feminine
    dtype: string
  - name: terms_feminine_count
    dtype: int64
  - name: terms_racial
    dtype: string
  - name: terms_racial_count
    dtype: int64
  - name: terms_sexuality
    dtype: string
  - name: terms_sexuality_count
    dtype: int64
  - name: phrase_word_count
    dtype: int64
  - name: phrase
    dtype: string
  - name: html
    dtype: string
  - name: model_score_age
    dtype: float64
  - name: model_score_disability
    dtype: float64
  - name: model_score_general
    dtype: float64
  - name: model_score_masculine
    dtype: float64
  - name: model_score_feminine
    dtype: float64
  - name: model_score_racial
    dtype: float64
  - name: model_score_sexuality
    dtype: float64
  splits:
  - name: train
    num_bytes: 9772718976
    num_examples: 1700000
  download_size: 3115716891
  dataset_size: 9772718976
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
This dataset is designed for analyzing the distribution and frequency of terms related to age, disability, gender (masculine and feminine), race, and sexuality, plus a general category, across documents. Each row is identified by a document ID and a phrase index and records the country, company, and position. For each of the seven categories, a row carries the matched terms, a count of those terms, and a model score. Rows also hold the phrase text, its word count, and an HTML-formatted version of the content. The dataset has a single train split of 1.7 million examples totaling 9,772,718,976 bytes (about 9.77 GB); the download size is 3,115,716,891 bytes (about 3.12 GB).
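As a minimal sketch of how the per-category score columns might be used, the Python below picks out the categories whose `model_score_*` value meets a cutoff. The helper names `flagged_categories` and `stream_rows`, the 0.5 threshold, and the reading of a higher score as a stronger signal are assumptions for illustration; the card does not document the score scale. `stream_rows` uses the `datasets` library's `load_dataset` with `streaming=True` so that the split need not be downloaded in full first.

```python
from typing import Any, Dict, Iterator, List

# Category suffixes taken from the model_score_* columns in the dataset card.
CATEGORIES = ["age", "disability", "general", "masculine",
              "feminine", "racial", "sexuality"]


def flagged_categories(row: Dict[str, Any], threshold: float = 0.5) -> List[str]:
    """Return categories whose model score is at or above `threshold`.

    The 0.5 default is an illustrative assumption, not a documented cutoff.
    """
    flagged = []
    for cat in CATEGORIES:
        score = row.get(f"model_score_{cat}")
        if score is not None and score >= threshold:
            flagged.append(cat)
    return flagged


def stream_rows() -> Iterator[Dict[str, Any]]:
    """Lazily iterate over the train split without a full download."""
    # Imported here so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return iter(load_dataset(
        "2024-mcm-everitt-ryan/possible-bias-classified",
        split="train",
        streaming=True,
    ))


# Synthetic row for illustration only; the values are not from the dataset.
example = {"model_score_age": 0.91, "model_score_masculine": 0.12,
           "model_score_racial": 0.50}
print(flagged_categories(example))
```

With real data, `flagged_categories` could be applied to each row yielded by `stream_rows()`; a suitable threshold would have to be chosen after inspecting the actual score distribution.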
Provider:
2024-mcm-everitt-ryan
Summary of original information
Dataset overview
Dataset features
- document_id: string
- phrase_index: integer (int64)
- country: string
- company: string
- position: string
- terms_age: string
- terms_age_count: integer (int64)
- terms_disability: string
- terms_disability_count: integer (int64)
- terms_general: string
- terms_general_count: integer (int64)
- terms_masculine: string
- terms_masculine_count: integer (int64)
- terms_feminine: string
- terms_feminine_count: integer (int64)
- terms_racial: string
- terms_racial_count: integer (int64)
- terms_sexuality: string
- terms_sexuality_count: integer (int64)
- phrase_word_count: integer (int64)
- phrase: string
- html: string
- model_score_age: float (float64)
- model_score_disability: float (float64)
- model_score_general: float (float64)
- model_score_masculine: float (float64)
- model_score_feminine: float (float64)
- model_score_racial: float (float64)
- model_score_sexuality: float (float64)
Dataset splits
- train:
  - Size: 9772718976 bytes
  - Examples: 1700000
Dataset size
- Download size: 3115716891 bytes
- Total dataset size: 9772718976 bytes
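The byte counts above are easier to reason about after conversion. The short Python sketch below derives decimal gigabytes, binary gibibytes, the average size per example, and the ratio of download size to dataset size. The variable names are illustrative; the three input numbers are the ones listed above.

```python
NUM_BYTES = 9_772_718_976       # total dataset size (train split) in bytes
DOWNLOAD_BYTES = 3_115_716_891  # download size in bytes
NUM_EXAMPLES = 1_700_000        # number of train examples

size_gb = NUM_BYTES / 1e9                      # decimal gigabytes
size_gib = NUM_BYTES / 2**30                   # binary gibibytes
download_gb = DOWNLOAD_BYTES / 1e9
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES    # download size relative to dataset size

print(f"dataset:  {size_gb:.2f} GB ({size_gib:.2f} GiB)")
print(f"download: {download_gb:.2f} GB ({download_ratio:.0%} of dataset size)")
print(f"average:  {avg_bytes_per_example:.0f} bytes per example")
```

The download is roughly a third of the dataset size, and each example averages a little under 6 KB.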
Configuration
- config_name: default
- data_files:
  - split: train
  - path: data/train-*
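To illustrate what the `data/train-*` path pattern selects, the sketch below applies it to a few file names with Python's `fnmatch.fnmatchcase`. The file names are hypothetical, since the card does not list the actual shards, and `fnmatchcase` only approximates the glob matching a dataset loader would perform (here `*` also matches `/`).

```python
from fnmatch import fnmatchcase

PATTERN = "data/train-*"  # path pattern from the default config

# Hypothetical file names; the actual shard names are not listed in the card.
candidates = [
    "data/train-00000-of-00020.parquet",
    "data/train-00001-of-00020.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

# Keep only the names that the train split's pattern would pick up.
train_files = [name for name in candidates if fnmatchcase(name, PATTERN)]
print(train_files)
```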



