IsGarrido/gender-classifier-dataset-en
Source: Hugging Face. Updated 2025-12-07. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/IsGarrido/gender-classifier-dataset-en
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- gender-classification
- synthetic
- distillation
- c4
size_categories:
- 10K<n<100K
---
# Gender Classification Dataset (English)
## Dataset Description
This dataset contains 30,000 English text samples, each labeled with the gender of its grammatical subject.
It was generated using a **Teacher-Student distillation** process.
- **Repository:** [IsGarrido/gender-classifier-dataset-en](https://huggingface.co/IsGarrido/gender-classifier-dataset-en)
- **Language:** English
- **Labels:** `male`, `female`, `neutral`
## Creation Process
1. **Source:** Sentences were streamed from the `allenai/c4` (Colossal Clean Crawled Corpus) dataset.
2. **Filtering:** Only sentences between 50 and 200 characters long were kept.
3. **Labeling (Teacher):** A local LLM (`mistralai/magistral-small-2509`) running via LM Studio analyzed each sentence.
4. **Prompt:** "Analyze the following sentence and identify the gender of the SUBJECT... Return ONLY one word: 'Male', 'Female', or 'Neutral'."
5. **Balancing:** Generation was strictly controlled to produce an even split across the three classes (roughly one third each).
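
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's actual generation script: the `teacher` callable is a hypothetical stand-in for the LM Studio call, and the helper names, the inclusive length bounds, the per-class quota, and the choice to discard unexpected teacher replies are all assumptions. It also shows one way the capitalized replies requested by the prompt ('Male', 'Female', 'Neutral') could end up as the lowercase labels stored in the dataset.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Optional

LABELS = ("male", "female", "neutral")


def length_ok(sentence: str, lo: int = 50, hi: int = 200) -> bool:
    """Keep sentences whose character count lies in [lo, hi] (bounds assumed inclusive)."""
    return lo <= len(sentence) <= hi


def normalize_label(raw: str) -> Optional[str]:
    """Map a teacher reply such as 'Male' or ' Female.' to a lowercase label.

    Returns None for any reply that is not one of the three expected words.
    """
    cleaned = raw.strip().strip(".'\"").lower()
    return cleaned if cleaned in LABELS else None


def build_balanced(
    sentences: Iterable[str],
    teacher: Callable[[str], str],  # hypothetical: wraps the local LLM call
    per_class: int,
) -> Iterator[dict]:
    """Yield {"text", "label"} rows until every class has `per_class` rows."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Stop streaming once all three class quotas are filled.
        if all(counts[label] >= per_class for label in LABELS):
            break
        if not length_ok(sentence):
            continue
        label = normalize_label(teacher(sentence))
        # Skip unusable replies and classes that are already full.
        if label is None or counts[label] >= per_class:
            continue
        counts[label] += 1
        yield {"text": sentence, "label": label}
```

Capping each class at the same quota is one simple way to obtain an even split from a stream whose natural class distribution is skewed; the cost is that more source sentences must be labeled than are kept.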
## Data Structure
- `text`: The sentence (string).
- `label`: The gender class (string: "male", "female", "neutral").
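
As a quick illustration of this schema, the sketch below checks that a row has the two fields with the expected types and values. The `validate_record` helper and the example sentence are hypothetical; neither comes from the dataset or its tooling.

```python
ALLOWED_LABELS = {"male", "female", "neutral"}


def validate_record(record: dict) -> bool:
    """Return True if the record has a string `text` and a known string `label`."""
    text = record.get("text")
    label = record.get("label")
    return isinstance(text, str) and isinstance(label, str) and label in ALLOWED_LABELS


# Made-up row in the documented shape (not an actual dataset sample).
example = {"text": "She walked to the station before the morning train left.", "label": "female"}
print(validate_record(example))
```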
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("IsGarrido/gender-classifier-dataset-en")
print(dataset["train"][0])
```
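
Because `label` is stored as a string, training a classifier typically needs an integer encoding first. The following is a minimal sketch of that step, plus a check of the class balance; the `label_id` field name, the id order, and both helper functions are assumptions made for illustration, not part of the dataset.

```python
from collections import Counter
from typing import Iterable

# Assumed id order; any fixed mapping works as long as it is used consistently.
LABEL2ID = {"male": 0, "female": 1, "neutral": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_label(example: dict) -> dict:
    """Return a copy of the row with an added integer `label_id` field."""
    return {**example, "label_id": LABEL2ID[example["label"]]}


def class_distribution(rows: Iterable[dict]) -> dict:
    """Return the fraction of rows per label (all zeros for an empty input)."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABEL2ID}
    return {label: counts[label] / total for label in LABEL2ID}
```

With the dataset loaded as shown above, `dataset["train"].map(encode_label)` would add the `label_id` column, and `class_distribution(dataset["train"])` should report roughly one third per class if the balancing described in the creation process holds.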
Provider:
IsGarrido



