WillisBack/dataset-financial-user-claim
Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/WillisBack/dataset-financial-user-claim
Resource description:
---
language: en
license: apache-2.0
tags:
- consumer-finance
- complaints
- text-classification
- cfpb
- nlp
size_categories:
- 100K<n<1M
task_categories:
- text-classification
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: test
    path: test.parquet
---
# Consumer Finance Complaints — 11 Consolidated Labels
A cleaned, deduplicated, and label-consolidated version of the [US CFPB Consumer Complaints dataset](https://www.consumerfinance.gov/data-research/consumer-complaints/) for **multi-class text classification**.
## Context
This dataset was prepared as part of **Project 12 — "Compare AI Algorithms: Machine Learning vs. LLM"** of the [OpenClassrooms AI Developer certification](https://openclassrooms.com/) by **William Derue**.
The project scenario involves **ZenAssist**, a customer support platform serving 200+ companies that wants to automatically label incoming consumer complaints to route them to the correct support department.
### Source
The raw dataset was provided by OpenClassrooms:
```
https://s3.eu-west-1.amazonaws.com/course.oc-static.com/projects/2464_Développeur_IA/P12/dataset.csv
```
## Dataset Description
Each row contains a consumer complaint text and its product category label.
| Column | Type | Description |
| ------- | ------------ | ----------------------------------------------------------- |
| `text` | `string` | The consumer complaint narrative (variable length, English) |
| `label` | `ClassLabel` | Product category — one of 11 consolidated labels |
### Splits
| Split | Rows | Usage |
| ------- | ------: | ---------- |
| `train` | 293,698 | Training |
| `test` | 73,425 | Evaluation |
Total: **367,123 rows** after cleaning.
### Label Distribution (full dataset)
| Label | Count | % |
| ----------------------- | ------: | ----: |
| Credit reporting | 110,293 | 30.0% |
| Debt collection | 84,318 | 23.0% |
| Mortgage | 52,945 | 14.4% |
| Credit card | 41,498 | 11.3% |
| Bank account | 27,714 | 7.5% |
| Student loan | 21,781 | 5.9% |
| Consumer Loan | 9,443 | 2.6% |
| Money transfer | 6,953 | 1.9% |
| Payday loan | 6,155 | 1.7% |
| Vehicle loan or lease | 5,720 | 1.6% |
| Other financial service | 303 | 0.1% |
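
The percentages in this table follow directly from the counts. As a quick check, the sketch below recomputes them in plain Python. The counts are copied from the table; the variable names are illustrative and not part of the dataset.

```python
# Label counts copied from the distribution table.
LABEL_COUNTS = {
    "Credit reporting": 110_293,
    "Debt collection": 84_318,
    "Mortgage": 52_945,
    "Credit card": 41_498,
    "Bank account": 27_714,
    "Student loan": 21_781,
    "Consumer Loan": 9_443,
    "Money transfer": 6_953,
    "Payday loan": 6_155,
    "Vehicle loan or lease": 5_720,
    "Other financial service": 303,
}

total = sum(LABEL_COUNTS.values())

# Share of each label as a percentage of all rows.
shares = {label: 100 * count / total for label, count in LABEL_COUNTS.items()}

for label, share in shares.items():
    print(f"{label:<25} {LABEL_COUNTS[label]:>8,} {share:5.1f}%")

# Ratio between the most and least frequent labels, a rough measure of imbalance.
imbalance_ratio = max(LABEL_COUNTS.values()) / min(LABEL_COUNTS.values())
print(f"total rows: {total:,}; largest/smallest ratio: {imbalance_ratio:.0f}:1")
```

The largest class is roughly 364 times the size of the smallest, so macro-averaged metrics and stratified sampling are worth considering when training on this data.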
## Label Consolidation
The original CFPB dataset contains **18 raw product tags** with significant semantic overlap. We consolidated them into **11 categories**:
| Original Label | Consolidated Label |
| ---------------------------------------------------------------------------- | --------------------------- |
| Credit reporting, credit repair services, or other personal consumer reports | **Credit reporting** |
| Credit reporting | **Credit reporting** |
| Credit card | **Credit card** |
| Credit card or prepaid card | **Credit card** |
| Prepaid card | **Credit card** |
| Bank account or service | **Bank account** |
| Checking or savings account | **Bank account** |
| Payday loan | **Payday loan** |
| Payday loan, title loan, or personal loan | **Payday loan** |
| Money transfer, virtual currency, or money service | **Money transfer** |
| Money transfers | **Money transfer** |
| Mortgage | **Mortgage** |
| Debt collection | **Debt collection** |
| Student loan | **Student loan** |
| Consumer Loan | **Consumer Loan** |
| Vehicle loan or lease | **Vehicle loan or lease** |
| Other financial service | **Other financial service** |
| Virtual currency | **Other financial service** |
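
For reuse in code, the table above can be written as a plain dictionary. This is a minimal sketch: the keys and values are copied from the table, while the names `CONSOLIDATION_MAP` and `consolidate` are hypothetical and not taken from the original project.

```python
# Raw CFPB product tag -> consolidated label, transcribed from the table above.
CONSOLIDATION_MAP = {
    "Credit reporting, credit repair services, or other personal consumer reports": "Credit reporting",
    "Credit reporting": "Credit reporting",
    "Credit card": "Credit card",
    "Credit card or prepaid card": "Credit card",
    "Prepaid card": "Credit card",
    "Bank account or service": "Bank account",
    "Checking or savings account": "Bank account",
    "Payday loan": "Payday loan",
    "Payday loan, title loan, or personal loan": "Payday loan",
    "Money transfer, virtual currency, or money service": "Money transfer",
    "Money transfers": "Money transfer",
    "Mortgage": "Mortgage",
    "Debt collection": "Debt collection",
    "Student loan": "Student loan",
    "Consumer Loan": "Consumer Loan",
    "Vehicle loan or lease": "Vehicle loan or lease",
    "Other financial service": "Other financial service",
    "Virtual currency": "Other financial service",
}


def consolidate(raw_label: str) -> str:
    """Return the consolidated label; raise KeyError for an unknown raw tag."""
    return CONSOLIDATION_MAP[raw_label]


print(len(CONSOLIDATION_MAP), "raw tags ->", len(set(CONSOLIDATION_MAP.values())), "labels")
```

Raising on unknown tags, instead of silently passing them through, makes it obvious when a newer CFPB export introduces a product name that the map does not cover.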
### Rationale
- **Semantic deduplication**: "Credit card" and "Credit card or prepaid card" describe the same product family.
- **Virtual currency** (9 raw samples) was merged into **Other financial service** — too rare to learn as a standalone class.
- The consolidation reduces label ambiguity, which should help classifier performance, while preserving the distinctions between product families.
## Preprocessing
1. **Text column selection**: Used `Consumer Claim` (or `Customer Claim`) as the text field.
2. **Cleaning**: Stripped whitespace, removed empty rows and NaN values.
3. **Deduplication**: Dropped duplicate `(text, label)` pairs → 367,123 unique rows.
4. **Label mapping**: Applied the consolidation map above.
5. **Split**: 80/20 stratified train/test split (seed=3407).
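
The steps above can be sketched in dependency-free Python. The card does not say which tooling was used, so this is an illustration under stated assumptions and not the original script: the function names are hypothetical, rows are assumed to be `(text, raw_label)` pairs, and the per-class shuffle will not reproduce the exact published split even with the same seed.

```python
import random
from collections import defaultdict


def _is_missing(value):
    """True for None and for float NaN (NaN is the only value unequal to itself)."""
    return value is None or (isinstance(value, float) and value != value)


def preprocess(rows, label_map):
    """Clean, deduplicate, and relabel an iterable of (text, raw_label) pairs."""
    seen = set()
    cleaned = []
    for text, raw_label in rows:
        if _is_missing(text) or _is_missing(raw_label):
            continue                      # step 2: drop NaN / missing values
        text = str(text).strip()          # step 2: strip whitespace
        if not text:
            continue                      # step 2: drop empty rows
        key = (text, raw_label)
        if key in seen:
            continue                      # step 3: drop duplicate (text, label) pairs
        seen.add(key)
        cleaned.append((text, label_map[raw_label]))  # step 4: label mapping
    return cleaned


def stratified_split(pairs, test_fraction=0.2, seed=3407):
    """Step 5: split each label group separately so both splits keep the label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair[1]].append(pair)
    train, test = [], []
    for label in sorted(by_label):        # sorted for a deterministic order
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Tiny made-up example, not real complaint data.
demo_rows = [
    ("  late fee charged ", "Credit card"),
    ("late fee charged", "Credit card"),   # duplicate once whitespace is stripped
    ("", "Mortgage"),                      # empty text
    (None, "Mortgage"),                    # missing text
    ("card declined", "Prepaid card"),
]
demo_map = {"Credit card": "Credit card", "Prepaid card": "Credit card", "Mortgage": "Mortgage"}
demo_clean = preprocess(demo_rows, demo_map)
print(demo_clean)
```

For a dataset of this size, a DataFrame library and an off-the-shelf stratified splitter would be the more practical choice; the loop form is used here only to make each step explicit.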
## Usage
```python
from datasets import load_dataset

ds = load_dataset("WillisBack/dataset-financial-user-claim")
print(ds)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 293698})
#     test: Dataset({features: ['text', 'label'], num_rows: 73425})
# })

# Access a sample
print(ds["train"][0])
# {'text': 'I was charged an overdraft fee...', 'label': 0}

# Decode label
print(ds["train"].features["label"].int2str(0))
# 'Bank account'
```
## Associated Model
This dataset was used to fine-tune [WillisBack/modernbert-large-consumer-finance-11cls](https://huggingface.co/WillisBack/modernbert-large-consumer-finance-11cls), a ModernBERT-large encoder that reaches a **macro F1 of 0.61** and **78.2% accuracy**.
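
Macro F1 averages the per-class F1 scores with equal weight, so small classes count as much as large ones. On an imbalanced label set this can leave macro F1 well below accuracy. The sketch below shows the calculation on made-up toy labels; it does not use or reproduce the model's actual predictions, and the function name is illustrative.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro F1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_scores) / len(f1_scores)


# Toy example: 8 samples of a frequent class "A", 2 of a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]   # one of the two rare samples is missed

acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f}")
```

A single error on the rare class costs 10 points of accuracy here but drops that class's F1 to 2/3, which pulls the macro average down to about 0.80.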
## Citation
```bibtex
@misc{derue2026cfpb11cls,
author = {Derue, William},
title = {Consumer Finance Complaints Dataset — 11 Consolidated Labels},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/WillisBack/dataset-financial-user-claim}
}
```
Provider:
WillisBack



