Name: WillisBack/dataset-financial-user-claim
Creator: WillisBack
Published: 2026-04-06 11:40:48
License: No description provided

Download link:

https://hf-mirror.com/datasets/WillisBack/dataset-financial-user-claim


Resource description:

---
language: en
license: apache-2.0
tags:
  - consumer-finance
  - complaints
  - text-classification
  - cfpb
  - nlp
size_categories:
  - 100K<n<1M
task_categories:
  - text-classification
configs:
  - config_name: default
    data_files:
      - split: train
        path: train.parquet
      - split: test
        path: test.parquet
---

# Consumer Finance Complaints — 11 Consolidated Labels

A cleaned, deduplicated, and label-consolidated version of the [US CFPB Consumer Complaints dataset](https://www.consumerfinance.gov/data-research/consumer-complaints/) for **multi-class text classification**.

## Context

This dataset was prepared as part of **Project 12, "Compare AI Algorithms: Machine Learning vs. LLM"**, of the [OpenClassrooms AI Developer certification](https://openclassrooms.com/) by **William Derue**.

The project scenario involves **ZenAssist**, a customer support platform serving 200+ companies that wants to automatically label incoming consumer complaints to route them to the correct support department.

### Source

The raw dataset was provided by OpenClassrooms:

```
https://s3.eu-west-1.amazonaws.com/course.oc-static.com/projects/2464_Développeur_IA/P12/dataset.csv
```

## Dataset Description

Each row contains a consumer complaint text and its product category label.

| Column  | Type         | Description                                                 |
| ------- | ------------ | ----------------------------------------------------------- |
| `text`  | `string`     | The consumer complaint narrative (variable length, English) |
| `label` | `ClassLabel` | Product category, one of 11 consolidated labels             |

### Splits

| Split   |    Rows | Usage      |
| ------- | ------: | ---------- |
| `train` | 293,698 | Training   |
| `test`  |  73,425 | Evaluation |

Total: **367,123 rows** after cleaning.
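As a quick sanity check on the figures in the Splits table, the sketch below recomputes the total row count and the share of each split. It uses only the row counts reported in this card; the `split_fractions` helper is a hypothetical name introduced for illustration.

```python
# Row counts as reported in the Splits table of this card.
SPLIT_ROWS = {"train": 293_698, "test": 73_425}


def split_fractions(rows):
    """Return the total row count and each split's share of that total."""
    total = sum(rows.values())
    return total, {name: n / total for name, n in rows.items()}


total, fractions = split_fractions(SPLIT_ROWS)
print(total)  # 367123
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The two splits add up to the stated 367,123 rows, and the test share comes out at roughly 20%.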

### Label Distribution (full dataset)

| Label                   |   Count |     % |
| ----------------------- | ------: | ----: |
| Credit reporting        | 110,293 | 30.0% |
| Debt collection         |  84,318 | 23.0% |
| Mortgage                |  52,945 | 14.4% |
| Credit card             |  41,498 | 11.3% |
| Bank account            |  27,714 |  7.5% |
| Student loan            |  21,781 |  5.9% |
| Consumer Loan           |   9,443 |  2.6% |
| Money transfer          |   6,953 |  1.9% |
| Payday loan             |   6,155 |  1.7% |
| Vehicle loan or lease   |   5,720 |  1.6% |
| Other financial service |     303 |  0.1% |

## Label Consolidation

The original CFPB dataset contains **18 raw product tags** with significant semantic overlap. We consolidated them into **11 categories**:

| Original Label                                                               | Consolidated Label          |
| ---------------------------------------------------------------------------- | --------------------------- |
| Credit reporting, credit repair services, or other personal consumer reports | **Credit reporting**        |
| Credit reporting                                                             | **Credit reporting**        |
| Credit card                                                                  | **Credit card**             |
| Credit card or prepaid card                                                  | **Credit card**             |
| Prepaid card                                                                 | **Credit card**             |
| Bank account or service                                                      | **Bank account**            |
| Checking or savings account                                                  | **Bank account**            |
| Payday loan                                                                  | **Payday loan**             |
| Payday loan, title loan, or personal loan                                    | **Payday loan**             |
| Money transfer, virtual currency, or money service                           | **Money transfer**          |
| Money transfers                                                              | **Money transfer**          |
| Mortgage                                                                     | **Mortgage**                |
| Debt collection                                                              | **Debt collection**         |
| Student loan                                                                 | **Student loan**            |
| Consumer Loan                                                                | **Consumer Loan**           |
| Vehicle loan or lease                                                        | **Vehicle loan or lease**   |
| Other financial service                                                      | **Other financial service** |
| Virtual currency                                                             | **Other financial service** |

### Rationale

- **Semantic deduplication**: "Credit card" and "Credit card or prepaid card" describe the same product family.
- **Virtual currency** (9 raw samples) was merged into **Other financial service** because it is too rare to learn as a standalone class.
- This consolidation reduces label ambiguity and is intended to improve classifier performance without losing meaningful distinctions.

## Preprocessing

1. **Text column selection**: Used `Consumer Claim` (or `Customer Claim`) as the text field.
2. **Cleaning**: Stripped whitespace, removed empty rows and NaN values.
3. **Deduplication**: Dropped duplicate `(text, label)` pairs, leaving 367,123 unique rows.
4. **Label mapping**: Applied the consolidation map above.
5. **Split**: 80/20 stratified train/test split (seed=3407).

## Usage

```python
from datasets import load_dataset

ds = load_dataset("WillisBack/dataset-financial-user-claim")
print(ds)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 293698})
#     test: Dataset({features: ['text', 'label'], num_rows: 73425})
# })

# Access a sample
print(ds["train"][0])
# {'text': 'I was charged an overdraft fee...', 'label': 0}

# Decode label
print(ds["train"].features["label"].int2str(0))
# 'Bank account'
```

## Associated Model

This dataset was used to fine-tune [WillisBack/modernbert-large-consumer-finance-11cls](https://huggingface.co/WillisBack/modernbert-large-consumer-finance-11cls), a ModernBERT-large encoder achieving **F1 macro 0.61** and **78.2% accuracy**.

## Citation

```bibtex
@misc{derue2026cfpb11cls,
  author    = {Derue, William},
  title     = {Consumer Finance Complaints Dataset — 11 Consolidated Labels},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/WillisBack/dataset-financial-user-claim}
}
```
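
## Preprocessing sketch

The Label Consolidation table and the Preprocessing steps in this card can be approximated with the sketch below. It is not the author's actual pipeline, which is not included in the card: it uses only the Python standard library, the function names `preprocess` and `stratified_split` are hypothetical, and the per-class shuffle is an assumed way of stratifying that will not reproduce the published split even with the same seed. Only the mapping itself and the 80/20 ratio and seed come from the card.

```python
import random
from collections import defaultdict

# Consolidation map copied from the Label Consolidation table (18 raw tags -> 11 labels).
LABEL_MAP = {
    "Credit reporting, credit repair services, or other personal consumer reports": "Credit reporting",
    "Credit reporting": "Credit reporting",
    "Credit card": "Credit card",
    "Credit card or prepaid card": "Credit card",
    "Prepaid card": "Credit card",
    "Bank account or service": "Bank account",
    "Checking or savings account": "Bank account",
    "Payday loan": "Payday loan",
    "Payday loan, title loan, or personal loan": "Payday loan",
    "Money transfer, virtual currency, or money service": "Money transfer",
    "Money transfers": "Money transfer",
    "Mortgage": "Mortgage",
    "Debt collection": "Debt collection",
    "Student loan": "Student loan",
    "Consumer Loan": "Consumer Loan",
    "Vehicle loan or lease": "Vehicle loan or lease",
    "Other financial service": "Other financial service",
    "Virtual currency": "Other financial service",
}


def preprocess(rows):
    """Clean, deduplicate, and relabel an iterable of (text, raw_label) pairs."""
    seen = set()
    cleaned = []
    for text, raw_label in rows:
        # Skip missing values (None, or float NaN coming from a CSV reader).
        if not isinstance(text, str) or not isinstance(raw_label, str):
            continue
        text = text.strip()
        if not text:
            continue
        # Deduplicate on (text, raw label), before label mapping, as in steps 3 and 4.
        key = (text, raw_label)
        if key in seen:
            continue
        seen.add(key)
        # Raises KeyError for a tag that is not in the consolidation table.
        cleaned.append((text, LABEL_MAP[raw_label]))
    return cleaned


def stratified_split(rows, test_size=0.2, seed=3407):
    """Split (text, label) rows so each label keeps about the same test share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


raw = [
    ("  I was charged an overdraft fee  ", "Checking or savings account"),
    ("I was charged an overdraft fee", "Checking or savings account"),  # duplicate once stripped
    ("", "Mortgage"),    # empty text, dropped
    (None, "Mortgage"),  # missing text, dropped
    ("My prepaid card was declined", "Prepaid card"),
]
print(preprocess(raw))
```

On real data, `stratified_split(preprocess(rows))` would return the train and test lists. A library routine such as scikit-learn's `train_test_split` with its `stratify` argument is the more common choice in practice.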

Application scenarios: