neoyipeng/modernfinbert-training-v2

Name: neoyipeng/modernfinbert-training-v2
Creator: neoyipeng
Published: 2026-04-02 08:49:37
License: ODC-By

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/neoyipeng/modernfinbert-training-v2

Description:

---
language:
- en
license: odc-by
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- sentiment-classification
tags:
- finance
- sentiment-analysis
- financial-nlp
- financial-sentiment
- earnings-calls
- financial-news
- financial-tweets
pretty_name: ModernFinBERT Training Data v2
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: source
    dtype: string
  - name: source_domain
    dtype: string
  - name: label_confidence
    dtype: string
  - name: entity
    dtype: string
  - name: entity_sentiment
    dtype: string
  splits:
  - name: train
    num_examples: 34597
  - name: validation
    num_examples: 4325
  - name: test
    num_examples: 4325
---

# ModernFinBERT Training Data v2

A multi-source financial sentiment dataset for training financial sentiment models. Combines 5 source datasets spanning financial news, social media, analyst reports, and earnings call transcripts, with unified 3-class sentiment labels and agent-labeled entity-level sentiment annotations.

## Dataset Summary

| Property | Value |
|----------|-------|
| Total examples | 43,247 |
| Train / Val / Test | 34,597 / 4,325 / 4,325 |
| Labels | NEGATIVE, NEUTRAL, POSITIVE |
| Domains | Financial news, social media, analyst reports, earnings calls |
| Unique entities | 4,209 |
| Entity coverage | 59.2% |
| License | ODC-By (Open Data Commons Attribution) |

## Label Distribution

| Label | Count | Percentage |
|-------|-------|------------|
| NEUTRAL | 16,469 | 38.1% |
| POSITIVE | 14,677 | 33.9% |
| NEGATIVE | 12,101 | 28.0% |

## Source Composition

| Source | Dataset | Rows | Domain | Original License |
|--------|---------|------|--------|-----------------|
| `nosible` | [NOSIBLE/financial-sentiment](https://huggingface.co/datasets/NOSIBLE/financial-sentiment) | 15,000 | Financial news | ODC-By |
| `timkoornstra_tweets` | [TimKoornstra/financial-tweets-sentiment](https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment) | 15,000 | Social media | MIT |
| `financemteb_finsent` | [FinanceMTEB/FinSent](https://huggingface.co/datasets/FinanceMTEB/FinSent) | 9,929 | Analyst reports | Not specified |
| `subjectiveqa` | [gtfintechlab/SubjECTive-QA](https://huggingface.co/datasets/gtfintechlab/SubjECTive-QA) | 2,621 | Earnings calls | CC-BY 4.0 |
| `aiera_transcripts` | [Aiera/aiera-transcript-sentiment](https://huggingface.co/datasets/Aiera/aiera-transcript-sentiment) | 697 | Earnings calls | MIT |

Large sources (NOSIBLE, TimKoornstra) were capped at 15,000 rows via label-stratified downsampling (5,000 per class). Small sources were kept at their natural size without upsampling. No maximum text length cap was applied -- ModernBERT supports up to 8,192 tokens.

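
The label-stratified cap described above can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: the function name `stratified_cap`, the fixed seed, and the synthetic rows are all assumptions, since the card does not say which tool or random seed was used. With `per_class=5000` and three labels, the same logic would give the 15,000-row cap.

```python
import random
from collections import Counter, defaultdict


def stratified_cap(rows, per_class, label_key="label", seed=0):
    """Keep at most `per_class` rows per label, sampling without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    kept = []
    for label in sorted(by_label):
        group = by_label[label]
        # Only classes above the cap are downsampled; smaller classes pass through.
        if len(group) > per_class:
            group = rng.sample(group, per_class)
        kept.extend(group)
    return kept


# Synthetic rows standing in for one large source (not real data).
rows = (
    [{"text": f"up {i}", "label": "POSITIVE"} for i in range(12)]
    + [{"text": f"down {i}", "label": "NEGATIVE"} for i in range(7)]
    + [{"text": f"flat {i}", "label": "NEUTRAL"} for i in range(3)]
)
capped = stratified_cap(rows, per_class=5)
counts = Counter(row["label"] for row in capped)
print(dict(counts))
```

Classes smaller than the cap are left untouched, which matches the card's statement that small sources were kept at natural size without upsampling.
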
## Label Distribution by Source

| Source | NEGATIVE | NEUTRAL | POSITIVE |
|--------|----------|---------|----------|
| nosible | 5,000 | 5,000 | 5,000 |
| timkoornstra_tweets | 5,000 | 5,000 | 5,000 |
| financemteb_finsent | 1,825 | 4,545 | 3,559 |
| subjectiveqa | 210 | 1,498 | 913 |
| aiera_transcripts | 66 | 426 | 205 |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Financial text to classify |
| `label` | string | Sentence-level sentiment: POSITIVE, NEGATIVE, or NEUTRAL |
| `source` | string | Source dataset identifier |
| `source_domain` | string | Domain category: `financial_news`, `social_media`, `analyst_reports`, or `earnings_calls` |
| `label_confidence` | string | Label provenance: `human`, `human_aggregated`, or `llm_consensus` |
| `entity` | string | Primary financial entity mentioned (company, ticker, index, commodity, sector, MARKET, or NONE) |
| `entity_sentiment` | string | Sentiment toward the specific entity (may differ from sentence-level label in multi-entity texts) |

## How This Dataset Was Produced

### 1. Collection & Label Alignment

Five source datasets were downloaded from Hugging Face and aligned to a unified 3-class schema (NEGATIVE / NEUTRAL / POSITIVE):

- **NOSIBLE**: Labels used as-is (positive/negative/neutral). No maximum text length cap applied.
- **TimKoornstra**: Mapped bullish -> POSITIVE, bearish -> NEGATIVE, neutral -> NEUTRAL
- **FinanceMTEB FinSent**: Labels used as-is (positive/negative/neutral)
- **Aiera**: Labels used as-is (positive/negative/neutral)
- **SubjECTive-QA**: Used the OPTIMISTIC dimension (0 -> NEGATIVE, 1 -> NEUTRAL, 2 -> POSITIVE) from the ANSWER column (earnings call responses)

Quality filters applied: minimum text length (10-20 characters depending on source) and null removal.

### 2. Deduplication

Two-level deduplication removed 3,056 duplicates from the combined 151,339 rows:

1. **Exact dedup**: Text was normalized (lowercased, whitespace collapsed) and 95 exact duplicates were removed.
2. **Near-duplicate dedup**: MinHash with character 5-gram shingling (64 permutations, 16 LSH bands, Jaccard threshold > 0.8) removed 2,961 near-duplicates.

Result: 148,283 unique rows.

### 3. Balancing

- Large sources (NOSIBLE, TimKoornstra) capped at 15,000 via label-stratified sampling (5,000 per class)
- Small sources (FinSent, SubjECTive-QA, Aiera) kept at natural size without upsampling
- Final dataset: 43,247 rows

### 4. Entity & Entity Sentiment Annotation

Entity extraction and entity-level sentiment were determined entirely by Claude Code agent workers in two passes. No external NLP packages (no spaCy, no NLTK) were used -- annotations are purely agent-determined.

**Pass 1 -- Initial annotation**: The balanced dataset (43,247 rows) was split into 217 batches of 200 rows. Multiple parallel agent workers read each batch, examined every text, and determined the `entity` and `entity_sentiment` using their financial domain knowledge.

**Validation**: After Pass 1, every non-NONE entity was checked against its text using substring matching with canonical name variants. Entities that did not appear in the text (hallucinated by agents) were reset to NONE. This removed 17,247 incorrect annotations (67% of initial non-NONE entities).

**Pass 2 -- NONE relabeling**: The 34,397 rows marked NONE after validation were re-batched into 172 batches and sent to a second round of parallel agent workers for more careful annotation. This recovered 16,757 additional entity labels, counted after the same text-verification step removed 4,281 hallucinated entities from Pass 2.

**Final result**: 25,607 rows (59.2%) have a verified entity, 17,640 rows (40.8%) remain NONE. There are 4,209 unique entities across the dataset. Every non-NONE entity is verified to appear in its text.

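
The text-verification step can be illustrated with a short sketch. It is an assumption-laden example rather than the author's code: the `VARIANTS` table, the function names, and the decision to exempt `MARKET` from substring checks are hypothetical, because the card does not publish the variant list or say how `MARKET` was handled. The card also does not say what happens to `entity_sentiment` on reset, so the sketch leaves it untouched.

```python
import re

# Assumption: labels that are categories rather than names are skipped.
EXEMPT = {"NONE", "MARKET"}

# Hypothetical variant table mapping a canonical entity to accepted surface forms.
VARIANTS = {"Apple": ["AAPL", "Apple Inc"]}


def normalize(text):
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def verify_entity(row, variants=None, exempt=EXEMPT):
    """Return the row unchanged if its entity occurs in the text, else reset entity to NONE."""
    entity = row["entity"]
    if entity in exempt:
        return row
    names = [entity] + list((variants or {}).get(entity, []))
    haystack = normalize(row["text"])
    if any(normalize(name) in haystack for name in names):
        return row
    # Entity not found in the text: treat it as hallucinated.
    return {**row, "entity": "NONE"}


# Synthetic rows for illustration only.
rows = [
    {"text": "Apple beats estimates", "entity": "Apple"},
    {"text": "AAPL rallies after earnings", "entity": "Apple"},
    {"text": "Stocks slid on rate fears", "entity": "Tesla"},
    {"text": "Stocks slid on rate fears", "entity": "MARKET"},
]
checked = [verify_entity(row, VARIANTS) for row in rows]
print([row["entity"] for row in checked])
```

Plain substring matching accepts a name that occurs inside an unrelated longer word, which is the kind of partial-match edge case noted under Limitations.
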
**Entity breakdown**:

- MARKET (general market commentary): 10,151 rows (23.5%)
- Named entities (companies, tickers, indices, commodities): 15,456 rows (35.7%)
- NONE (no identifiable entity): 17,640 rows (40.8%)

### 5. Train/Val/Test Split

80/10/10 stratified split on (source, label) pairs to preserve proportions across splits. Zero cross-split text leakage verified.

## Intended Use

This dataset is designed for training and evaluating financial sentiment classification models, particularly:

- Fine-tuning encoder models (BERT, RoBERTa, ModernBERT, DeBERTa) for 3-class financial sentiment
- Benchmarking against held-out datasets: Financial PhraseBank, FiQA-SA, Twitter Financial News Sentiment
- Entity-aware financial sentiment analysis

## Limitations

- **Entity hallucination**: Despite two-pass annotation with text verification, some edge cases may remain where entity names partially match unrelated text
- **NONE coverage**: 40.8% of rows have no identified entity, typically generic financial commentary or earnings call responses without explicit company names
- **Label noise**: The NOSIBLE source (34.7% of data) uses LLM-consensus labels, which may have systematic biases compared to human annotation
- **Domain bias**: Financial news and social media dominate (69.4%); earnings call coverage is limited (7.7%)
- **English only**: All text is in English
- **Temporal coverage**: Source datasets span different time periods; no temporal metadata is preserved in this version

## Citation

If you use this dataset, please cite the original source datasets:

```bibtex
@dataset{nosible2025,
  title={NOSIBLE Financial Sentiment},
  author={NOSIBLE},
  year={2025},
  url={https://huggingface.co/datasets/NOSIBLE/financial-sentiment}
}

@dataset{timkoornstra2024,
  title={Financial Tweets Sentiment},
  author={Tim Koornstra},
  year={2024},
  url={https://huggingface.co/datasets/TimKoornstra/financial-tweets-sentiment}
}

@dataset{financemteb2024,
  title={FinSent},
  author={FinanceMTEB},
  year={2024},
  url={https://huggingface.co/datasets/FinanceMTEB/FinSent}
}

@inproceedings{subjectiveqa2024,
  title={SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts},
  author={GTFinTechLab},
  year={2024},
  eprint={2410.20651},
  archivePrefix={arXiv}
}

@dataset{aiera2024,
  title={Aiera Transcript Sentiment},
  author={Aiera},
  year={2024},
  url={https://huggingface.co/datasets/Aiera/aiera-transcript-sentiment}
}
```

Provider:

neoyipeng
