neoyipeng/modernfinbert-training-v2-long

Name: neoyipeng/modernfinbert-training-v2-long
Creator: neoyipeng
Published: 2026-04-03 11:52:44
License: not listed (the dataset card below states MIT)

Source: Hugging Face
Updated: 2026-04-03
Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/neoyipeng/modernfinbert-training-v2-long

Description:

---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- sentiment-classification
tags:
- finance
- sentiment-analysis
- financial-nlp
- long-context
- earnings-calls
- sec-filings
- 10-K
- MD&A
pretty_name: ModernFinBERT Training Data v2 (Long Context)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: source
    dtype: string
  - name: source_domain
    dtype: string
  - name: label_confidence
    dtype: string
  - name: entity
    dtype: string
  - name: entity_sentiment
    dtype: string
  splits:
  - name: train
    num_examples: 28820
  - name: validation
    num_examples: 3602
  - name: test
    num_examples: 3603
---

# ModernFinBERT Training Data v2: Long Context

A long-context companion to [modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2). Each row is a passage of up to ~8,000 tokens taken from earnings call transcripts or SEC 10-K MD&A sections, with agent-labeled sentiment and entity annotations. The dataset is designed to leverage ModernBERT's 8,192-token context window for financial document understanding.

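A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository ID from this page is reachable. The helper names `load_long_context` and `rows_from_source` are illustrative and not part of the dataset.

```python
from typing import Dict, Iterable, List

REPO_ID = "neoyipeng/modernfinbert-training-v2-long"


def load_long_context(split: str = "train"):
    """Load one split ("train", "validation" or "test") from the Hub.

    The import is deferred so the rest of this module works without
    the `datasets` package or network access.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset(REPO_ID, split=split)


def rows_from_source(rows: Iterable[Dict[str, str]], source: str) -> List[Dict[str, str]]:
    """Keep only rows whose `source` column equals `source`."""
    return [row for row in rows if row["source"] == source]
```

For example, `rows_from_source(load_long_context("train"), "sec_10k_mda")` would return only the MD&A passages. On a loaded `Dataset` object, `Dataset.filter` is the more idiomatic way to do the same thing; the plain-Python helper is used here so it also works on any iterable of row dictionaries.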
## Dataset Summary

| Property | Value |
|----------|-------|
| Total examples | 36,025 |
| Train / Val / Test | 28,820 / 3,602 / 3,603 |
| Labels | NEGATIVE, NEUTRAL, POSITIVE |
| Domains | Earnings call transcripts, SEC 10-K MD&A filings |
| Text length | 2,000–32,000 chars (~500–8,000 tokens) |
| Unique entities | 7,062 |
| Entity coverage | 85.9% |
| Unique companies | 1,316 |
| License | MIT |

## Label Distribution

| Label | Count | Percentage |
|-------|-------|------------|
| POSITIVE | 19,890 | 55.2% |
| NEUTRAL | 9,182 | 25.5% |
| NEGATIVE | 6,953 | 19.3% |

## Source Composition

| Source | Dataset | Rows | Domain | License |
|--------|---------|------|--------|---------|
| `earnings_transcripts` | [glopardo/sp500-earnings-transcripts](https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts) | 21,500 | Earnings calls | MIT |
| `sec_10k_mda` | [jlohding/sp500-edgar-10k](https://huggingface.co/datasets/jlohding/sp500-edgar-10k) | 14,525 | SEC 10-K filings (Item 7 MD&A) | MIT |

## Label Distribution by Source

| Source | NEGATIVE | NEUTRAL | POSITIVE |
|--------|----------|---------|----------|
| earnings_transcripts | 973 | 3,545 | 16,982 |
| sec_10k_mda | 5,980 | 5,637 | 2,908 |

Earnings call transcripts skew positive (management tends to present optimistic narratives). MD&A sections are more balanced, with significant negative content (risk disclosures, challenges, regulatory concerns).

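Because the labels are imbalanced, a classifier trained on this data may benefit from class weighting. The sketch below recomputes the percentages in the label distribution table and derives inverse-frequency weights, `total / (n_classes * count)`, which is the same formula scikit-learn uses for `class_weight="balanced"`. Choosing this weighting is an assumption made for illustration; the dataset card does not prescribe one.

```python
from typing import Dict

# Counts copied from the label distribution table.
LABEL_COUNTS = {"POSITIVE": 19_890, "NEUTRAL": 9_182, "NEGATIVE": 6_953}


def label_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each label in percent, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


percentages = label_percentages(LABEL_COUNTS)
weights = inverse_frequency_weights(LABEL_COUNTS)
```

The resulting weights can be passed, in label-index order, to a weighted cross-entropy loss during fine-tuning.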
## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Long financial passage (~500–8,000 tokens) |
| `label` | string | Overall passage sentiment: POSITIVE, NEGATIVE, or NEUTRAL |
| `source` | string | Source dataset: `earnings_transcripts` or `sec_10k_mda` |
| `source_domain` | string | Domain: `earnings_calls` or `sec_filings` |
| `label_confidence` | string | Always `agent` (all labels determined by Claude Code agents) |
| `entity` | string | Primary financial entity (company name, ticker, index, commodity, MARKET, or NONE) |
| `entity_sentiment` | string | Sentiment toward the specific entity |

## How This Dataset Was Produced

### 1. Source Data

- **Earnings transcripts**: 20,681 full S&P 500 earnings call transcripts from [glopardo/sp500-earnings-transcripts](https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts), covering multiple years and quarters.
- **SEC 10-K MD&A**: Item 7 (Management's Discussion and Analysis) sections from 6,282 S&P 500 annual filings via [jlohding/sp500-edgar-10k](https://huggingface.co/datasets/jlohding/sp500-edgar-10k).

### 2. Chunking

Full transcripts (~13K tokens) and MD&A sections (~5K–50K tokens) were split into chunks of up to 32,000 characters (~8,000 tokens) to fit within ModernBERT's 8,192-token context window. Chunks shorter than 2,000 characters (~500 tokens) were discarded.

### 3. Deduplication

Exact text deduplication on normalized text (lowercased, whitespace collapsed) removed 4,821 duplicates from the 60,543 raw chunks, yielding 55,722 unique passages.

### 4. Balancing

The data was capped at ~43K rows to match the short-context v2 dataset. It was balanced between sources: 21,500 earnings transcripts + 14,525 MD&A sections = 36,025 total.

### 5. Agent Annotation

All labels were determined by Claude Code agent workers (the source data contains no pre-existing labels). The rows were processed in 361 batches of up to 100 rows each by 8+ parallel agents.

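The mechanical parts of steps 2, 3, and 5 above (chunking, deduplication, batching) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the card does not say where chunk boundaries were placed, so `chunk_text` simply cuts at fixed character offsets, and all function names are hypothetical. With 36,025 rows and a batch size of 100, `batches` yields 361 batches, the last holding 25 rows, which is consistent with the batch count reported above.

```python
import re
from typing import Iterable, Iterator, List

MAX_CHARS = 32_000  # ~8,000 tokens at roughly 4 characters per token
MIN_CHARS = 2_000   # ~500 tokens; shorter chunks are discarded
BATCH_SIZE = 100


def chunk_text(text: str, max_chars: int = MAX_CHARS, min_chars: int = MIN_CHARS) -> List[str]:
    """Cut `text` into pieces of at most `max_chars`, dropping short pieces.

    Assumption: fixed-offset cuts. The real pipeline may have used
    sentence or paragraph boundaries instead.
    """
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [piece for piece in pieces if len(piece) >= min_chars]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace runs, as described in step 3.

    Trimming leading/trailing whitespace is an extra assumption.
    """
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(chunks: Iterable[str]) -> List[str]:
    """Keep the first chunk seen for each distinct normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = normalize(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def batches(rows: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```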
Each agent read the text passage and determined:

- `label`: overall passage sentiment
- `entity`: primary financial entity mentioned
- `entity_sentiment`: sentiment toward that entity

After annotation, entities were text-verified: every non-NONE/MARKET entity was confirmed to appear in its text via substring matching. 3,188 hallucinated entities were reset to NONE. Final entity coverage: 85.9%.

### 6. Split

80/10/10 stratified split by source. Zero cross-split text leakage verified.

## Top Entities

| Entity | Count |
|--------|-------|
| MARKET | 4,301 |
| General Electric Company | 3,417 |
| Ally Financial Inc. | 1,267 |
| Bank of America | 1,047 |
| General Motors Company | 591 |
| Goldman Sachs | 493 |
| Morgan Stanley | 463 |
| Bank of America Corp. | 458 |
| JPMorgan Chase & Co. | 439 |
| Goldman Sachs Group Inc. | 346 |

## Intended Use

- Fine-tuning ModernBERT (or other long-context encoders) for long-document financial sentiment classification
- Training alongside [modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2) (short-context) for a model that handles both headlines and full documents
- Entity-aware financial sentiment research on long-form text

## Limitations

- **Agent-labeled**: All labels were determined by AI agents, not human annotators. May contain systematic biases.
- **Positive skew in earnings calls**: Management presentations are inherently optimistic; the positive label dominance (55%) reflects this bias rather than balanced sentiment.
- **Chunking artifacts**: Some passages may start/end mid-sentence at chunk boundaries.
- **Entity duplication**: Some entities appear in both canonical and variant forms (e.g., "Bank of America" and "Bank of America Corp.") due to different agent workers.
- **English only**: All text is in English.
- **S&P 500 bias**: Both sources cover S&P 500 companies only; smaller companies are not represented.

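The entity verification step and the entity duplication limitation described above can be illustrated with a small sketch. The card only says that verification used substring matching; the case-insensitive comparison, the suffix list, and both function names are assumptions made for this example.

```python
import re
from typing import Dict

SPECIAL_ENTITIES = {"NONE", "MARKET"}  # exempt from verification per the card

# Assumed list of trailing corporate suffixes; extend as needed.
_SUFFIX = re.compile(r"[\s,]+(?:inc|corp|corporation|company|ltd|plc)\.?$", re.IGNORECASE)


def verify_entity(row: Dict[str, str]) -> Dict[str, str]:
    """Reset `entity` to NONE when it does not occur in `text`.

    `entity_sentiment` is left untouched because the card does not say
    how it was handled for reset entities.
    """
    entity = row["entity"]
    if entity in SPECIAL_ENTITIES or entity.lower() in row["text"].lower():
        return row
    return {**row, "entity": "NONE"}


def canonical_entity(entity: str) -> str:
    """Strip one trailing corporate suffix so simple variants share a key."""
    return _SUFFIX.sub("", entity.strip())
```

With this key, "Bank of America" and "Bank of America Corp." collapse to the same string. It only merges suffix variants: "Goldman Sachs" and "Goldman Sachs Group Inc." would still map to different keys and need an explicit alias table.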
## Companion Dataset

This dataset is designed to be used alongside [neoyipeng/modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2) (43K rows of short-context financial text: headlines, tweets, analyst reports). Combined training produces a model that handles both short and long financial text.

## Citation

```bibtex
@dataset{glopardo2024,
  title={SP500 Earnings Transcripts},
  author={glopardo},
  year={2024},
  url={https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts}
}

@dataset{jlohding2024,
  title={SP500 EDGAR 10-K},
  author={jlohding},
  year={2024},
  url={https://huggingface.co/datasets/jlohding/sp500-edgar-10k}
}
```

Provider:

neoyipeng
