neoyipeng/modernfinbert-training-v2-long
Source: Hugging Face | Updated: 2026-04-03 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/neoyipeng/modernfinbert-training-v2-long
Resource description:
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- sentiment-classification
tags:
- finance
- sentiment-analysis
- financial-nlp
- long-context
- earnings-calls
- sec-filings
- 10-K
- MD&A
pretty_name: ModernFinBERT Training Data v2 (Long Context)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: source
    dtype: string
  - name: source_domain
    dtype: string
  - name: label_confidence
    dtype: string
  - name: entity
    dtype: string
  - name: entity_sentiment
    dtype: string
  splits:
  - name: train
    num_examples: 28820
  - name: validation
    num_examples: 3602
  - name: test
    num_examples: 3603
---
# ModernFinBERT Training Data v2 — Long Context
A long-context companion to [modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2). Each row is a passage of up to ~8,000 tokens (2,000 to 32,000 characters) drawn from S&P 500 earnings call transcripts or SEC 10-K MD&A sections, with agent-assigned sentiment labels and entity annotations. It is designed to make use of ModernBERT's 8,192-token context window for financial document understanding.
## Dataset Summary
| Property | Value |
|----------|-------|
| Total examples | 36,025 |
| Train / Val / Test | 28,820 / 3,602 / 3,603 |
| Labels | NEGATIVE, NEUTRAL, POSITIVE |
| Domains | Earnings call transcripts, SEC 10-K MD&A filings |
| Text length | 2,000–32,000 chars (~500–8,000 tokens) |
| Unique entities | 7,062 |
| Entity coverage | 85.9% |
| Unique companies | 1,316 |
| License | MIT |
## Label Distribution
| Label | Count | Percentage |
|-------|-------|------------|
| POSITIVE | 19,890 | 55.2% |
| NEUTRAL | 9,182 | 25.5% |
| NEGATIVE | 6,953 | 19.3% |
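Because the labels are imbalanced, a classifier trained on this data may benefit from class weighting. The sketch below derives inverse-frequency weights from the counts in the table above. The weighting formula (`total / (num_classes * count)`) is a common convention and an assumption here, not something the dataset card prescribes, and the counts cover the whole dataset rather than the train split alone.

```python
# Label counts copied from the Label Distribution table (all splits combined).
LABEL_COUNTS = {"POSITIVE": 19_890, "NEUTRAL": 9_182, "NEGATIVE": 6_953}


def inverse_frequency_weights(counts):
    """Return total / (num_classes * count) for each label (assumed scheme)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(LABEL_COUNTS)
for label, weight in weights.items():
    share = LABEL_COUNTS[label] / sum(LABEL_COUNTS.values())
    print(f"{label:<9} share={share:.1%} weight={weight:.3f}")
```

Under this scheme the rarest label (NEGATIVE) receives the largest weight and the most common (POSITIVE) the smallest.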
## Source Composition
| Source | Dataset | Rows | Domain | License |
|--------|---------|------|--------|---------|
| `earnings_transcripts` | [glopardo/sp500-earnings-transcripts](https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts) | 21,500 | Earnings calls | MIT |
| `sec_10k_mda` | [jlohding/sp500-edgar-10k](https://huggingface.co/datasets/jlohding/sp500-edgar-10k) | 14,525 | SEC 10-K filings (Item 7 MD&A) | MIT |
## Label Distribution by Source
| Source | NEGATIVE | NEUTRAL | POSITIVE |
|--------|----------|---------|----------|
| earnings_transcripts | 973 | 3,545 | 16,982 |
| sec_10k_mda | 5,980 | 5,637 | 2,908 |
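Earnings call transcripts skew heavily positive: about 79% of those chunks are labeled POSITIVE, which is consistent with management presenting optimistic narratives. MD&A sections are more evenly spread (about 41% NEGATIVE, 39% NEUTRAL, 20% POSITIVE), with substantial negative content such as risk disclosures, challenges, and regulatory concerns.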
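As a quick consistency check, the per-source counts can be compared against the totals reported elsewhere in this card. The following sketch only re-uses numbers from the tables above; the variable and function names are illustrative.

```python
# Counts copied from the "Label Distribution by Source" table.
BY_SOURCE = {
    "earnings_transcripts": {"NEGATIVE": 973, "NEUTRAL": 3_545, "POSITIVE": 16_982},
    "sec_10k_mda": {"NEGATIVE": 5_980, "NEUTRAL": 5_637, "POSITIVE": 2_908},
}
# Row totals from "Source Composition" and label totals from "Label Distribution".
SOURCE_ROWS = {"earnings_transcripts": 21_500, "sec_10k_mda": 14_525}
LABEL_TOTALS = {"NEGATIVE": 6_953, "NEUTRAL": 9_182, "POSITIVE": 19_890}


def label_shares(counts):
    """Convert raw label counts for one source into fractions of that source."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = {source: label_shares(counts) for source, counts in BY_SOURCE.items()}

for source, counts in BY_SOURCE.items():
    # Each row of the per-source table should sum to that source's row count.
    assert sum(counts.values()) == SOURCE_ROWS[source]
    print(source, {label: f"{s:.1%}" for label, s in shares[source].items()})

for label, total in LABEL_TOTALS.items():
    # Each column should sum to the overall count for that label.
    assert sum(counts[label] for counts in BY_SOURCE.values()) == total
```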
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Long financial passage (~500–8,000 tokens) |
| `label` | string | Overall passage sentiment: POSITIVE, NEGATIVE, or NEUTRAL |
| `source` | string | Source dataset: `earnings_transcripts` or `sec_10k_mda` |
| `source_domain` | string | Domain: `earnings_calls` or `sec_filings` |
| `label_confidence` | string | Always `agent` (all labels determined by Claude Code agents) |
| `entity` | string | Primary financial entity (company name, ticker, index, commodity, MARKET, or NONE) |
| `entity_sentiment` | string | Sentiment toward the specific entity |
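The schema can be expressed as a small row validator. This is a minimal sketch: the allowed values come from the table above, `validate_row` and `example_row` are hypothetical names, and the example row is invented for illustration. The card does not list the allowed values of `entity_sentiment`, so that column is only checked for being a string.

```python
from typing import Dict, List

COLUMNS = [
    "text", "label", "source", "source_domain",
    "label_confidence", "entity", "entity_sentiment",
]

# Allowed values as listed in the schema table.
ALLOWED = {
    "label": {"POSITIVE", "NEGATIVE", "NEUTRAL"},
    "source": {"earnings_transcripts", "sec_10k_mda"},
    "source_domain": {"earnings_calls", "sec_filings"},
    "label_confidence": {"agent"},
}


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one row (empty list means valid)."""
    problems = []
    for column in COLUMNS:
        if not isinstance(row.get(column), str):
            problems.append(f"{column}: missing or not a string")
    for column, allowed in ALLOWED.items():
        value = row.get(column)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{column}: unexpected value {value!r}")
    return problems


# Invented example row, not taken from the dataset.
example_row = {
    "text": "Revenue grew year over year, driven by higher volumes.",
    "label": "POSITIVE",
    "source": "earnings_transcripts",
    "source_domain": "earnings_calls",
    "label_confidence": "agent",
    "entity": "MARKET",
    "entity_sentiment": "POSITIVE",
}
print(validate_row(example_row))
```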
## How This Dataset Was Produced
### 1. Source Data
- **Earnings transcripts**: 20,681 full S&P 500 earnings call transcripts from [glopardo/sp500-earnings-transcripts](https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts), covering multiple years and quarters.
- **SEC 10-K MD&A**: Item 7 (Management's Discussion and Analysis) sections from 6,282 S&P 500 annual filings via [jlohding/sp500-edgar-10k](https://huggingface.co/datasets/jlohding/sp500-edgar-10k).
### 2. Chunking
Full transcripts (~13K tokens) and MD&A sections (~5K–50K tokens) were split into chunks of up to 32,000 characters (~8,000 tokens) to fit within ModernBERT's 8,192-token context window. Chunks shorter than 2,000 characters (~500 tokens) were discarded.
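A character-based chunker with these two thresholds might look like the sketch below. The card gives only the size limits; cutting at the last space inside each window is an assumption made here, and the author's actual boundary rule is not documented.

```python
MAX_CHARS = 32_000  # upper bound per chunk, roughly 8,000 tokens
MIN_CHARS = 2_000   # chunks shorter than this are discarded


def chunk_text(text, max_chars=MAX_CHARS, min_chars=MIN_CHARS):
    """Split text into chunks of at most max_chars, dropping short chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Assumed rule: prefer to cut at the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if len(chunk) >= min_chars:
            chunks.append(chunk)
        start = end
    return chunks


document = "word " * 14_000  # 70,000 characters of filler text
print([len(chunk) for chunk in chunk_text(document)])
```

A splitter like this does not respect sentence boundaries, which matches the chunking artifacts noted under Limitations.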
### 3. Deduplication
Exact text deduplication on normalized text (lowercased, whitespace collapsed) removed 4,821 duplicates from the 60,543 raw chunks, yielding 55,722 unique passages.
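The normalization described above (lowercasing and collapsing whitespace) followed by exact matching can be sketched as follows. Keeping the first occurrence of each duplicate is an assumption; the card does not say which copy was retained.

```python
import re


def normalize(text):
    """Lowercase and collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(passages):
    """Keep the first passage for each distinct normalized text."""
    seen = set()
    unique = []
    for passage in passages:
        key = normalize(passage)
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique


sample = ["Revenue  rose.\n", "revenue rose.", "Costs fell."]
print(deduplicate(sample))

# The reported counts are internally consistent.
assert 60_543 - 4_821 == 55_722
```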
### 4. Balancing
The 55,722 unique passages were subsampled under a cap of ~43K rows, the size of the short-context v2 dataset. The final selection is split across the two sources as 21,500 earnings-transcript chunks and 14,525 MD&A chunks, 36,025 rows in total.
### 5. Agent Annotation
All labels were determined by Claude Code agent workers; the source data carries no pre-existing labels. The rows were processed in 361 batches of up to 100 rows each by 8+ parallel agents. Each agent read the text passage and determined:
- `label`: overall passage sentiment
- `entity`: primary financial entity mentioned
- `entity_sentiment`: sentiment toward that entity
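The batch count is consistent with the dataset size: 36,025 rows in batches of 100 gives 361 batches, the last one partial. A minimal batching helper, with hypothetical names, illustrates the arithmetic; how the batches were actually dispatched to agents is not described in this card.

```python
import math


def batches(rows, batch_size=100):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


total_rows = 36_025
num_batches = math.ceil(total_rows / 100)
last_batch_size = total_rows - (num_batches - 1) * 100
print(num_batches, last_batch_size)
```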
After annotation, entities were verified against the text: every entity other than NONE or MARKET was checked to appear in its passage via substring matching. The 3,188 entities that failed this check were treated as hallucinated and reset to NONE. Final entity coverage: 85.9%.
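The verification step can be sketched as a substring check. Whether the original check was case-sensitive is not stated, so the case-insensitive comparison below is an assumption, as is the function name.

```python
SPECIAL_ENTITIES = {"NONE", "MARKET"}  # values exempt from verification


def verify_entity(text, entity):
    """Return the entity if it occurs in the text, otherwise "NONE"."""
    if entity in SPECIAL_ENTITIES:
        return entity
    # Assumed: case-insensitive substring match.
    return entity if entity.lower() in text.lower() else "NONE"


passage = "Bank of America reported higher net interest income this quarter."
print(verify_entity(passage, "Bank of America"))        # found in the passage
print(verify_entity(passage, "Bank of America Corp."))  # exact string absent
```

A strict substring match rejects names whose exact form does not occur in the passage, even when the company itself is discussed.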
### 6. Split
The data was split 80/10/10 into train, validation, and test sets, stratified by source. A check confirmed that no passage text appears in more than one split.
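A source-stratified 80/10/10 split with a leakage check can be sketched as below. The shuffling seed, the integer rounding rule, and the function names are assumptions. With these choices the sketch happens to reproduce the split sizes reported in this card, which does not confirm that it matches the author's procedure.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="source", train_pct=80, val_pct=10, seed=0):
    """Split rows into train/validation/test within each value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    splits = {"train": [], "validation": [], "test": []}
    rng = random.Random(seed)
    for group in groups.values():
        rng.shuffle(group)
        n_train = len(group) * train_pct // 100
        n_val = len(group) * val_pct // 100
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits


def has_leakage(splits):
    """Return True if any text occurs in more than one split."""
    first_seen = {}
    for name, rows in splits.items():
        for row in rows:
            if first_seen.setdefault(row["text"], name) != name:
                return True
    return False


# Synthetic rows with the same per-source counts as the dataset.
rows = [{"source": "earnings_transcripts", "text": f"call {i}"} for i in range(21_500)]
rows += [{"source": "sec_10k_mda", "text": f"mda {i}"} for i in range(14_525)]
splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()}, has_leakage(splits))
```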
## Top Entities
| Entity | Count |
|--------|-------|
| MARKET | 4,301 |
| General Electric Company | 3,417 |
| Ally Financial Inc. | 1,267 |
| Bank of America | 1,047 |
| General Motors Company | 591 |
| Goldman Sachs | 493 |
| Morgan Stanley | 463 |
| Bank of America Corp. | 458 |
| JPMorgan Chase & Co. | 439 |
| Goldman Sachs Group Inc. | 346 |
## Intended Use
- Fine-tuning ModernBERT (or other long-context encoders) for long-document financial sentiment classification
- Training alongside [modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2) (short-context) for a model that handles both headlines and full documents
- Entity-aware financial sentiment research on long-form text
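For fine-tuning, the string labels usually need to be mapped to integer ids. The sketch below loads the dataset with the Hugging Face `datasets` library and adds a `label_id` column. The id order in `LABEL2ID`, the column name `label_id`, and the helper names are assumptions, not part of the dataset.

```python
# Assumed label-to-id mapping; any fixed order works if used consistently.
LABEL2ID = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def encode_label(row):
    """Return the extra column to add: an integer id for the string label."""
    return {"label_id": LABEL2ID[row["label"]]}


def load_splits(repo_id="neoyipeng/modernfinbert-training-v2-long"):
    """Download the dataset and add `label_id` to every split."""
    # Third-party dependency: pip install datasets (needs network access).
    from datasets import load_dataset

    dataset = load_dataset(repo_id)  # DatasetDict keyed by split name
    return dataset.map(encode_label)
```

Calling `load_splits()` downloads the data on first use; the returned object should contain the `train`, `validation`, and `test` splits listed in the metadata.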
## Limitations
- **Agent-labeled**: All labels were determined by AI agents, not human annotators. May contain systematic biases.
- **Positive skew in earnings calls**: Management presentations are inherently optimistic. The overall 55% POSITIVE share largely reflects this source bias rather than balanced sentiment.
- **Chunking artifacts**: Some passages may start/end mid-sentence at chunk boundaries.
- **Entity duplication**: Some entities appear in both canonical and variant forms (e.g., "Bank of America" and "Bank of America Corp.") because different agent workers named the same company differently.
- **English only**: All text is in English.
- **S&P 500 bias**: Both sources cover S&P 500 companies only; smaller companies are not represented.
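The entity duplication noted above can be reduced with a simple canonicalization pass. The sketch below strips trailing corporate suffixes and merges counts from the Top Entities table. The suffix list and the function name are hypothetical, and a rule this crude may merge names that should stay distinct, so results are worth reviewing by hand.

```python
from collections import Counter

# Hypothetical list of trailing tokens to strip (compared without the final dot).
SUFFIXES = {"inc", "corp", "corporation", "company", "co", "group", "ltd", "llc", "plc", "&"}


def canonical_entity(name):
    """Strip trailing corporate suffixes such as "Inc." or "& Co."."""
    tokens = name.replace(",", " ").split()
    while len(tokens) > 1 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Counts copied from the Top Entities table.
TOP_ENTITIES = {
    "MARKET": 4_301,
    "General Electric Company": 3_417,
    "Ally Financial Inc.": 1_267,
    "Bank of America": 1_047,
    "General Motors Company": 591,
    "Goldman Sachs": 493,
    "Morgan Stanley": 463,
    "Bank of America Corp.": 458,
    "JPMorgan Chase & Co.": 439,
    "Goldman Sachs Group Inc.": 346,
}

merged = Counter()
for name, count in TOP_ENTITIES.items():
    merged[canonical_entity(name)] += count

for name, count in merged.most_common():
    print(f"{name}: {count}")
```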
## Companion Dataset
This dataset is designed to be used alongside [neoyipeng/modernfinbert-training-v2](https://huggingface.co/datasets/neoyipeng/modernfinbert-training-v2) (43K rows of short-context financial text: headlines, tweets, analyst reports). Combined training produces a model that handles both short and long financial text.
## Citation
```bibtex
@dataset{glopardo2024,
  title={SP500 Earnings Transcripts},
  author={glopardo},
  year={2024},
  url={https://huggingface.co/datasets/glopardo/sp500-earnings-transcripts}
}
@dataset{jlohding2024,
  title={SP500 EDGAR 10-K},
  author={jlohding},
  year={2024},
  url={https://huggingface.co/datasets/jlohding/sp500-edgar-10k}
}
```
Provider:
neoyipeng



