RudrakshNanavaty/earnings-call-data
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/RudrakshNanavaty/earnings-call-data
Description:
---
language:
- en
license: mit
tags:
- finance
- earnings
- transcripts
- time-series
- parquet
- reinforcement-learning
- sp500
- xbrl
- fundamentals
size_categories:
- 10K-100K
task_categories:
- text-classification
- summarization
- feature-extraction
- text-retrieval
- reinforcement-learning
- other
pretty_name: S&P 500 earnings episodes (2005–2025; merged transcripts, prices, SEC, labels)
---
# S&P 500 earnings episodes (2005–2025)
**Augmented release** built on [`Bose345/sp500_earnings_transcripts`](https://huggingface.co/datasets/Bose345/sp500_earnings_transcripts) (same transcript calendar span as that collection: **2005–2025**). Static tabular data for supervised learning or RL-style experiments on **earnings-call episodes**. Each row is one company–quarter call, keyed by a stable `episode_id`, with long-form text (full earnings transcript, SEC press materials), pre-earnings price context, OHLCV anchors, **SEC XBRL fundamentals** (`xbrl_*` columns), and **post-earnings return labels**.
**Companion report:** **`sweetviz_episodes.html`** — a **Sweetviz** profile of `episodes.parquet`, shipped in this dataset repo. [View on the Hub](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/blob/main/sweetviz_episodes.html) or download the [raw file](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/resolve/main/sweetviz_episodes.html) and open it locally in a browser (distributions, missingness, associations).
---
## What’s in this folder
These files are the **materialized outputs** of the build pipeline (upstream Hugging Face transcripts → Yahoo Finance prices → SEC EDGAR 8-K press text → feature engineering → merge → optional XBRL join). Intermediate download caches usually live under `data/cache/` locally and are **not** required for analysis if you only use the parquet files below.
| File | Role |
|------|------|
| **`episodes.parquet`** | **Primary dataset** — one row per episode with identity, text, features, OHLCV anchors, **SEC XBRL fundamentals** (`xbrl_*`), and labels (see [Schema](#schema-episodesparquet)). |
| **`episodes_press_release_8k.parquet`** | **Subset** of `episodes.parquet`: only rows where `press_release_8k_body` is not null (same schema; fewer rows — on the order of **~16k** after a full pipeline run). [Browse on the Hub](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/blob/main/episodes_press_release_8k.parquet). Produced locally with `uv run python pipeline/filter_episodes_press_release_8k.py`. |
| **`sweetviz_episodes.html`** | **Exploratory HTML report** (Sweetviz) for `episodes.parquet`; same folder on the Hub as the parquet files ([see below](#sweetviz-html)). |
| `raw_hf.parquet` | Base transcript metadata and structured content source fields from the upstream Hugging Face dataset (see [Provenance](#provenance)). |
| `raw_prices.parquet` | Per-episode OHLCV anchors, sector, and price-derived fields from market data. |
| `raw_press_releases.parquet` | SEC 8-K body and exhibit text (e.g. EX-99.1 / EX-99.2) aligned to each episode. |
| `features.parquet` | Formatted earnings transcript, text flags, momentum/volume features, and label columns produced in the feature stage. |
Rough scale after a full pipeline run: about **33k rows** in `episodes.parquet`, about **16k rows** in `episodes_press_release_8k.parquet`, and **hundreds of tickers** (in line with upstream transcript coverage) across **2005–2025**. Confirm the counts on your copy: load the frame with `ep = pd.read_parquet("episodes.parquet")`, then check `len(ep)` and `ep["symbol"].nunique()`.
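A minimal sketch of those checks, together with the not-null filter that defines the 8-K subset. The helper name `summarize_episodes` and the toy frame (including its `episode_id` values) are assumptions made for illustration; with the real data, pass `pd.read_parquet("episodes.parquet")` instead.

```python
import pandas as pd


def summarize_episodes(ep: pd.DataFrame) -> dict:
    """Return row count, distinct tickers, and the size of the 8-K subset."""
    has_8k = ep["press_release_8k_body"].notna()
    return {
        "rows": len(ep),
        "symbols": int(ep["symbol"].nunique()),
        "rows_with_8k_body": int(has_8k.sum()),
    }


# Toy stand-in for episodes.parquet; all values are made up.
toy = pd.DataFrame(
    {
        "episode_id": ["example-1", "example-2", "example-3"],
        "symbol": ["AAA", "AAA", "BBB"],
        "press_release_8k_body": ["Item 2.02 ...", None, "Item 2.02 ..."],
    }
)
summary = summarize_episodes(toy)
print(summary)
```

On the full table, `rows_with_8k_body` should line up with the row count of `episodes_press_release_8k.parquet`.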
---
## Schema (`episodes.parquet`)
Columns follow this order in the merged export:
**Identity:** `episode_id`, `symbol`, `company_name`, `company_id`, `year`, `quarter`, `date`, `earnings_date`, `sector`
**Text (observation):** `earnings_transcript`, `press_release_8k_body`, `press_release_ex991`, `press_release_ex992`, `press_release_sources`
**Text flags:** `guidance_mentioned`, `beat_mentioned`
**Pre-call price features:** `price_momentum_30d`, `price_momentum_90d`, `pct_from_52w_high_pt`, `avg_volume_20d`
**OHLCV anchors (grading / simulation):** `d_minus_1_*`, `d_plus_1_*`, `d_plus_30_*`, `next_qtr_d_minus_1_*`, where each prefix covers open, high, low, close, and volume fields
**Labels / targets:** `sentiment_label`, `move_1d`, `move_30d`, `move_next_qtr`, `move_1d_direction`, `gap_open_d1`, `volume_surge_d1`
**Audit / quality:** `next_qtr_date`
**XBRL (SEC EDGAR companyfacts, 2009+):** Per-episode numeric facts from the SEC **company facts** JSON API (`data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json`), documented under [SEC EDGAR APIs](https://www.sec.gov/edgar/sec-api-documentation). Facts use **`us-gaap`** concepts only. Episodes with **`year < 2009`** have nulls in all `xbrl_*` columns (no companyfacts match is attempted for those rows).
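As a rough illustration of the source data, the sketch below builds a companyfacts URL and reads one `us-gaap` concept out of a payload. The helper names (`companyfacts_url`, `usgaap_facts`) and the hand-written `sample` payload are assumptions for this example, not pipeline code. The nesting (`facts`, then `us-gaap`, then concept, then `units`) follows the SEC API's JSON layout, and no network request is made.

```python
def companyfacts_url(cik: int) -> str:
    """Build the companyfacts URL; the endpoint expects a 10-digit, zero-padded CIK."""
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik):010d}.json"


def usgaap_facts(payload: dict, concept: str, unit: str = "USD") -> list:
    """Return the reported facts for one us-gaap concept, or [] when it is absent."""
    concepts = payload.get("facts", {}).get("us-gaap", {})
    return concepts.get(concept, {}).get("units", {}).get(unit, [])


# Hand-written payload in the companyfacts layout; every value here is made up.
sample = {
    "cik": 1234,
    "entityName": "Example Corp",
    "facts": {
        "us-gaap": {
            "Revenues": {
                "units": {
                    "USD": [
                        {"end": "2020-03-31", "val": 1000.0, "fy": 2020, "fp": "Q1", "form": "10-Q"},
                        {"end": "2020-06-30", "val": 1100.0, "fy": 2020, "fp": "Q2", "form": "10-Q"},
                    ]
                }
            }
        }
    },
}

url = companyfacts_url(sample["cik"])
revenue_facts = usgaap_facts(sample, "Revenues")
missing_facts = usgaap_facts(sample, "GrossProfit")  # concept not in the payload
print(url, len(revenue_facts), len(missing_facts))
```

A missing concept yields an empty list rather than an error, which is the case that leaves an `xbrl_*` cell null.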
**How it is joined:** each episode’s ticker maps to a **CIK** via the same SEC ticker map used elsewhere in the pipeline (`data/cache/edgar/cik_map.json`, built during EDGAR steps or with `uv run python pipeline/build_cik_map.py`). If no CIK is found, companyfacts are not fetched for that row. After the merged table exists, run:
`uv run python pipeline/06_xbrl.py`
That step fills `xbrl_*` on **`episodes.parquet`** and refreshes **`episodes_press_release_8k.parquet`** with the same columns. Requests respect SEC rate limits (under 10 requests per second). When you run the pipeline locally, gaps and reasons are appended to **`reports/failures_xbrl.csv`** (not required to use the Hub parquet).
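One simple way to stay under such a cap when re-fetching is to enforce a minimum gap between requests. The `Throttle` class below is a hypothetical sketch, not the pipeline's implementation, and the rate of 8 requests per second is an arbitrary value chosen to sit below the limit. The demo injects a simulated clock so it runs instantly.

```python
import time


class Throttle:
    """Enforce a minimum spacing between calls so the request rate stays under a cap."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None:
            delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = now + delay + self.min_interval
        return delay


# Demo with a simulated clock: sleeping just advances the fake time.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


throttle = Throttle(max_per_second=8, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # [0.0, 0.125, 0.125]
```

In real use, construct `Throttle(max_per_second=8)` with the default clock and call `wait()` before each request.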
**Matching logic:** each metric tries **several GAAP local names in priority order** (e.g. revenue tries `Revenues`, then revenue-from-contract variants, then net sales) so more cells populate despite issuer tag choice; see `pipeline/06_xbrl.py` for the exact chains.
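The priority-order idea can be sketched as follows. This is an illustration under stated assumptions, not the code in `pipeline/06_xbrl.py`: the three-tag chain is only an example of the pattern, and matching a fact by its period end date is an assumption about how a fact is aligned to a quarter.

```python
from typing import Optional, Sequence, Tuple

# Illustrative priority chain (an assumption); the exact chains are in pipeline/06_xbrl.py.
REVENUE_TAGS = (
    "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "SalesRevenueNet",
)


def first_matching_fact(
    usgaap: dict, tags: Sequence[str], period_end: str, unit: str = "USD"
) -> Tuple[Optional[float], Optional[str]]:
    """Try each tag in order; return (value, winning_tag) or (None, None)."""
    for tag in tags:
        for fact in usgaap.get(tag, {}).get("units", {}).get(unit, []):
            # Assumption: a fact matches when its period end equals the target date.
            if fact.get("end") == period_end:
                return fact["val"], tag
    return None, None


# Made-up issuer that never reports "Revenues" but does report net sales.
usgaap = {
    "SalesRevenueNet": {"units": {"USD": [{"end": "2012-06-30", "val": 500.0}]}},
}
value, tag = first_matching_fact(usgaap, REVENUE_TAGS, "2012-06-30")
no_value, no_tag = first_matching_fact(usgaap, REVENUE_TAGS, "2012-09-30")
print(value, tag, no_value, no_tag)
```

The second element of the returned pair plays the role of a "winning" tag: it names whichever concept in the chain supplied the value, and it is `None` when nothing matched.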
**Provenance (string):** for each value column there is a sibling `*_tag` column (e.g. `xbrl_revenue_tag`) with the **winning** local GAAP name, or null if the value is null.
- **Income statement:** `xbrl_revenue`, `xbrl_cost_of_revenue`, `xbrl_gross_profit`, `xbrl_operating_income`, `xbrl_net_income`, `xbrl_eps_basic`, `xbrl_eps_diluted` — plus `xbrl_revenue_tag`, …, `xbrl_eps_diluted_tag`
- **Balance sheet:** `xbrl_cash_and_cash_equivalents`, `xbrl_total_assets`, `xbrl_total_liabilities` — plus `xbrl_cash_and_cash_equivalents_tag`, `xbrl_total_assets_tag`, `xbrl_total_liabilities_tag`
- **Cash flow:** `xbrl_net_cash_operating_activities`, `xbrl_capital_expenditures` — plus `xbrl_net_cash_operating_activities_tag`, `xbrl_capital_expenditures_tag`
Treat these fields as **best-effort fundamentals aligned to the earnings quarter**, not audited restatements; expect **sparse cells** where filings, tags, or timing do not yield a match.
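To see how sparse these cells are on your copy, a coverage table by year is a quick check. The sketch below is one way to compute it; the function names and the toy frame are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def xbrl_value_columns(ep: pd.DataFrame) -> list:
    """List xbrl_* value columns, skipping the sibling *_tag provenance columns."""
    return [c for c in ep.columns if c.startswith("xbrl_") and not c.endswith("_tag")]


def xbrl_coverage_by_year(ep: pd.DataFrame) -> pd.DataFrame:
    """Share of non-null cells per xbrl_* value column, grouped by episode year."""
    cols = xbrl_value_columns(ep)
    return ep[cols].notna().groupby(ep["year"]).mean()


# Toy frame standing in for episodes.parquet; values are made up.
toy = pd.DataFrame(
    {
        "year": [2008, 2008, 2015, 2015],
        "xbrl_revenue": [None, None, 10.0, 12.0],
        "xbrl_revenue_tag": [None, None, "Revenues", "SalesRevenueNet"],
        "xbrl_net_income": [None, None, 1.0, None],
        "xbrl_net_income_tag": [None, None, "NetIncomeLoss", None],
    }
)
coverage = xbrl_coverage_by_year(toy)
print(coverage)
```

On the real table, years before 2009 should show zero coverage in every `xbrl_*` column.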
`sentiment_label` is derived from `move_1d` using fixed percentage bands (very bearish through very bullish). Treat labels as **historical hindsight** for research, not investment advice.
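Banding of this kind can be sketched with `pandas.cut`. The band edges and label strings below are hypothetical placeholders, since the dataset's actual cut-offs and label spellings are not stated here, and the sketch assumes `move_1d` is stored as a fraction (0.03 meaning 3%).

```python
import pandas as pd

# Hypothetical band edges and label names; not the dataset's actual cut-offs.
BAND_EDGES = [float("-inf"), -0.05, -0.01, 0.01, 0.05, float("inf")]
BAND_LABELS = ["very_bearish", "bearish", "neutral", "bullish", "very_bullish"]


def label_from_move(move_1d: pd.Series) -> pd.Series:
    """Bucket one-day returns (assumed to be fractions) into five ordered bands."""
    # Intervals are right-closed by default, e.g. (-0.05, -0.01] maps to "bearish".
    return pd.cut(move_1d, bins=BAND_EDGES, labels=BAND_LABELS)


moves = pd.Series([-0.08, -0.02, 0.0, 0.03, 0.10])
labels = label_from_move(moves).astype(str).tolist()
print(labels)
```

Because the label is a deterministic function of `move_1d`, using both as model inputs or targets at once adds no information.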
---
## Sweetviz HTML
The **Sweetviz** report is an exploratory companion to **`episodes.parquet`** only. It summarizes column types, missingness, numeric distributions, and target associations without loading the full frame in a notebook.
**On this Hub repo** the file lives next to the parquet exports:
- **Filename:** `sweetviz_episodes.html`
- **Browse:** [dataset files → `sweetviz_episodes.html`](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/tree/main)
- **Direct download:** [`.../resolve/main/sweetviz_episodes.html`](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/resolve/main/sweetviz_episodes.html)
**Download with Python** ([`huggingface_hub`](https://huggingface.co/docs/huggingface_hub)):
```python
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="RudrakshNanavaty/earnings-call-data",
    filename="sweetviz_episodes.html",
    repo_type="dataset",
)
print(path) # open this path in a browser
```
**Regenerate locally** (from the pipeline repo that produced these files):
`uv run python pipeline/sweetviz_report.py data/episodes.parquet -o reports/sweetviz_episodes.html`
Sweetviz is a third-party tool; report content reflects the table at generation time.
---
## Provenance
- **Transcripts / call metadata:** same underlying universe and years as [`Bose345/sp500_earnings_transcripts`](https://huggingface.co/datasets/Bose345/sp500_earnings_transcripts) (this release **augments** those transcripts with market, SEC, and label columns; respect that dataset’s license and terms when redistributing derived work).
- **Market data:** via [yfinance](https://github.com/ranaroussi/yfinance) (subject to Yahoo / vendor terms of use).
- **Filings:** U.S. SEC EDGAR public data (comply with [SEC fair access](https://www.sec.gov/os/accessing-edgar-data) and rate-limiting expectations when re-fetching).
- **XBRL fundamentals:** derived from SEC **company facts** (same public data policy as above); re-fetch only with a proper [User-Agent](https://www.sec.gov/os/accessing-edgar-data) and polite throughput.
This package is a **processed merge** for research; it is not an official SEC or exchange product.
---
## Loading examples
**pandas / PyArrow**
```python
import pandas as pd

ep = pd.read_parquet("episodes.parquet")
print(ep.shape, ep.columns[:5].tolist())

# Optional: only episodes with SEC 8-K body text populated
ep_8k = pd.read_parquet("episodes_press_release_8k.parquet")
print(ep_8k.shape)

# Optional: rows where both headline XBRL fields (revenue, net income) are populated
ep_xbrl = ep.dropna(subset=["xbrl_revenue", "xbrl_net_income"])
print(ep_xbrl.shape)
```
**Hugging Face `datasets`** (if you upload parquet to a Hub dataset repo)
```python
from datasets import Dataset
ds = Dataset.from_parquet("episodes.parquet") # or hf://datasets/<user>/<name>/path.parquet
print(ds)
```
---
## Use cases
- Train or evaluate models on **text + tabular market context** with aligned **forward returns** and optional **reported fundamentals** (`xbrl_*`).
- Build **RL environments** where observations include call text and pre-earnings features and rewards depend on realized moves (subject to your own leakage and causality checks).
- Reproduce or extend the pipeline using the sibling repository that emits these files.
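As a minimal sketch of the RL framing, a single episode can be treated as a one-step decision: the agent sees text and pre-call features, takes a position, and is graded on the realized move. Everything below (the `EpisodeStep` type, the choice of observation columns, the three-valued action, and the reward rule) is an assumption for illustration rather than an environment shipped with the dataset; the row values are made up.

```python
from dataclasses import dataclass

# Columns treated as the observation, picked from the schema's text and pre-call groups.
OBSERVATION_COLUMNS = (
    "earnings_transcript",
    "price_momentum_30d",
    "price_momentum_90d",
    "pct_from_52w_high_pt",
    "avg_volume_20d",
)


@dataclass
class EpisodeStep:
    episode_id: str
    observation: dict
    realized_move: float  # hidden from the agent until it has acted


def make_step(row: dict, target: str = "move_1d") -> EpisodeStep:
    """Split one episode row into what the agent sees and the outcome used for grading."""
    observation = {name: row[name] for name in OBSERVATION_COLUMNS}
    return EpisodeStep(row["episode_id"], observation, float(row[target]))


def reward(action: int, step: EpisodeStep) -> float:
    """Reward for a position of -1 (short), 0 (flat), or +1 (long)."""
    if action not in (-1, 0, 1):
        raise ValueError("action must be -1, 0, or 1")
    return action * step.realized_move


# Made-up row for illustration.
row = {
    "episode_id": "example-1",
    "earnings_transcript": "Operator: Good afternoon ...",
    "price_momentum_30d": 0.04,
    "price_momentum_90d": 0.10,
    "pct_from_52w_high_pt": -0.02,
    "avg_volume_20d": 1_500_000.0,
    "move_1d": -0.03,
    "sentiment_label": "bearish",
}
step = make_step(row)
print(reward(-1, step), reward(0, step), reward(1, step))
```

Keeping label columns such as `move_1d` and `sentiment_label` out of the observation is the simplest leakage guard; whether the transcript itself is available before the graded price window is something to verify for your setup.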
---
## Limitations
- Rows may contain **nulls** where a source (e.g. a filing or price window) was missing; use the audit columns and null summaries in the Sweetviz report or your own QC.
- **`xbrl_*` columns are intentionally sparse:** many episodes will have nulls (no CIK, no matching GAAP fact for the quarter, or `year < 2009`). Do not assume complete fundamentals coverage.
- **Survivorship and sample bias** follow the upstream universe and filters.
- **Non-stationarity:** financial regimes change; test generalization across time and sectors.
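One way to test generalization across time is a chronological split on `year`, followed by a per-sector look at the held-out rows. The helper `split_by_year` and the cut-off year are assumptions for this sketch, and the toy values are made up.

```python
import pandas as pd


def split_by_year(ep: pd.DataFrame, last_train_year: int):
    """Train on episodes up to last_train_year; test on strictly later years."""
    train = ep[ep["year"] <= last_train_year]
    test = ep[ep["year"] > last_train_year]
    return train, test


# Toy frame standing in for episodes.parquet; values are made up.
toy = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "AAA", "BBB"],
        "year": [2017, 2018, 2019, 2020],
        "sector": ["Tech", "Energy", "Tech", "Energy"],
        "move_1d": [0.01, -0.02, 0.03, -0.01],
    }
)
train, test = split_by_year(toy, last_train_year=2018)

# Per-sector mean of the held-out target, as one quick breakdown of the test period.
test_by_sector = test.groupby("sector")["move_1d"].mean()
print(len(train), len(test))
```

A random row-level split would mix periods and can overstate out-of-sample performance, which is why the split here is by calendar year.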
---
## Citation
If you use this dataset, cite the **upstream transcript dataset** as its authors request, plus a citation or link to **this Hub dataset**. Example BibTeX skeleton (fill in author as appropriate):
```bibtex
@misc{earnings_episodes_2026,
  title = {S\&P 500 Earnings Episodes (merged transcripts, prices, SEC, labels)},
  author = {YOUR NAME OR ORG},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data}},
  note = {Augments Bose345/sp500\_earnings\_transcripts (2005--2025); adds yfinance, SEC EDGAR-derived fields, and optional SEC XBRL companyfacts (us-gaap) on episodes from 2009+.}
}
```
---
## License
This dataset card specifies **MIT** (`license: mit` in the frontmatter). You remain responsible for **upstream** terms (e.g. the Hugging Face transcript dataset, Yahoo/yfinance, SEC redistribution) when publishing or redistributing derived data.
Provider:
RudrakshNanavaty



