RudrakshNanavaty/earnings-call-data

Hugging Face | Updated 2026-04-12 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/RudrakshNanavaty/earnings-call-data
Resource summary:
---
language:
- en
license: mit
tags:
- finance
- earnings
- transcripts
- time-series
- parquet
- reinforcement-learning
- sp500
- xbrl
- fundamentals
size_categories:
- 10K-100K
task_categories:
- text-classification
- summarization
- feature-extraction
- text-retrieval
- reinforcement-learning
- other
pretty_name: S&P 500 earnings episodes (2005–2025; merged transcripts, prices, SEC, labels)
---

# S&P 500 earnings episodes (2005–2025)

**Augmented release** built on [`Bose345/sp500_earnings_transcripts`](https://huggingface.co/datasets/Bose345/sp500_earnings_transcripts) (same transcript calendar span as that collection: **2005–2025**). Static tabular data for supervised learning or RL-style experiments on **earnings-call episodes**. Each row is one company–quarter call, keyed by a stable `episode_id`, with long-form text (full earnings transcript, SEC press materials), pre-earnings price context, OHLCV anchors, **SEC XBRL fundamentals** (`xbrl_*` columns), and **post-earnings return labels**.

**Companion report:** **`sweetviz_episodes.html`**, a **Sweetviz** profile of `episodes.parquet`, shipped in this dataset repo. [View on the Hub](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/blob/main/sweetviz_episodes.html) or download the [raw file](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/resolve/main/sweetviz_episodes.html) and open it locally in a browser (distributions, missingness, associations).

---

## What’s in this folder

These files are the **materialized outputs** of the build pipeline (upstream Hugging Face transcripts → Yahoo Finance prices → SEC EDGAR 8-K press text → feature engineering → merge → optional XBRL join). Intermediate download caches usually live under `data/cache/` locally and are **not** required for analysis if you only use the parquet files below.

| File | Role |
|------|------|
| **`episodes.parquet`** | **Primary dataset**: one row per episode with identity, text, features, OHLCV anchors, **SEC XBRL fundamentals** (`xbrl_*`), and labels (see [Schema](#schema-episodesparquet)). |
| **`episodes_press_release_8k.parquet`** | **Subset** of `episodes.parquet`: only rows where `press_release_8k_body` is not null (same schema; fewer rows, on the order of **~16k** after a full pipeline run). [Browse on the Hub](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/blob/main/episodes_press_release_8k.parquet). Produced locally with `uv run python pipeline/filter_episodes_press_release_8k.py`. |
| **`sweetviz_episodes.html`** | **Exploratory HTML report** (Sweetviz) for `episodes.parquet`; same folder on the Hub as the parquet files ([see below](#sweetviz-html)). |
| `raw_hf.parquet` | Base transcript metadata and structured content source fields from the upstream Hugging Face dataset (see [Provenance](#provenance)). |
| `raw_prices.parquet` | Per-episode OHLCV anchors, sector, and price-derived fields from market data. |
| `raw_press_releases.parquet` | SEC 8-K body and exhibit text (e.g. EX-99.1 / EX-99.2) aligned to each episode. |
| `features.parquet` | Formatted earnings transcript, text flags, momentum/volume features, and label columns produced in the feature stage. |

Rough scale (after a full pipeline run): about **33k rows** in `episodes.parquet`, about **16k rows** in `episodes_press_release_8k.parquet`, and **hundreds of tickers** (in line with upstream transcript coverage) across **2005–2025**. Confirm row and symbol counts on your copy: after `ep = pd.read_parquet("episodes.parquet")`, check `len(ep)` and `ep["symbol"].nunique()`.

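As a rough check of how the two main files relate, the sketch below rebuilds the 8-K subset from an in-memory frame by keeping rows whose `press_release_8k_body` is not null. The helper name `filter_press_release_8k`, the `episode_id` format, and the toy rows are illustrative assumptions; the actual `pipeline/filter_episodes_press_release_8k.py` script may do more.

```python
import pandas as pd


def filter_press_release_8k(episodes: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows with a non-null 8-K body (hypothetical helper)."""
    mask = episodes["press_release_8k_body"].notna()
    return episodes[mask].reset_index(drop=True)


# Toy frame standing in for episodes.parquet (all values are made up).
toy = pd.DataFrame(
    {
        "episode_id": ["AAA_2020_Q1", "BBB_2020_Q1", "AAA_2020_Q2"],
        "symbol": ["AAA", "BBB", "AAA"],
        "press_release_8k_body": ["Item 2.02 ...", None, "Item 2.02 ..."],
    }
)

subset = filter_press_release_8k(toy)

# Row counts for the full frame and the subset, then distinct tickers.
print(len(toy), len(subset), toy["symbol"].nunique())
```

On a real copy, replace `toy` with `pd.read_parquet("episodes.parquet")` and compare the subset's length with the shipped `episodes_press_release_8k.parquet`.
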
---

## Schema (`episodes.parquet`)

Columns follow this order in the merged export:

**Identity:** `episode_id`, `symbol`, `company_name`, `company_id`, `year`, `quarter`, `date`, `earnings_date`, `sector`

**Text (observation):** `earnings_transcript`, `press_release_8k_body`, `press_release_ex991`, `press_release_ex992`, `press_release_sources`

**Text flags:** `guidance_mentioned`, `beat_mentioned`

**Pre-call price features:** `price_momentum_30d`, `price_momentum_90d`, `pct_from_52w_high_pt`, `avg_volume_20d`

**OHLCV anchors (grading / simulation):** `d_minus_1_*`, `d_plus_1_*`, `d_plus_30_*`, `next_qtr_d_minus_1_*` (open, high, low, close, volume as listed in the table)

**Labels / targets:** `sentiment_label`, `move_1d`, `move_30d`, `move_next_qtr`, `move_1d_direction`, `gap_open_d1`, `volume_surge_d1`

**Audit / quality:** `next_qtr_date`

**XBRL (SEC EDGAR companyfacts, 2009+):** Per-episode numeric facts from the SEC **company facts** JSON API (`data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json`), documented under [SEC EDGAR APIs](https://www.sec.gov/edgar/sec-api-documentation). Facts use **`us-gaap`** concepts only. Episodes with **`year < 2009`** have nulls in all `xbrl_*` columns (no companyfacts match is attempted for those rows).

**How it is joined:** each episode’s ticker maps to a **CIK** via the same SEC ticker map used elsewhere in the pipeline (`data/cache/edgar/cik_map.json`, built during EDGAR steps or with `uv run python pipeline/build_cik_map.py`). If no CIK is found, companyfacts are not fetched for that row. After the merged table exists, run:

`uv run python pipeline/06_xbrl.py`

That step fills `xbrl_*` on **`episodes.parquet`** and refreshes **`episodes_press_release_8k.parquet`** with the same columns. Requests respect SEC rate limits (under 10 requests per second). When you run the pipeline locally, gaps and reasons are appended to **`reports/failures_xbrl.csv`** (not required to use the Hub parquet).

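To make the join concrete, here is a minimal sketch of how a ticker could be turned into a companyfacts URL while staying under the stated request rate. The zero-padded 10-digit CIK follows the SEC's published endpoint format; the shape of the ticker-to-CIK mapping, the helper names, and the `MIN_INTERVAL_S` value are assumptions for illustration, not the contents of `pipeline/06_xbrl.py`.

```python
import time
from typing import Optional

COMPANYFACTS_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
MIN_INTERVAL_S = 0.125  # assumed spacing: 8 requests/second, below the 10/s limit


def companyfacts_url(symbol: str, cik_map: dict) -> Optional[str]:
    """Return the companyfacts URL for a ticker, or None if no CIK is known."""
    cik = cik_map.get(symbol.upper())
    if cik is None:
        return None  # mirrors the card: no CIK means no companyfacts fetch
    return COMPANYFACTS_URL.format(cik=f"{int(cik):010d}")


class Throttle:
    """Sleep as needed so consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval: float = MIN_INTERVAL_S) -> None:
        self.min_interval = min_interval
        self._last = float("-inf")  # so the first call never sleeps

    def wait(self) -> None:
        delay = self.min_interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()


# Hypothetical mapping; the real cik_map.json layout may differ.
cik_map = {"AAPL": 320193}
throttle = Throttle()
for symbol in ["AAPL", "ZZZZ"]:
    url = companyfacts_url(symbol, cik_map)
    if url is None:
        continue  # unknown ticker: skip, as the pipeline description says
    throttle.wait()  # call immediately before each HTTP request
    print(url)
```

The sketch stops at URL construction on purpose; an actual fetch should also send a descriptive `User-Agent` header, as noted under Provenance.
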
**Matching logic:** each metric tries **several GAAP local names in priority order** (e.g. revenue tries `Revenues`, then revenue-from-contract variants, then net sales) so more cells populate despite issuer tag choice; see `pipeline/06_xbrl.py` for the exact chains.

**Provenance (string):** for each value column there is a sibling `*_tag` column (e.g. `xbrl_revenue_tag`) with the **winning** local GAAP name, or null if the value is null.

- **Income statement:** `xbrl_revenue`, `xbrl_cost_of_revenue`, `xbrl_gross_profit`, `xbrl_operating_income`, `xbrl_net_income`, `xbrl_eps_basic`, `xbrl_eps_diluted`, plus `xbrl_revenue_tag`, …, `xbrl_eps_diluted_tag`
- **Balance sheet:** `xbrl_cash_and_cash_equivalents`, `xbrl_total_assets`, `xbrl_total_liabilities`, plus `xbrl_cash_and_cash_equivalents_tag`, `xbrl_total_assets_tag`, `xbrl_total_liabilities_tag`
- **Cash flow:** `xbrl_net_cash_operating_activities`, `xbrl_capital_expenditures`, plus `xbrl_net_cash_operating_activities_tag`, `xbrl_capital_expenditures_tag`

Treat these fields as **best-effort fundamentals aligned to the earnings quarter**, not audited restatements; expect **sparse cells** where filings, tags, or timing do not yield a match.

`sentiment_label` is derived from `move_1d` using fixed percentage bands (very bearish through very bullish). Treat labels as **historical hindsight** for research, not investment advice.

---

## Sweetviz HTML

The **Sweetviz** report is an exploratory companion to **`episodes.parquet`** only. It summarizes column types, missingness, numeric distributions, and target associations without loading the full frame in a notebook.

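If you already have the frame loaded, a few lines of pandas give a comparable missingness view for the `xbrl_*` columns. The sketch below uses a toy frame with two of the documented value columns and their `*_tag` siblings; the values and the helper name `null_summary` are made up for illustration.

```python
import pandas as pd


def null_summary(df: pd.DataFrame, prefix: str = "xbrl_") -> pd.Series:
    """Fraction of null cells per column whose name starts with prefix."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].isna().mean().sort_values(ascending=False)


# Toy rows standing in for episodes.parquet (all values are made up).
toy = pd.DataFrame(
    {
        "year": [2008, 2015, 2016, 2017],
        "xbrl_revenue": [None, 100.0, 120.0, None],
        "xbrl_revenue_tag": [None, "Revenues", "Revenues", None],
        "xbrl_net_income": [None, 10.0, None, None],
        "xbrl_net_income_tag": [None, "NetIncomeLoss", None, None],
    }
)

summary = null_summary(toy)
print(summary)

# Which GAAP tag supplied each populated revenue value.
print(toy["xbrl_revenue_tag"].value_counts())
```

Counting values in a `*_tag` column is also a quick way to see which GAAP name won most often for a given metric.
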
**On this Hub repo** the file lives next to the parquet exports:

- **Filename:** `sweetviz_episodes.html`
- **Browse:** [dataset files → `sweetviz_episodes.html`](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/tree/main)
- **Direct download:** [`.../resolve/main/sweetviz_episodes.html`](https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data/resolve/main/sweetviz_episodes.html)

**Download with Python** ([`huggingface_hub`](https://huggingface.co/docs/huggingface_hub)):

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="RudrakshNanavaty/earnings-call-data",
    filename="sweetviz_episodes.html",
    repo_type="dataset",
)
print(path)  # open this path in a browser
```

**Regenerate locally** (from the pipeline repo that produced these files):

`uv run python pipeline/sweetviz_report.py data/episodes.parquet -o reports/sweetviz_episodes.html`

Sweetviz is a third-party tool; report content reflects the table at generation time.

---

## Provenance

- **Transcripts / call metadata:** same underlying universe and years as [`Bose345/sp500_earnings_transcripts`](https://huggingface.co/datasets/Bose345/sp500_earnings_transcripts) (this release **augments** those transcripts with market, SEC, and label columns; respect that dataset’s license and terms when redistributing derived work).
- **Market data:** via [yfinance](https://github.com/ranaroussi/yfinance) (subject to Yahoo / vendor terms of use).
- **Filings:** U.S. SEC EDGAR public data (comply with [SEC fair access](https://www.sec.gov/os/accessing-edgar-data) and rate-limiting expectations when re-fetching).
- **XBRL fundamentals:** derived from SEC **company facts** (same public data policy as above); re-fetch only with a proper [User-Agent](https://www.sec.gov/os/accessing-edgar-data) and polite throughput.

This package is a **processed merge** for research; it is not an official SEC or exchange product.

---

## Loading examples

**pandas / PyArrow**

```python
import pandas as pd

ep = pd.read_parquet("episodes.parquet")
print(ep.shape, ep.columns[:5].tolist())

# Optional: only episodes with SEC 8-K body text populated
ep_8k = pd.read_parquet("episodes_press_release_8k.parquet")
print(ep_8k.shape)

# Optional: rows with at least headline XBRL (example)
ep_xbrl = ep.dropna(subset=["xbrl_revenue", "xbrl_net_income"])
print(ep_xbrl.shape)
```

**Hugging Face `datasets`** (if you upload parquet to a Hub dataset repo)

```python
from datasets import Dataset

ds = Dataset.from_parquet("episodes.parquet")  # or hf://datasets/<user>/<name>/path.parquet
print(ds)
```

---

## Use cases

- Train or evaluate models on **text + tabular market context** with aligned **forward returns** and optional **reported fundamentals** (`xbrl_*`).
- Build **RL environments** where observations include call text and pre-earnings features and rewards depend on realized moves (subject to your own leakage and causality checks).
- Reproduce or extend the pipeline using the sibling repository that emits these files.

---

## Limitations

- Rows may contain **nulls** where a source (e.g. a filing or price window) was missing; use the audit columns and null summaries in the Sweetviz report or your own QC.
- **`xbrl_*` columns are intentionally sparse:** many episodes will have nulls (no CIK, no matching GAAP fact for the quarter, or `year < 2009`). Do not assume complete fundamentals coverage.
- **Survivorship and sample bias** follow the upstream universe and filters.
- **Non-stationarity:** financial regimes change; test generalization across time and sectors.

---

## Citation

If you use this dataset, cite the **upstream transcript dataset** as its authors request, plus a citation or link to **this Hub dataset**.

Example BibTeX skeleton (fill in author as appropriate):

```bibtex
@misc{earnings_episodes_2026,
  title        = {S\&P 500 Earnings Episodes (merged transcripts, prices, SEC, labels)},
  author       = {YOUR NAME OR ORG},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/RudrakshNanavaty/earnings-call-data}},
  note         = {Augments Bose345/sp500\_earnings\_transcripts (2005--2025); adds yfinance, SEC EDGAR-derived fields, and optional SEC XBRL companyfacts (us-gaap) on episodes from 2009+.}
}
```

---

## License

This dataset card specifies **MIT** (`license: mit` in the frontmatter). You remain responsible for **upstream** terms (e.g. the Hugging Face transcript dataset, Yahoo/yfinance, SEC redistribution) when publishing or redistributing derived data.

Provider:
RudrakshNanavaty