HYL/NASDAQ-News-Multi-LLM-Scores

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/HYL/NASDAQ-News-Multi-LLM-Scores

Description:

---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- finance
- sentiment-analysis
- risk-assessment
- llm-scoring
- multi-model
- nasdaq
- stock-market
- news
- reinforcement-learning
size_categories:
- 100K<n<1M
---

# NASDAQ News Multi-LLM Scores

**127,176 financial news articles scored by 11 state-of-the-art LLMs for sentiment and risk assessment.**

This dataset takes the same articles from [FNSPID](https://huggingface.co/datasets/Zihan1004/FNSPID) / [FinRL_DeepSeek](https://github.com/benstaf/FinRL_DeepSeek) and re-scores them using multiple LLMs with varying reasoning effort levels and summary inputs. It enables direct cross-model comparison of financial sentiment analysis on identical articles.

## Motivation

When we began using the FNSPID dataset for RL trading agent research, we encountered two practical issues:

1. **Cost of full-text scoring.** Many articles are thousands of tokens long. Scoring each one directly with reasoning-capable LLMs (GPT-5, o3, Claude Opus) is prohibitively expensive at scale, especially across multiple effort levels and models.
2. **Low quality of extractive summaries.** The original dataset includes four extractive summaries (LSA, Luhn, TextRank, LexRank). On manual inspection, we found that these summaries often lose important semantic nuances: key context about whether a financial event is positive or negative is frequently missing or ambiguous, which makes them unreliable as scoring input.

Our solution: **generate high-quality abstractive summaries first, then score those summaries.** We used GPT-5 and o3 to produce concise, financially relevant summaries that preserve the sentiment-critical context. These summaries can then be reused across many scoring models at a fraction of the cost of scoring full articles.

This also opened the door to a systematic study: **how do summary quality and reasoning effort affect the resulting sentiment/risk scores?** We generated GPT-5 and GPT-5-mini summaries at 12 different reasoning × verbosity combinations each (4 × 3 grid), and scored them with models at varying effort levels. The results show that both summary quality and scoring effort produce meaningful differences (see [Analysis](#analysis) below).

## Key Features

- **11 scoring models**: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5, o3, o4-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-5-mini, GPT-5.4-nano
- **60 score columns** (30 sentiment + 30 risk) across different model/effort/input combinations
- **26 summary variants**: GPT-5 and GPT-5-mini summaries at 4 reasoning × 3 verbosity levels, plus o3 summaries
- **Same articles, different models**: enables apples-to-apples comparison
- **Effort level comparison**: how does reasoning effort (high/medium/low/minimal) affect scoring?
- **Summary input comparison**: how does the quality of the input summary affect downstream scores?
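
As a rough illustration of the 4 × 3 reasoning × verbosity grid mentioned above, the sketch below enumerates the twelve combinations per generator. The `gpt5_R{reasoning}_V{verbosity}` pattern is the one given in the Summary Files section; the helper name `grid_columns` is hypothetical, and the exact spelling of the levels in the files (for example `medium` versus `med`) is an assumption that should be checked against the actual column list.

```python
from itertools import product

# Levels as listed in this card; the on-disk spelling is an assumption.
REASONING = ["high", "medium", "low", "minimal"]
VERBOSITY = ["high", "medium", "low"]


def grid_columns(prefix: str) -> list:
    """Return the 4 x 3 = 12 column names for one summary generator."""
    return [f"{prefix}_R{r}_V{v}" for r, v in product(REASONING, VERBOSITY)]


gpt5_cols = grid_columns("gpt5")
gpt5mini_cols = grid_columns("gpt5mini")
print(len(gpt5_cols), len(gpt5mini_cols))
```

Comparing a list like this with `df.columns` after loading a grid file is a quick way to spot naming differences before selecting columns.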

## Dataset Structure

| File | Size | Description |
|------|------|-------------|
| `scores.parquet` | 12 MB | All 60 score columns + article metadata |
| `summaries.parquet` | 329 MB | Article text + core summaries used for scoring |
| `summaries_gpt5_grid.parquet` | 323 MB | GPT-5 summary variants (4 reasoning × 3 verbosity) |
| `summaries_gpt5mini_grid.parquet` | 320 MB | GPT-5-mini summary variants (4 reasoning × 3 verbosity) |

## Quick Start

```python
import pandas as pd

# Load scores only (12 MB).
# Reading hf:// paths requires the huggingface_hub package to be installed.
scores = pd.read_parquet("hf://datasets/HYL/NASDAQ-News-Multi-LLM-Scores/scores.parquet")

# Compare Claude Opus vs GPT-5 sentiment
print(scores[['sentiment_opus_gpt5sum', 'sentiment_gpt5_high_gpt5sum']].describe())

# Check cross-model agreement
cols = [c for c in scores.columns if c.startswith('sentiment_') and 'gpt5sum' in c]
print(scores[cols].corr())
```

```python
# Load with summaries for full context (reuses `pd` and `scores` from above)
summaries = pd.read_parquet("hf://datasets/HYL/NASDAQ-News-Multi-LLM-Scores/summaries.parquet")
merged = scores.merge(summaries, on=['Date', 'Article_title', 'Stock_symbol'])
```

## Scoring Pipeline

```
Article (full text) ─────────────────→ o3-high, o4-mini-high (fulltext)
  │
  ├─→ 4 extractive summaries ─────→ (included, not used for LLM scoring)
  │
  ├─→ GPT-5 summary ──┬──→ Claude Opus/Sonnet/Haiku
  │   (R=high V=high) ├──→ GPT-5 {high,med,low,min}
  │                   ├──→ o3-high
  │                   ├──→ GPT-5-mini
  │                   └──→ GPT-4.1-mini (5 summary variants)
  │
  ├─→ o3 summary ─────┬──→ GPT-5 {high,med,low,min}
  │                   ├──→ o3 {high,med,low}
  │                   ├──→ o4-mini {high,med,low}
  │                   └──→ GPT-4.1 / GPT-4.1-mini / GPT-4.1-nano
  │
  └─→ Title only ─────────→ GPT-5.4-nano (100% coverage)
```

## Model Versions

| Model | API Model ID | Snapshot | Scored |
|-------|-------------|----------|--------|
| Claude Opus 4.5 | `claude-opus-4-5` | - | 2026-01 |
| Claude Sonnet 4.5 | `claude-sonnet-4-5-20250929` | - | 2026-01 |
| Claude Haiku 4.5 | `claude-haiku-4-5-20251001` | - | 2026-01 |
| GPT-5 | `gpt-5` | `gpt-5-2025-08-07` | 2025-08 to 2025-09 |
| GPT-5-mini | `gpt-5-mini` | `gpt-5-mini-2025-08-07` | 2025-08 to 2025-09 |
| o3 | `o3` | - | 2025-07 to 2025-09 |
| o4-mini | `o4-mini` | `o4-mini-2025-04-16` | 2025-07 to 2025-09 |
| GPT-4.1 | `gpt-4.1` | `gpt-4.1-2025-04-14` | 2025-08 to 2025-09 |
| GPT-4.1-mini | `gpt-4.1-mini` | `gpt-4.1-mini-2025-04-14` | 2025-08 to 2025-09 |
| GPT-4.1-nano | `gpt-4.1-nano` | `gpt-4.1-nano-2025-04-14` | 2025-08 to 2025-09 |
| GPT-5.4-nano | `gpt-5.4-nano` | - | 2026-04 |

**Summary generators:**

| Summary | Generator Model | Generated |
|---------|----------------|-----------|
| `gpt_5_summary` (12 R×V variants) | GPT-5 (`gpt-5-2025-08-07`) | 2025-08 to 2025-09 |
| `gpt_5_mini_summary` (12 R×V variants) | GPT-5-mini (`gpt-5-mini-2025-08-07`) | 2025-08 to 2025-09 |
| `o3_summary` | o3 | 2025-07 to 2025-09 |

**Upstream (not included, already open-source):**

- DeepSeek V3 scores → [benstaf/FinRL_DeepSeek](https://github.com/benstaf/FinRL_DeepSeek) ([arXiv:2502.07393](https://arxiv.org/abs/2502.07393))

## Column Naming Convention

Score columns follow the pattern `{sentiment|risk}_{model}_{effort}_{input_source}`. The `{effort}` segment is absent from some columns (for example `sentiment_opus_gpt5sum` and `sentiment_nano_title`).

| Suffix | Meaning |
|--------|---------|
| `_gpt5sum` | Scored using GPT-5 generated summary |
| `_o3sum` | Scored using o3 generated summary |
| `_fulltext` | Scored directly from full article text |
| `_title` | Scored from article title only |
| `_gpt5sum_Rhigh_Vmed` | Scored using GPT-5 summary (reasoning=high, verbosity=medium) |

All scores are integers on a **1-5 scale** (1 = most negative/highest risk, 5 = most positive/lowest risk).
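
The naming convention above can be turned into a small parser when many columns need to be grouped by model, effort, or input source. The following is a minimal sketch, not part of the dataset tooling: the function name `parse_score_column` and the regular expression are assumptions derived from the column names listed in this card, and columns that do not fit the pattern simply return `None`.

```python
import re
from typing import Optional

# Pattern inferred from the column names in this card (an assumption):
# kind, model, optional effort, input source, optional R/V summary variant.
_SCORE_RE = re.compile(
    r"(?P<kind>sentiment|risk)"
    r"_(?P<model>[a-z0-9]+)"
    r"(?:_(?P<effort>high|medium|low|minimal))?"
    r"_(?P<source>gpt5sum|o3sum|fulltext|title)"
    r"(?:_R(?P<reasoning>[a-z]+)_V(?P<verbosity>[a-z]+))?"
)


def parse_score_column(name: str) -> Optional[dict]:
    """Split a score column name into its parts, or return None if it does not match."""
    match = _SCORE_RE.fullmatch(name)
    return match.groupdict() if match else None


for col in [
    "sentiment_gpt5_high_gpt5sum",
    "risk_o3_medium_fulltext",
    "sentiment_opus_gpt5sum",
    "sentiment_nano_title",
    "Article_title",
]:
    print(col, "->", parse_score_column(col))
```

Fields that a column does not carry (for example the effort level of `sentiment_opus_gpt5sum`) come back as `None`, which makes it straightforward to filter with a list comprehension over `scores.columns`.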

## Score Columns (scores.parquet)

### Metadata

| Column | Description |
|--------|-------------|
| Date | Publication date (YYYY-MM-DD) |
| Article_title | Article headline |
| Stock_symbol | Ticker symbol |
| Url | Source URL |
| Publisher | News publisher |
| Author | Article author |

### Full-text Scores (61% coverage)

| Column | Model | Effort |
|--------|-------|--------|
| `sentiment_o3_high_fulltext` | o3 | high |
| `risk_o3_medium_fulltext` | o3 | medium |
| `sentiment_o4mini_high_fulltext` | o4-mini | high |
| `risk_o4mini_medium_fulltext` | o4-mini | medium |

### Claude Models: by GPT-5 summary (61% coverage)

`sentiment_opus_gpt5sum`, `risk_opus_gpt5sum`, `sentiment_sonnet_gpt5sum`, `risk_sonnet_gpt5sum`, `sentiment_haiku_gpt5sum`, `risk_haiku_gpt5sum`

### GPT-5: 4 effort levels × 2 summary sources (61% coverage)

Here and in the patterns below, `{s|r}` is shorthand for `{sentiment|risk}`.

By GPT-5 summary: `{s|r}_gpt5_{high|medium|low|minimal}_gpt5sum` (8 cols)

By o3 summary: `{s|r}_gpt5_{high|medium|low|minimal}_o3sum` (8 cols)

### o3: 3 efforts × o3 summary + GPT-5 summary (61% coverage)

`{s|r}_o3_{high|medium|low}_o3sum` (6 cols), `{s|r}_o3_high_gpt5sum` (2 cols)

### o4-mini: 3 efforts × o3 summary (61% coverage)

`{s|r}_o4mini_{high|medium|low}_o3sum` (6 cols)

### GPT-4.1 family (61% coverage)

| Column pattern | Model | Summary input |
|----------------|-------|---------------|
| `{s\|r}_gpt41_o3sum` | GPT-4.1 | o3 summary |
| `{s\|r}_gpt41mini_gpt5sum_R{x}_V{y}` | GPT-4.1-mini | GPT-5 summary (5 variants) |
| `{s\|r}_gpt41mini_o3sum` | GPT-4.1-mini | o3 summary |
| `{s\|r}_gpt41nano_o3sum` | GPT-4.1-nano | o3 summary |

### GPT-5-mini (61% coverage)

`sentiment_gpt5mini_high_gpt5sum`, `risk_gpt5mini_high_gpt5sum`

### GPT-5.4-nano: title only (100% coverage)

`sentiment_nano_title`, `risk_nano_title`

## Summary Files

### summaries.parquet: core summaries used for scoring

| Column | Generator | Coverage |
|--------|-----------|----------|
| Article | none (original text) | 61% |
| Lsa_summary | LSA algorithm | 61% |
| Luhn_summary | Luhn algorithm | 61% |
| Textrank_summary | TextRank | 61% |
| Lexrank_summary | LexRank | 61% |
| gpt_5_summary | GPT-5 (R=high, V=high) | 61% |
| o3_summary | o3 | 61% |

### summaries_gpt5_grid.parquet: GPT-5 summary reasoning × verbosity grid

12 columns: `gpt5_R{reasoning}_V{verbosity}` where reasoning ∈ {high, medium, low, minimal} and verbosity ∈ {high, medium, low}. `gpt5_Rhigh_Vhigh` is the same text as `gpt_5_summary` in summaries.parquet.

### summaries_gpt5mini_grid.parquet: GPT-5-mini summary grid

Same 4×3 structure: `gpt5mini_R{reasoning}_V{verbosity}`

## Coverage

- **127,176 total articles** (89 NASDAQ tickers, 2009–2024)
- **61.2%** (77,871) have LLM summaries and summary-based scores (articles with text content)
- **100%** for GPT-5.4-nano scores (title-only, no article text needed)
- **38.8%** of rows lack article content in the source data → NaN for all summary-based columns

## Analysis

### Cross-Model Correlation

Models using the same summary input form high-correlation clusters. Nano (title-only) is the outlier: limited input produces systematically different scores.

![Cross-Model Correlation](figures/fig_correlation_heatmap.png)

### Score Distribution

Each model has distinct scoring tendencies. Nano is the most conservative (60.8% neutral); o3 is the most opinionated (37.1% neutral).

![Score Distribution](figures/fig_score_distribution.png)

### Effort Level Impact

Higher reasoning effort produces meaningfully different scores. Adjacent effort levels disagree on about 12% of scores; high versus minimal effort disagree on 28%.

![Effort Disagreement](figures/fig_effort_disagreement.png)

### Summary Source Impact

Switching from the GPT-5 summary to the o3 summary changes 17–22% of scores, with minimal-effort scoring being the most sensitive.

![Summary Impact](figures/fig_summary_impact.png)

### RL Trading Results

Using these scores for RL trading agent training (PPO, SB3):

- **GPT-5 high vs low effort**: 16.6% disagreement rate on the same articles; high vs minimal: 28.0%
- **GPT-5 summary vs o3 summary**: 18.2% disagreement; summary source affects scoring comparably to effort level
- **Best single-model Sharpe**: PPO with GPT-5-mini high scores achieved Sharpe 1.032 on NASDAQ backtest (2019-2023)
- **Multi-seed validation** (5 seeds × 4 algorithms): PPO most robust (0.777±0.098), SAC highest mean (0.780±0.047)

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{hyl2026nasdaq_multi_llm,
  title={NASDAQ News Multi-LLM Scores},
  author={HYL},
  year={2026},
  url={https://huggingface.co/datasets/HYL/NASDAQ-News-Multi-LLM-Scores},
  note={Multi-LLM re-scoring of FNSPID financial news articles}
}
```

Original data sources:

```bibtex
@misc{dong2024fnspid,
  title={FNSPID: A Comprehensive Financial News Dataset in Time Series},
  author={Zihan Dong and Xinyu Fan and Zhiyuan Peng},
  year={2024},
  eprint={2402.06698},
  archivePrefix={arXiv}
}

@misc{staf2025finrl,
  title={Enhancing Financial Trading with LLM-Augmented Sentiment Analysis},
  author={Ben Staf},
  year={2025},
  eprint={2502.07393},
  archivePrefix={arXiv}
}
```

## License

CC BY-NC 4.0 (inherited from upstream FNSPID dataset). Non-commercial use only.

## Links

- Source code: [ArkScope](https://github.com/HYL-Dave/ArkScope)
- Original dataset: [FNSPID](https://huggingface.co/datasets/Zihan1004/FNSPID)
- FinRL_DeepSeek: [arXiv:2502.07393](https://arxiv.org/abs/2502.07393)
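
The coverage percentages and disagreement rates quoted in this card are not accompanied by their exact definitions, so the sketch below shows one plausible way to compute comparable numbers. It assumes that coverage means the share of non-null values in a column and that disagreement means the share of rows, among those where both columns are non-null, whose scores differ. The helper names `coverage` and `disagreement_rate` are hypothetical, and the small DataFrame is synthetic demonstration data, not taken from the dataset.

```python
import pandas as pd


def coverage(df: pd.DataFrame, col: str) -> float:
    """Share of rows with a non-null value in `col` (assumed definition)."""
    return float(df[col].notna().mean())


def disagreement_rate(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Share of rows scored by both columns whose scores differ (assumed definition)."""
    both = df[[col_a, col_b]].dropna()
    if both.empty:
        return float("nan")
    return float((both[col_a] != both[col_b]).mean())


# Synthetic example; the column names follow the card's naming convention.
demo = pd.DataFrame(
    {
        "sentiment_gpt5_high_gpt5sum": [1, 2, 3, None, 5],
        "sentiment_gpt5_low_gpt5sum": [1, 3, 3, 4, None],
    }
)
print(coverage(demo, "sentiment_gpt5_high_gpt5sum"))
print(disagreement_rate(demo, "sentiment_gpt5_high_gpt5sum", "sentiment_gpt5_low_gpt5sum"))
```

Applied to `scores.parquet`, the same two functions can be pointed at any pair of score columns; results may differ from the figures above if the authors used a different definition.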

Provider:

HYL
