HYL/NASDAQ-News-Multi-LLM-Scores
Source: Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/HYL/NASDAQ-News-Multi-LLM-Scores
Dataset description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- finance
- sentiment-analysis
- risk-assessment
- llm-scoring
- multi-model
- nasdaq
- stock-market
- news
- reinforcement-learning
size_categories:
- 100K<n<1M
---
# NASDAQ News Multi-LLM Scores
**127,176 financial news articles scored by 11 state-of-the-art LLMs for sentiment and risk assessment.**
This dataset takes the same articles from [FNSPID](https://huggingface.co/datasets/Zihan1004/FNSPID) / [FinRL_DeepSeek](https://github.com/benstaf/FinRL_DeepSeek) and re-scores them using multiple LLMs with varying reasoning effort levels and summary inputs. It enables direct cross-model comparison of financial sentiment analysis on identical articles.
## Motivation
When we began using the FNSPID dataset for RL trading agent research, we encountered two practical issues:
1. **Cost of full-text scoring.** Many articles are thousands of tokens long. Scoring each one directly with reasoning-capable LLMs (GPT-5, o3, Claude Opus) is prohibitively expensive at scale — especially across multiple effort levels and models.
2. **Low quality of extractive summaries.** The original dataset includes four extractive summaries (LSA, Luhn, TextRank, LexRank). Upon manual inspection, we found these summaries often lose important semantic nuances — key context about whether a financial event is positive or negative is frequently missing or ambiguous, making them unreliable as scoring input.
Our solution: **generate high-quality abstractive summaries first, then score those summaries.** We used GPT-5 and o3 to produce concise, financially relevant summaries that preserve the sentiment-critical context. These summaries can then be reused across many scoring models at a fraction of the cost of scoring full articles.
This also opened the door to a systematic study: **how do summary quality and reasoning effort affect the resulting sentiment/risk scores?** We generated GPT-5 and GPT-5-mini summaries at 12 reasoning × verbosity combinations each (a 4 × 3 grid) and scored them with models at varying effort levels. The results show that both summary quality and scoring effort produce meaningful differences (see [Analysis](#analysis) below).
## Key Features
- **11 scoring models**: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5, o3, o4-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-5-mini, GPT-5.4-nano
- **60 score columns** (30 sentiment + 30 risk) across different model/effort/input combinations
- **26 summary variants**: GPT-5 and GPT-5-mini summaries at 4 reasoning × 3 verbosity levels, plus o3 summaries
- **Same articles, different models** — enables apples-to-apples comparison
- **Effort level comparison** — how does reasoning effort (high/medium/low/minimal) affect scoring?
- **Summary input comparison** — how does the quality of input summary affect downstream scores?
## Dataset Structure
| File | Size | Description |
|------|------|-------------|
| `scores.parquet` | 12 MB | All 60 score columns + article metadata |
| `summaries.parquet` | 329 MB | Article text + core summaries used for scoring |
| `summaries_gpt5_grid.parquet` | 323 MB | GPT-5 summary variants (4 reasoning × 3 verbosity) |
| `summaries_gpt5mini_grid.parquet` | 320 MB | GPT-5-mini summary variants (4 reasoning × 3 verbosity) |
## Quick Start
```python
import pandas as pd
# Load scores only (12 MB)
scores = pd.read_parquet("hf://datasets/HYL/NASDAQ-News-Multi-LLM-Scores/scores.parquet")
# Compare Claude Opus vs GPT-5 sentiment
print(scores[['sentiment_opus_gpt5sum', 'sentiment_gpt5_high_gpt5sum']].describe())
# Check cross-model agreement
cols = [c for c in scores.columns if c.startswith('sentiment_') and 'gpt5sum' in c]
print(scores[cols].corr())
```
```python
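import pandas as pd
# Load with summaries for full context (329 MB); `scores` comes from the snippet above
summaries = pd.read_parquet("hf://datasets/HYL/NASDAQ-News-Multi-LLM-Scores/summaries.parquet")
# Inner join on the shared article keys
merged = scores.merge(summaries, on=['Date', 'Article_title', 'Stock_symbol'])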
```
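The merge above relies on `Date`, `Article_title`, and `Stock_symbol` identifying each row uniquely in both files, which this card does not state explicitly. The following is a minimal sketch, on made-up toy frames, of how that assumption could be checked with the `validate` and `indicator` options of `DataFrame.merge`; the helper name `checked_merge` is hypothetical.

```python
import pandas as pd

KEYS = ["Date", "Article_title", "Stock_symbol"]

def checked_merge(scores: pd.DataFrame, summaries: pd.DataFrame) -> pd.DataFrame:
    """Left-join summaries onto scores, failing loudly if the keys are not unique."""
    # validate="one_to_one" raises pandas.errors.MergeError on duplicate keys;
    # indicator=True adds a `_merge` column showing where each row came from.
    return scores.merge(
        summaries, on=KEYS, how="left", validate="one_to_one", indicator=True
    )

# Toy data standing in for the real parquet files (all values are made up)
toy_scores = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03"],
    "Article_title": ["A", "B"],
    "Stock_symbol": ["AAPL", "MSFT"],
    "sentiment_nano_title": [4, 2],
})
toy_summaries = pd.DataFrame({
    "Date": ["2020-01-02"],
    "Article_title": ["A"],
    "Stock_symbol": ["AAPL"],
    "o3_summary": ["Example summary text."],
})

merged_toy = checked_merge(toy_scores, toy_summaries)
print(merged_toy["_merge"].value_counts())
```

If the real files do contain duplicate keys, the call raises instead of silently multiplying rows, which is usually the safer failure mode before computing agreement statistics.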
## Scoring Pipeline
```
Article (full text) ─────────────────→ o3-high, o4-mini-high (fulltext)
  │
  ├─→ 4 extractive summaries ─────→ (included, not used for LLM scoring)
  │
  ├─→ GPT-5 summary ──┬──→ Claude Opus/Sonnet/Haiku
  │   (R=high V=high) ├──→ GPT-5 {high,med,low,min}
  │                   ├──→ o3-high
  │                   ├──→ GPT-5-mini
  │                   └──→ GPT-4.1-mini (5 summary variants)
  │
  ├─→ o3 summary ─────┬──→ GPT-5 {high,med,low,min}
  │                   ├──→ o3 {high,med,low}
  │                   ├──→ o4-mini {high,med,low}
  │                   └──→ GPT-4.1 / GPT-4.1-mini / GPT-4.1-nano
  │
  └─→ Title only ─────────→ GPT-5.4-nano (100% coverage)
```
## Model Versions
| Model | API Model ID | Snapshot | Scored |
|-------|-------------|----------|--------|
| Claude Opus 4.5 | `claude-opus-4-5` | — | 2026-01 |
| Claude Sonnet 4.5 | `claude-sonnet-4-5-20250929` | — | 2026-01 |
| Claude Haiku 4.5 | `claude-haiku-4-5-20251001` | — | 2026-01 |
| GPT-5 | `gpt-5` | `gpt-5-2025-08-07` | 2025-08 ~ 09 |
| GPT-5-mini | `gpt-5-mini` | `gpt-5-mini-2025-08-07` | 2025-08 ~ 09 |
| o3 | `o3` | — | 2025-07 ~ 09 |
| o4-mini | `o4-mini` | `o4-mini-2025-04-16` | 2025-07 ~ 09 |
| GPT-4.1 | `gpt-4.1` | `gpt-4.1-2025-04-14` | 2025-08 ~ 09 |
| GPT-4.1-mini | `gpt-4.1-mini` | `gpt-4.1-mini-2025-04-14` | 2025-08 ~ 09 |
| GPT-4.1-nano | `gpt-4.1-nano` | `gpt-4.1-nano-2025-04-14` | 2025-08 ~ 09 |
| GPT-5.4-nano | `gpt-5.4-nano` | — | 2026-04 |
**Summary generators:**
| Summary | Generator Model | Generated |
|---------|----------------|-----------|
| `gpt_5_summary` (12 R×V variants) | GPT-5 (`gpt-5-2025-08-07`) | 2025-08 ~ 09 |
| `gpt_5_mini_summary` (12 R×V variants) | GPT-5-mini (`gpt-5-mini-2025-08-07`) | 2025-08 ~ 09 |
| `o3_summary` | o3 | 2025-07 ~ 09 |
**Upstream (not included, already open-source):**
- DeepSeek V3 scores → [benstaf/FinRL_DeepSeek](https://github.com/benstaf/FinRL_DeepSeek) ([arXiv:2502.07393](https://arxiv.org/abs/2502.07393))
## Column Naming Convention
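Score columns follow `{sentiment|risk}_{model}_{effort}_{input_source}`. The `{effort}` segment is omitted in some columns, for example `sentiment_opus_gpt5sum` and `sentiment_nano_title`.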
| Suffix | Meaning |
|--------|---------|
| `_gpt5sum` | Scored using GPT-5 generated summary |
| `_o3sum` | Scored using o3 generated summary |
| `_fulltext` | Scored directly from full article text |
| `_title` | Scored from article title only |
| `_gpt5sum_Rhigh_Vmed` | Scored using GPT-5 summary (reasoning=high, verbosity=medium) |
All scores are integer **1-5 scale** (1 = most negative/highest risk, 5 = most positive/lowest risk).
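To group or filter columns programmatically, the naming convention can be parsed with a regular expression. The sketch below is an assumption-laden illustration: the model, effort, and input tokens are inferred from the column names listed in this card, the `{effort}` segment is treated as optional, and `parse_score_column` is a hypothetical helper name.

```python
import re

# Tokens are taken from the column names listed in this card; extend as needed.
COLUMN_RE = re.compile(
    r"^(?P<kind>sentiment|risk)"
    r"_(?P<model>[a-z0-9]+)"
    r"(?:_(?P<effort>high|medium|low|minimal))?"
    r"_(?P<input>gpt5sum|o3sum|fulltext|title)"
    r"(?:_R(?P<reasoning>[a-z]+)_V(?P<verbosity>[a-z]+))?$"
)

def parse_score_column(name):
    """Return the parts of a score column name as a dict, or None if it does not match."""
    match = COLUMN_RE.match(name)
    return match.groupdict() if match else None

for col in ["sentiment_gpt5_high_gpt5sum", "risk_opus_gpt5sum",
            "sentiment_gpt41mini_gpt5sum_Rhigh_Vmed", "Date"]:
    print(col, "->", parse_score_column(col))
```

Metadata columns such as `Date` return `None`, so the parser can double as a filter for score columns.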
## Score Columns (scores.parquet)
### Metadata
| Column | Description |
|--------|-------------|
| Date | Publication date (YYYY-MM-DD) |
| Article_title | Article headline |
| Stock_symbol | Ticker symbol |
| Url | Source URL |
| Publisher | News publisher |
| Author | Article author |
### Full-text Scores (61% coverage)
| Column | Model | Effort |
|--------|-------|--------|
| `sentiment_o3_high_fulltext` | o3 | high |
| `risk_o3_medium_fulltext` | o3 | medium |
| `sentiment_o4mini_high_fulltext` | o4-mini | high |
| `risk_o4mini_medium_fulltext` | o4-mini | medium |
### Claude Models — by GPT-5 summary (61% coverage)
`sentiment_opus_gpt5sum`, `risk_opus_gpt5sum`, `sentiment_sonnet_gpt5sum`, `risk_sonnet_gpt5sum`, `sentiment_haiku_gpt5sum`, `risk_haiku_gpt5sum`
### GPT-5 — 4 effort levels × 2 summary sources (61% coverage)
By GPT-5 summary: `{sentiment|risk}_gpt5_{high|medium|low|minimal}_gpt5sum` (8 cols)
By o3 summary: `{sentiment|risk}_gpt5_{high|medium|low|minimal}_o3sum` (8 cols)
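The brace patterns in this section can be expanded into concrete column names. The sketch below reads the first group as `sentiment`/`risk`, matching the full column names used in Quick Start; `expand` is a hypothetical helper, not part of the dataset tooling.

```python
import itertools
import re

def expand(pattern: str) -> list[str]:
    """Expand every {a|b|c} group in `pattern` into concrete names."""
    # re.split with one capture group alternates literal text and group bodies
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        part.split("|") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]

gpt5_cols = expand("{sentiment|risk}_gpt5_{high|medium|low|minimal}_{gpt5sum|o3sum}")
print(len(gpt5_cols))
print(gpt5_cols[:3])
```

The resulting list can be passed to `scores[gpt5_cols]` to select the GPT-5 block in one step, assuming the expanded names match the file exactly.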
### o3 — 3 efforts × o3 summary + gpt5 summary (61% coverage)
`{sentiment|risk}_o3_{high|medium|low}_o3sum` (6 cols), `{sentiment|risk}_o3_high_gpt5sum` (2 cols)
### o4-mini — 3 efforts × o3 summary (61% coverage)
`{sentiment|risk}_o4mini_{high|medium|low}_o3sum` (6 cols)
### GPT-4.1 family (61% coverage)
| Column pattern | Model | Summary input |
|----------------|-------|---------------|
| `{sentiment\|risk}_gpt41_o3sum` | GPT-4.1 | o3 summary |
| `{sentiment\|risk}_gpt41mini_gpt5sum_R{x}_V{y}` | GPT-4.1-mini | GPT-5 summary (5 variants) |
| `{sentiment\|risk}_gpt41mini_o3sum` | GPT-4.1-mini | o3 summary |
| `{sentiment\|risk}_gpt41nano_o3sum` | GPT-4.1-nano | o3 summary |
### GPT-5-mini (61% coverage)
`sentiment_gpt5mini_high_gpt5sum`, `risk_gpt5mini_high_gpt5sum`
### GPT-5.4-nano — title only (100% coverage)
`sentiment_nano_title`, `risk_nano_title`
## Summary Files
### summaries.parquet — Core summaries used for scoring
| Column | Generator | Coverage |
|--------|-----------|----------|
| Article | — (original text) | 61% |
| Lsa_summary | LSA algorithm | 61% |
| Luhn_summary | Luhn algorithm | 61% |
| Textrank_summary | TextRank | 61% |
| Lexrank_summary | LexRank | 61% |
| gpt_5_summary | GPT-5 (R=high, V=high) | 61% |
| o3_summary | o3 | 61% |
### summaries_gpt5_grid.parquet — GPT-5 summary reasoning × verbosity grid
12 columns: `gpt5_R{reasoning}_V{verbosity}` where reasoning ∈ {high, medium, low, minimal} and verbosity ∈ {high, medium, low}.
`gpt5_Rhigh_Vhigh` is the same text as `gpt_5_summary` in summaries.parquet.
### summaries_gpt5mini_grid.parquet — GPT-5-mini summary grid
Same 4×3 structure: `gpt5mini_R{reasoning}_V{verbosity}`
## Coverage
- **127,176 total articles** (89 NASDAQ tickers, 2009–2024)
- **61.2%** (77,871) have LLM summaries and summary-based scores (articles with text content)
- **100%** for GPT-5.4-nano scores (title-only, no article text needed)
- **38.8%** of rows lack article content in the source data → NaN for all summary-based columns
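The coverage figures above can be checked per column by measuring the share of non-null values. This is a minimal sketch on a made-up toy frame; `coverage` is a hypothetical helper, and the prefix filter assumes score columns start with `sentiment_` or `risk_` as described in this card.

```python
import pandas as pd

def coverage(df: pd.DataFrame, prefixes=("sentiment_", "risk_")) -> pd.Series:
    """Fraction of non-null rows for each score column, sorted descending."""
    score_cols = [c for c in df.columns if c.startswith(prefixes)]
    return df[score_cols].notna().mean().sort_values(ascending=False)

# Toy frame: title-only scores are complete, summary-based scores have gaps
toy = pd.DataFrame({
    "Stock_symbol": ["AAPL", "MSFT", "NVDA", "AMZN", "TSLA"],
    "sentiment_nano_title": [3, 4, 2, 3, 5],
    "sentiment_opus_gpt5sum": [3, None, 2, None, 4],
})
cov = coverage(toy)
print(cov)
```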
## Analysis
### Cross-Model Correlation
Models using the same summary input form high-correlation clusters. Nano (title-only) is the outlier — limited input produces systematically different scores.
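Because the scores are ordinal integers, a rank-based correlation may be a reasonable complement to the default Pearson correlation. The sketch below is one possible approach rather than the method behind the analysis here: it restricts to rows scored by every selected column and uses `DataFrame.corr(method="spearman")`. The data and the name `rank_agreement` are made up for illustration.

```python
import pandas as pd

def rank_agreement(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Spearman correlation matrix over rows where all `cols` are non-null."""
    complete = df[cols].dropna()
    return complete.corr(method="spearman")

# Toy scores: model_b tracks model_a exactly, model_c is its mirror image
toy_scores = pd.DataFrame({
    "sentiment_model_a": [1, 2, 3, 4, 5, None],
    "sentiment_model_b": [1, 2, 3, 4, 5, 3],
    "sentiment_model_c": [5, 4, 3, 2, 1, 3],
})
corr = rank_agreement(toy_scores, list(toy_scores.columns))
print(corr.round(2))
```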

### Score Distribution
Each model has distinct scoring tendencies. Nano is the most conservative (60.8% neutral); o3 is the most opinionated (37.1% neutral).
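A per-model score distribution can be computed as the share of each value among scored rows. The sketch below assumes that "neutral" means the midpoint score 3, which this card does not state explicitly; the data and the name `score_distribution` are illustrative.

```python
import pandas as pd

def score_distribution(scores: pd.Series) -> pd.Series:
    """Share of each score value 1..5 among non-null rows."""
    shares = scores.dropna().astype(int).value_counts(normalize=True)
    # Reindex so that values never assigned still appear with share 0.0
    return shares.reindex(range(1, 6), fill_value=0.0)

toy_column = pd.Series([3, 3, 4, 2, 3, None, 5, 3])
dist = score_distribution(toy_column)
neutral_share = dist[3]  # assumption: 3 is the neutral score
print(dist)
print(f"neutral share: {neutral_share:.1%}")
```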

### Effort Level Impact
Higher reasoning effort produces meaningfully different scores: adjacent effort levels differ on about 12% of scores, and high vs. minimal differ on 28%.

### Summary Source Impact
Switching from GPT-5 summary to o3 summary changes 17–22% of scores, with minimal-effort scoring being most sensitive.

### RL Trading Results
Using these scores for RL trading agent training (PPO, SB3):
- **GPT-5 high vs low effort**: 16.6% disagreement rate on same articles; high vs minimal: 28.0%
- **GPT-5 summary vs o3 summary**: 18.2% disagreement — summary source affects scoring comparably to effort level
- **Best single-model Sharpe**: PPO with GPT-5-mini high scores achieved Sharpe 1.032 on NASDAQ backtest (2019-2023)
- **Multi-seed validation** (5 seeds × 4 algorithms): PPO most robust (0.777±0.098), SAC highest mean (0.780±0.047)
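For readers reproducing figures of this kind, the sketch below shows one common way to compute an annualized Sharpe ratio and a mean ± standard deviation across seeds. The card does not say how its numbers were computed, so everything here is an assumption: daily returns, a zero risk-free rate, 252 trading days per year, and the sample standard deviation. All input values are made up.

```python
import math
import statistics

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns (sample standard deviation)."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return float("nan")
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)

def mean_and_std(values):
    """Mean and sample standard deviation, e.g. of per-seed Sharpe ratios."""
    return statistics.mean(values), statistics.stdev(values)

toy_returns = [0.01, 0.02, 0.03]           # made-up daily returns
toy_seed_sharpes = [0.70, 0.80, 0.90]      # made-up per-seed results

print(f"Sharpe: {sharpe_ratio(toy_returns):.3f}")
mean, std = mean_and_std(toy_seed_sharpes)
print(f"Across seeds: {mean:.3f}±{std:.3f}")
```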
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{hyl2026nasdaq_multi_llm,
title={NASDAQ News Multi-LLM Scores},
author={HYL},
year={2026},
url={https://huggingface.co/datasets/HYL/NASDAQ-News-Multi-LLM-Scores},
note={Multi-LLM re-scoring of FNSPID financial news articles}
}
```
Original data sources:
```bibtex
@misc{dong2024fnspid,
title={FNSPID: A Comprehensive Financial News Dataset in Time Series},
author={Zihan Dong and Xinyu Fan and Zhiyuan Peng},
year={2024},
eprint={2402.06698},
archivePrefix={arXiv}
}
@misc{staf2025finrl,
title={FinRL-DeepSeek: LLM-Infused Risk-Sensitive Reinforcement Learning for Trading Agents},
author={Mostapha Benhenda},
year={2025},
eprint={2502.07393},
archivePrefix={arXiv}
}
```
## License
CC BY-NC 4.0 (inherited from upstream FNSPID dataset). Non-commercial use only.
## Links
- Source code: [ArkScope](https://github.com/HYL-Dave/ArkScope)
- Original dataset: [FNSPID](https://huggingface.co/datasets/Zihan1004/FNSPID)
- FinRL_DeepSeek: [arXiv:2502.07393](https://arxiv.org/abs/2502.07393)
Provided by:
HYL



