TheTokenFactory/sec-contracts-financial-extraction-instructions
收藏Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
- token-classification
language:
- en
tags:
- finance
- financial-nlp
- sec-filings
- sec-edgar
- structured-extraction
- information-extraction
- named-entity-recognition
- ner
- legal
- contracts
- debt-covenants
- executive-compensation
- proxy-statements
- def-14a
- credit-agreements
- instruction-tuning
- fine-tuning
- sharegpt
- alpaca
- chatml
- json-extraction
- sp500
- nlp
dataset_info:
- config_name: sharegpt
features:
- name: conversations
list:
- name: from
dtype: string
- name: value
dtype: string
- name: metadata
struct:
- name: source_file
dtype: string
- name: chunk_type
dtype: string
- name: task_type
dtype: string
- name: company
dtype: string
- name: ticker
dtype: string
- name: pipeline
dtype: string
- name: model_version
dtype: string
- name: iteration
dtype: string
- name: confidence_min
dtype: float64
- name: example_type
dtype: string
- name: negative_reason
dtype: string
- name: drops_count
dtype: int64
- name: rescued_count
dtype: int64
- name: rescue_gates
dtype: string
- name: has_noncanonical_term_type
dtype: bool
- name: has_noncanonical_covenant_type
dtype: bool
- name: has_noncanonical_comp_type
dtype: bool
- name: has_dollar_on_shares
dtype: bool
- name: has_bare_share_count
dtype: bool
- config_name: alpaca
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: metadata
struct:
- name: source_file
dtype: string
- name: chunk_type
dtype: string
- name: task_type
dtype: string
- name: company
dtype: string
- name: ticker
dtype: string
- name: pipeline
dtype: string
- name: model_version
dtype: string
- name: iteration
dtype: string
- name: confidence_min
dtype: float64
- name: example_type
dtype: string
- name: negative_reason
dtype: string
- name: drops_count
dtype: int64
- name: rescued_count
dtype: int64
- name: rescue_gates
dtype: string
- name: has_noncanonical_term_type
dtype: bool
- name: has_noncanonical_covenant_type
dtype: bool
- name: has_noncanonical_comp_type
dtype: bool
- name: has_dollar_on_shares
dtype: bool
- name: has_bare_share_count
dtype: bool
- config_name: openai
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
configs:
- config_name: sharegpt
default: true
data_files:
- split: train
path: "data/sharegpt_financial_extraction.jsonl"
- split: corrective
path: "data/sharegpt_corrective.jsonl"
- config_name: alpaca
data_files:
- split: train
path: "data/alpaca_financial_extraction.jsonl"
- split: corrective
path: "data/alpaca_corrective.jsonl"
- config_name: openai
data_files:
- split: train
path: "data/openai_financial_extraction.jsonl"
- split: corrective
path: "data/openai_corrective.jsonl"
size_categories:
- 1K<n<10K
pretty_name: SEC Financial Extraction Instructions (Exhibit 10 + DEF 14A)
---
# S&P 500 SEC Financial Extraction Instructions
## Dataset Summary
**7,683 instruction-tuning examples** for training LLMs to extract structured financial data from SEC filings. Covers two filing types across S&P 500 companies:
| Split | Examples | Filing Type | Description |
|-------|----------|-------------|-------------|
| **train** | 3,430 | Exhibit 10 + DEF 14A | Positive examples with validated outputs |
| **corrective** | 4,253 | Exhibit 10 + DEF 14A | Corrective, rescued, and negative examples |
### Exhibit 10 — Material Contracts (2,726 positive + 3,060 corrective)
| Task | Positive | Corrective | Description |
|------|----------|------------|-------------|
| Metadata Extraction | 1,028 | 1,027 | Effective dates and contracting party names |
| Financial Term Extraction | 1,434 | 1,600 | Dollar amounts, percentages, 13 term types |
| Covenant Extraction | 264 | 433 | Debt covenants, thresholds, 7 covenant types |
### DEF 14A — Proxy Statements (704 positive + 1,193 corrective)
| Task | Positive | Corrective | Description |
|------|----------|------------|-------------|
| Exec Metadata Extraction | 150 | 150 | Named Executive Officers, CEO/CFO identification |
| Compensation Extraction | 293 | 750 | Executive compensation with 9 comp types |
| Governance Extraction | 261 | 293 | Say-on-pay, clawback policies, peer groups |
### Source
- **Filings:** SEC EDGAR EX-10 exhibits (8-K, 10-K, 10-Q) and DEF 14A proxy statements
- **Companies:** 368 unique S&P 500 companies
- **Documents:** 1,028 material contracts + 150 proxy statements
- **Extraction model:** Gemma 4 2B (Q4_K_M quantized) at temperature 0.1
## Extraction Tasks
### Exhibit 10 Tasks
**Metadata Extraction** — Given a contract preamble, extract effective date and contracting parties.
```json
{"effective_date": "YYYY-MM-DD", "primary_party_1": "Name", "primary_party_2": "Name"}
```
**Financial Term Extraction** — Extract monetary values with 13 term types: salary, bonus, severance, retirement_benefit, equity_grant, credit_facility, loan_amount, interest_rate, fee, threshold, purchase_price, compensation, other.
```json
{"financial_values": [{"value": "$1,500,000", "definition": "Annual base salary for CEO", "term_type": "salary"}]}
```
**Covenant Extraction** — Extract debt covenants with 7 types: leverage_ratio, interest_coverage, debt_service, net_worth, liquidity, fixed_charge, other.
```json
{"covenants": [{"covenant_type": "leverage_ratio", "threshold_value": "3.50x", "definition": "Maximum Consolidated Leverage Ratio"}]}
```
### DEF 14A Tasks
**Exec Metadata Extraction** — Identify Named Executive Officers and fiscal year.
```json
{"fiscal_year": "2025", "named_executive_officers": [{"name": "Jane Doe", "title": "CEO", "is_ceo": true, "is_cfo": false}]}
```
**Compensation Extraction** — Extract compensation values with 9 types: base_salary, stock_award, option_award, non_equity_incentive, pension_change, other_comp, total_comp, severance, ceo_pay_ratio.
```json
{"compensation_values": [{"executive_name": "Jane Doe", "value": "$1,200,000", "comp_type": "base_salary", "definition": "Annual base salary", "fiscal_year": "2025"}]}
```
**Governance Extraction** — Extract say-on-pay votes, clawback policies, and peer group compositions.
```json
{"governance_items": [{"item_type": "say_on_pay", "value": "94.2%", "definition": "Advisory vote approval percentage", "fiscal_year": "2025"}]}
```
## Corrective Training Data
The `corrective` split teaches models to avoid common extraction errors:
| Example Type | Count | Purpose |
|---|---|---|
| **Positive (corrected)** | 1,968 | Post-validation cleaned output as ground truth |
| **Corrective (rescued)** | 95 | Errors corrected by validation (e.g., `$3,205` on share counts -> `3,205 shares`) |
| **Negative** | 2,190 | Inputs with no valid data — teaches empty output |
### Key Corrective Signals
- **Dollar-on-shares:** Model puts `$` on share counts — corrected to bare number + "shares"
- **Bare share counts:** Model omits "shares" label on unit counts — corrected
- **Hallucination phrases:** Model fabricates definitions ("does not contain", "no specific") — teaches empty output
- **Column header names:** Model extracts "Named Executive Officer" as an exec name — dropped
## Dataset Formats
Three standard fine-tuning formats with identical examples:
| Format | File | Best For |
|--------|------|----------|
| **ShareGPT** | `sharegpt_*.jsonl` | Axolotl, Unsloth, LLaMA-Factory |
| **Alpaca** | `alpaca_*.jsonl` | Stanford Alpaca format tools |
| **OpenAI** | `openai_*.jsonl` | OpenAI fine-tuning API, HuggingFace TRL |
## Data Fields
### Metadata Fields (ShareGPT and Alpaca formats)
| Field | Type | Description |
|-------|------|-------------|
| `source_file` | string | SEC filing filename |
| `chunk_type` | string | `metadata`, `financial`, `covenant`, `exec_preamble`, `comp_table`, etc. |
| `task_type` | string | `metadata_extraction`, `financial_extraction`, `compensation_extraction`, etc. |
| `pipeline` | string | `exhibit10` or `proxy` |
| `company` | string | Canonical S&P 500 company name |
| `ticker` | string | Stock ticker symbol |
| `confidence_min` | float | Minimum extraction confidence (0.0-1.0) |
| `example_type` | string | `positive_corrected`, `corrective`, or `negative` (corrective split only) |
| `has_dollar_on_shares` | bool | True if corrective example fixes dollar sign on share counts |
| `has_bare_share_count` | bool | True if corrective example fixes missing "shares" label |
## Quality Filters Applied
### Exhibit 10
- All `"NONE"` values removed (27 excluded)
- Bare `$`/`%` symbols removed (58 excluded)
- Confidence < 0.7 removed (72 excluded)
- Short source text < 50 chars removed (5 excluded)
### DEF 14A Proxy
- Base salary > $5M reclassified to total_comp (35 reclassified)
- Dollar signs on share counts rescued (170 corrected)
- Bare share counts rescued (24 corrected)
- Bare number symbols rescued via definition context (30 corrected)
- Column header names dropped (199 dropped)
- Empty/null governance values dropped (249 dropped)
- Say-on-pay < 50% dropped (9 dropped)
- Director fees reclassified from base_salary to other_comp (20 reclassified)
## Dataset Creation
### Extraction Pipeline
A 6-stage Python pipeline processes raw HTML/TXT filings:
1. **Harvester** — Downloads exhibits from SEC EDGAR
2. **Chopper** — Extracts targeted text blocks using section boundary detection
3. **Extractor** — Routes chunks to task-specific LLM prompts (Gemma 4 2B, temperature 0.1)
4. **Reducer** — Validates through 14+ quality gates, normalizes values, reclassifies mistyped terms
5. **Normalizer** — Resolves entity names to S&P 500 canonical names via CIK lookup
6. **Training Data Generator** — Joins raw inputs with validated outputs, applies quality filters
### Important Note on Labels
Extractions were produced by a 2B parameter model, not human annotators. While quality gates filter obvious errors, these are **silver-standard labels** — suitable for fine-tuning but not for use as a gold-standard evaluation benchmark.
## Intended Uses
- **Fine-tuning small LLMs** (1B-7B) for structured financial document extraction
- **Domain adaptation** for models that need SEC filing understanding
- **Instruction-tuning** for JSON-structured output from financial text
- **Research** on information extraction from legal/financial documents
## Limitations
- **Temporal scope:** 6-month filing window (not a historical backtest)
- **Universe:** S&P 500 only (large-cap US equities)
- **Language:** English only
- **Label quality:** Silver-standard (model-generated, not human-annotated)
- **Model bias:** Gemma 4 2B may have systematic extraction patterns that transfer to fine-tuned models
- **Proxy coverage:** 150 of 500 S&P 500 companies had DEF 14A filings processed
## Citation
```bibtex
@dataset{thetokenfactory2026sp500secextraction,
title={S&P 500 SEC Financial Extraction Instructions},
author={TheTokenFactory},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions}
}
```
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). SEC filings are public domain; this dataset's value is in the structured extraction, quality filtering, and instruction-tuning format.
---
许可协议:知识共享署名4.0(CC-BY-4.0)
任务类别:
- 文本生成
- 令牌分类(Token)
语言:
- 英语
标签:
- 金融
- 金融自然语言处理(Financial NLP)
- SEC文件(SEC filings)
- SEC EDGAR
- 结构化抽取
- 信息抽取
- 命名实体识别(Named Entity Recognition)
- NER
- 法律
- 合同
- 债务契约
- 高管薪酬
- 代理声明
- DEF 14A
- 信贷协议
- 指令微调
- 微调
- ShareGPT
- Alpaca
- ChatML
- JSON抽取
- 标普500(S&P 500)
- 自然语言处理(NLP)
dataset_info:
- config_name: sharegpt
features:
- name: conversations
list:
- name: from
dtype: string
- name: value
dtype: string
- name: metadata
struct:
- name: source_file
dtype: string
- name: chunk_type
dtype: string
- name: task_type
dtype: string
- name: company
dtype: string
- name: ticker
dtype: string
- name: pipeline
dtype: string
- name: model_version
dtype: string
- name: iteration
dtype: string
- name: confidence_min
dtype: float64
- name: example_type
dtype: string
- name: negative_reason
dtype: string
- name: drops_count
dtype: int64
- name: rescued_count
dtype: int64
- name: rescue_gates
dtype: string
- name: has_noncanonical_term_type
dtype: bool
- name: has_noncanonical_covenant_type
dtype: bool
- name: has_noncanonical_comp_type
dtype: bool
- name: has_dollar_on_shares
dtype: bool
- name: has_bare_share_count
dtype: bool
- config_name: alpaca
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
- name: metadata
struct:
- name: source_file
dtype: string
- name: chunk_type
dtype: string
- name: task_type
dtype: string
- name: company
dtype: string
- name: ticker
dtype: string
- name: pipeline
dtype: string
- name: model_version
dtype: string
- name: iteration
dtype: string
- name: confidence_min
dtype: float64
- name: example_type
dtype: string
- name: negative_reason
dtype: string
- name: drops_count
dtype: int64
- name: rescued_count
dtype: int64
- name: rescue_gates
dtype: string
- name: has_noncanonical_term_type
dtype: bool
- name: has_noncanonical_covenant_type
dtype: bool
- name: has_noncanonical_comp_type
dtype: bool
- name: has_dollar_on_shares
dtype: bool
- name: has_bare_share_count
dtype: bool
- config_name: openai
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
configs:
- config_name: sharegpt
default: true
data_files:
- split: train
path: "data/sharegpt_financial_extraction.jsonl"
- split: corrective
path: "data/sharegpt_corrective.jsonl"
- config_name: alpaca
data_files:
- split: train
path: "data/alpaca_financial_extraction.jsonl"
- split: corrective
path: "data/alpaca_corrective.jsonl"
- config_name: openai
data_files:
- split: train
path: "data/openai_financial_extraction.jsonl"
- split: corrective
path: "data/openai_corrective.jsonl"
size_categories:
- 1K<n<10K
pretty_name: SEC金融抽取指令集(附件10 + DEF 14A)
---
# 标普500(S&P 500)SEC金融抽取指令集
## 数据集概览
**共7683条指令微调示例**,用于训练大语言模型(LLM/Large Language Model)从SEC文件中抽取结构化金融数据,覆盖标普500公司的两类申报文件:
| 数据集拆分 | 示例数量 | 申报文件类型 | 说明 |
|-------|----------|-------------|-------------|
| **训练集(train)** | 3430 | 附件10 + DEF 14A | 经过验证输出的正样本示例 |
| **校正集(corrective)** | 4253 | 附件10 + DEF 14A | 校正、挽救及负样本示例 |
### 附件10 — 重大合同(2726条正样本 + 3060条校正样本)
| 任务类型 | 正样本数 | 校正样本数 | 说明 |
|------|----------|------------|-------------|
| 元数据抽取 | 1028 | 1027 | 生效日期与合同方名称 |
| 金融术语抽取 | 1434 | 1600 | 美元金额、百分比及13类术语 |
| 契约抽取 | 264 | 433 | 债务契约、阈值及7类契约类型 |
### DEF 14A — 代理声明(704条正样本 + 1193条校正样本)
| 任务类型 | 正样本数 | 校正样本数 | 说明 |
|------|----------|------------|-------------|
| 高管元数据抽取 | 150 | 150 | 命名高管、CEO/CFO识别 |
| 薪酬抽取 | 293 | 750 | 覆盖9类薪酬类型的高管薪酬数据 |
| 治理抽取 | 261 | 293 | 股东表决薪酬意见、追回政策及同行群体 |
### 数据来源
- **申报文件**:SEC EDGAR的EX-10附件(8-K、10-K、10-Q文件)及DEF 14A代理声明
- **覆盖公司**:368家独立的标普500公司
- **文档数量**:1028份重大合同 + 150份代理声明
- **抽取模型**:Gemma 4 2B(Q4_K_M量化版本),温度参数设为0.1
## 抽取任务
### 附件10抽取任务
**元数据抽取**:给定合同序言,提取生效日期与合同方名称。
json
{"effective_date": "YYYY-MM-DD", "primary_party_1": "Name", "primary_party_2": "Name"}
**金融术语抽取**:抽取覆盖13类术语的货币价值:薪酬、奖金、遣散费、退休福利、股权激励、信贷额度、贷款金额、利率、费用、阈值、收购价格、薪酬及其他。
json
{"financial_values": [{"value": "$1,500,000", "definition": "Annual base salary for CEO", "term_type": "salary"}]}
**契约抽取**:抽取覆盖7类类型的债务契约:杠杆率、利息保障倍数、债务偿付、净资产、流动性、固定费用及其他。
json
{"covenants": [{"covenant_type": "leverage_ratio", "threshold_value": "3.50x", "definition": "Maximum Consolidated Leverage Ratio"}]}
### DEF 14A抽取任务
**高管元数据抽取**:识别命名高管及财年。
json
{"fiscal_year": "2025", "named_executive_officers": [{"name": "Jane Doe", "title": "CEO", "is_ceo": true, "is_cfo": false}]}
**薪酬抽取**:抽取覆盖9类类型的薪酬价值:基本工资、股票奖励、期权奖励、非股权激励、养老金变动、其他薪酬、总薪酬、遣散费及CEO薪酬比率。
json
{"compensation_values": [{"executive_name": "Jane Doe", "value": "$1,200,000", "comp_type": "base_salary", "definition": "Annual base salary", "fiscal_year": "2025"}]}
**治理抽取**:抽取股东表决薪酬意见、追回政策及同行群体构成。
json
{"governance_items": [{"item_type": "say_on_pay", "value": "94.2%", "definition": "Advisory vote approval percentage", "fiscal_year": "2025"}]}
## 校正训练数据
`corrective`拆分用于训练模型规避常见抽取错误:
| 示例类型 | 样本数量 | 用途 |
|---|---|---|
| **校正后正样本** | 1968 | 经过验证清洗后的输出作为真值标签 |
| **挽救型校正样本** | 95 | 通过验证修正的错误示例(例如将股份计数前的`$3,205`校正为`3,205 shares`) |
| **负样本** | 2190 | 无有效数据的输入示例,用于训练模型输出空结果 |
### 核心校正信号
- **股份前加美元符号**:模型在股份计数前添加了`$`符号,校正为纯数字加“股份(shares)”
- **无单位股份计数**:模型遗漏了“shares(股份)”单位标签,需进行校正
- **幻觉性表述**:模型生成虚构的定义(如"does not contain""no specific"),需训练模型输出空结果
- **列标题名称**:模型将"Named Executive Officer"(命名高管)误提取为高管姓名,需丢弃此类样本
## 数据集格式
三种标准微调格式,示例内容完全一致:
| 格式类型 | 对应文件 | 适用场景 |
|--------|------|----------|
| **ShareGPT** | `sharegpt_*.jsonl` | 适用于Axolotl、Unsloth、LLaMA-Factory等工具 |
| **Alpaca** | `alpaca_*.jsonl` | 适用于斯坦福Alpaca格式的相关工具 |
| **OpenAI** | `openai_*.jsonl` | 适用于OpenAI微调API、HuggingFace TRL库 |
## 数据字段
### ShareGPT与Alpaca格式的元数据字段
| 字段名 | 数据类型 | 说明 |
|-------|------|-------------|
| `source_file` | 字符串 | SEC申报文件文件名 |
| `chunk_type` | 字符串 | 块类型,如`metadata`、`financial`、`covenant`、`exec_preamble`、`comp_table`等 |
| `task_type` | 字符串 | 任务类型,如`metadata_extraction`、`financial_extraction`、`compensation_extraction`等 |
| `pipeline` | 字符串 | 处理流水线,可选`exhibit10`或`proxy` |
| `company` | 字符串 | 标普500公司标准名称 |
| `ticker` | 字符串 | 股票代码 |
| `confidence_min` | 浮点数 | 最小抽取置信度(取值范围0.0-1.0) |
| `example_type` | 字符串 | 示例类型,校正拆分下可选`positive_corrected`、`corrective`或`negative` |
| `has_dollar_on_shares` | 布尔值 | 若为真,则该校正样本修复了股份计数前加美元符号的问题 |
| `has_bare_share_count` | 布尔值 | 若为真,则该校正样本修复了遗漏股份单位标签的问题 |
## 应用的质量过滤规则
### 附件10过滤规则
- 移除所有`"NONE"`值(共27条样本被排除)
- 移除孤立的`$`/`%`符号(共58条样本被排除)
- 移除置信度低于0.7的样本(共72条样本被排除)
- 移除源文本长度低于50字符的样本(共5条样本被排除)
### DEF 14A代理声明过滤规则
- 将基本工资超过500万美元的样本重新归类为`total_comp`(总薪酬),共35条
- 挽救股份计数前加美元符号的错误,共170条样本被校正
- 挽救遗漏股份单位标签的错误,共24条样本被校正
- 通过定义上下文挽救孤立数字符号,共30条样本被校正
- 丢弃列标题名称类样本,共199条样本被丢弃
- 丢弃空值/无效治理值样本,共249条样本被丢弃
- 移除股东表决赞成率低于50%的样本,共9条样本被丢弃
- 将董事费从`base_salary`(基本工资)重新归类为`other_comp`(其他薪酬),共20条样本被重新分类
## 数据集构建流程
### 抽取流水线
本数据集通过6阶段Python流水线处理原始HTML/TXT申报文件:
1. **采集器(Harvester)**:从SEC EDGAR下载目标附件
2. **切块器(Chopper)**:通过章节边界检测提取目标文本块
3. **抽取器(Extractor)**:将文本块路由至对应任务的大语言模型提示词(使用Gemma 4 2B,温度参数0.1)
4. **校验器(Reducer)**:通过14+质量阈值验证、标准化数值、重新分类错误标注的术语
5. **标准化器(Normalizer)**:通过CIK查询将实体名称映射至标普500公司标准名称
6. **训练数据生成器**:将原始输入与验证后的输出结合,应用质量过滤规则生成最终数据集
### 标签说明
本数据集的抽取结果由20亿参数模型生成,而非人工标注。尽管质量过滤规则过滤了明显的错误,但本数据集的标签为**银标准标签**,适用于模型微调,但不可作为金标准评估基准使用。
## 预期用途
- **微调小型大语言模型**(10亿-70亿参数),用于结构化金融文档抽取
- **领域自适应微调**,为需要理解SEC申报文件的模型进行适配
- **指令微调**,实现金融文本的JSON结构化输出
- **研究用途**,针对法律/金融文档的信息抽取开展相关研究
## 数据集局限性
- **时间范围**:仅覆盖6个月的申报文件窗口,不具备历史回测能力
- **覆盖范围**:仅包含标普500公司(美国大盘股)
- **语言限制**:仅支持英语
- **标签质量**:银标准标签(模型生成,非人工标注)
- **模型偏差**:Gemma 4 2B模型可能存在系统性抽取偏差,会迁移至微调后的模型
- **代理声明覆盖度**:仅150家标普500公司的DEF 14A申报文件被处理
## 引用
bibtex
@dataset{thetokenfactory2026sp500secextraction,
title={S&P 500 SEC Financial Extraction Instructions},
author={TheTokenFactory},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions}
}
## 许可协议
本数据集采用知识共享署名4.0(CC-BY-4.0)许可协议发布。SEC申报文件属于公有领域;本数据集的价值在于其结构化抽取结果、质量过滤规则及指令微调格式。
提供机构:
TheTokenFactory



