five

TheTokenFactory/sec-contracts-financial-extraction-instructions

收藏
Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation - token-classification language: - en tags: - finance - financial-nlp - sec-filings - sec-edgar - structured-extraction - information-extraction - named-entity-recognition - ner - legal - contracts - debt-covenants - executive-compensation - proxy-statements - def-14a - credit-agreements - instruction-tuning - fine-tuning - sharegpt - alpaca - chatml - json-extraction - sp500 - nlp dataset_info: - config_name: sharegpt features: - name: conversations list: - name: from dtype: string - name: value dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: alpaca features: - name: instruction dtype: string - name: input dtype: string - name: output dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: openai features: - name: messages list: - name: role dtype: string - name: content dtype: string configs: - config_name: sharegpt default: true data_files: - split: train path: "data/sharegpt_financial_extraction.jsonl" - split: corrective path: "data/sharegpt_corrective.jsonl" - config_name: alpaca data_files: - split: train path: "data/alpaca_financial_extraction.jsonl" - split: corrective path: "data/alpaca_corrective.jsonl" - config_name: openai data_files: - split: train path: "data/openai_financial_extraction.jsonl" - split: corrective path: "data/openai_corrective.jsonl" size_categories: - 1K<n<10K pretty_name: SEC Financial Extraction Instructions (Exhibit 10 + DEF 14A) --- # S&P 500 SEC Financial Extraction Instructions ## Dataset Summary **7,683 instruction-tuning examples** for training LLMs to extract structured financial data from SEC filings. Covers two filing types across S&P 500 companies: | Split | Examples | Filing Type | Description | |-------|----------|-------------|-------------| | **train** | 3,430 | Exhibit 10 + DEF 14A | Positive examples with validated outputs | | **corrective** | 4,253 | Exhibit 10 + DEF 14A | Corrective, rescued, and negative examples | ### Exhibit 10 — Material Contracts (2,726 positive + 3,060 corrective) | Task | Positive | Corrective | Description | |------|----------|------------|-------------| | Metadata Extraction | 1,028 | 1,027 | Effective dates and contracting party names | | Financial Term Extraction | 1,434 | 1,600 | Dollar amounts, percentages, 13 term types | | Covenant Extraction | 264 | 433 | Debt covenants, thresholds, 7 covenant types | ### DEF 14A — Proxy Statements (704 positive + 1,193 corrective) | Task | Positive | Corrective | Description | |------|----------|------------|-------------| | Exec Metadata Extraction | 150 | 150 | Named Executive Officers, CEO/CFO identification | | Compensation Extraction | 293 | 750 | Executive compensation with 9 comp types | | Governance Extraction | 261 | 293 | Say-on-pay, clawback policies, peer groups | ### Source - **Filings:** SEC EDGAR EX-10 exhibits (8-K, 10-K, 10-Q) and DEF 14A proxy statements - **Companies:** 368 unique S&P 500 companies - **Documents:** 1,028 material contracts + 150 proxy statements - **Extraction model:** Gemma 4 2B (Q4_K_M quantized) at temperature 0.1 ## Extraction Tasks ### Exhibit 10 Tasks **Metadata Extraction** — Given a contract preamble, extract effective date and contracting parties. ```json {"effective_date": "YYYY-MM-DD", "primary_party_1": "Name", "primary_party_2": "Name"} ``` **Financial Term Extraction** — Extract monetary values with 13 term types: salary, bonus, severance, retirement_benefit, equity_grant, credit_facility, loan_amount, interest_rate, fee, threshold, purchase_price, compensation, other. ```json {"financial_values": [{"value": "$1,500,000", "definition": "Annual base salary for CEO", "term_type": "salary"}]} ``` **Covenant Extraction** — Extract debt covenants with 7 types: leverage_ratio, interest_coverage, debt_service, net_worth, liquidity, fixed_charge, other. ```json {"covenants": [{"covenant_type": "leverage_ratio", "threshold_value": "3.50x", "definition": "Maximum Consolidated Leverage Ratio"}]} ``` ### DEF 14A Tasks **Exec Metadata Extraction** — Identify Named Executive Officers and fiscal year. ```json {"fiscal_year": "2025", "named_executive_officers": [{"name": "Jane Doe", "title": "CEO", "is_ceo": true, "is_cfo": false}]} ``` **Compensation Extraction** — Extract compensation values with 9 types: base_salary, stock_award, option_award, non_equity_incentive, pension_change, other_comp, total_comp, severance, ceo_pay_ratio. ```json {"compensation_values": [{"executive_name": "Jane Doe", "value": "$1,200,000", "comp_type": "base_salary", "definition": "Annual base salary", "fiscal_year": "2025"}]} ``` **Governance Extraction** — Extract say-on-pay votes, clawback policies, and peer group compositions. ```json {"governance_items": [{"item_type": "say_on_pay", "value": "94.2%", "definition": "Advisory vote approval percentage", "fiscal_year": "2025"}]} ``` ## Corrective Training Data The `corrective` split teaches models to avoid common extraction errors: | Example Type | Count | Purpose | |---|---|---| | **Positive (corrected)** | 1,968 | Post-validation cleaned output as ground truth | | **Corrective (rescued)** | 95 | Errors corrected by validation (e.g., `$3,205` on share counts -> `3,205 shares`) | | **Negative** | 2,190 | Inputs with no valid data — teaches empty output | ### Key Corrective Signals - **Dollar-on-shares:** Model puts `$` on share counts — corrected to bare number + "shares" - **Bare share counts:** Model omits "shares" label on unit counts — corrected - **Hallucination phrases:** Model fabricates definitions ("does not contain", "no specific") — teaches empty output - **Column header names:** Model extracts "Named Executive Officer" as an exec name — dropped ## Dataset Formats Three standard fine-tuning formats with identical examples: | Format | File | Best For | |--------|------|----------| | **ShareGPT** | `sharegpt_*.jsonl` | Axolotl, Unsloth, LLaMA-Factory | | **Alpaca** | `alpaca_*.jsonl` | Stanford Alpaca format tools | | **OpenAI** | `openai_*.jsonl` | OpenAI fine-tuning API, HuggingFace TRL | ## Data Fields ### Metadata Fields (ShareGPT and Alpaca formats) | Field | Type | Description | |-------|------|-------------| | `source_file` | string | SEC filing filename | | `chunk_type` | string | `metadata`, `financial`, `covenant`, `exec_preamble`, `comp_table`, etc. | | `task_type` | string | `metadata_extraction`, `financial_extraction`, `compensation_extraction`, etc. | | `pipeline` | string | `exhibit10` or `proxy` | | `company` | string | Canonical S&P 500 company name | | `ticker` | string | Stock ticker symbol | | `confidence_min` | float | Minimum extraction confidence (0.0-1.0) | | `example_type` | string | `positive_corrected`, `corrective`, or `negative` (corrective split only) | | `has_dollar_on_shares` | bool | True if corrective example fixes dollar sign on share counts | | `has_bare_share_count` | bool | True if corrective example fixes missing "shares" label | ## Quality Filters Applied ### Exhibit 10 - All `"NONE"` values removed (27 excluded) - Bare `$`/`%` symbols removed (58 excluded) - Confidence < 0.7 removed (72 excluded) - Short source text < 50 chars removed (5 excluded) ### DEF 14A Proxy - Base salary > $5M reclassified to total_comp (35 reclassified) - Dollar signs on share counts rescued (170 corrected) - Bare share counts rescued (24 corrected) - Bare number symbols rescued via definition context (30 corrected) - Column header names dropped (199 dropped) - Empty/null governance values dropped (249 dropped) - Say-on-pay < 50% dropped (9 dropped) - Director fees reclassified from base_salary to other_comp (20 reclassified) ## Dataset Creation ### Extraction Pipeline A 6-stage Python pipeline processes raw HTML/TXT filings: 1. **Harvester** — Downloads exhibits from SEC EDGAR 2. **Chopper** — Extracts targeted text blocks using section boundary detection 3. **Extractor** — Routes chunks to task-specific LLM prompts (Gemma 4 2B, temperature 0.1) 4. **Reducer** — Validates through 14+ quality gates, normalizes values, reclassifies mistyped terms 5. **Normalizer** — Resolves entity names to S&P 500 canonical names via CIK lookup 6. **Training Data Generator** — Joins raw inputs with validated outputs, applies quality filters ### Important Note on Labels Extractions were produced by a 2B parameter model, not human annotators. While quality gates filter obvious errors, these are **silver-standard labels** — suitable for fine-tuning but not for use as a gold-standard evaluation benchmark. ## Intended Uses - **Fine-tuning small LLMs** (1B-7B) for structured financial document extraction - **Domain adaptation** for models that need SEC filing understanding - **Instruction-tuning** for JSON-structured output from financial text - **Research** on information extraction from legal/financial documents ## Limitations - **Temporal scope:** 6-month filing window (not a historical backtest) - **Universe:** S&P 500 only (large-cap US equities) - **Language:** English only - **Label quality:** Silver-standard (model-generated, not human-annotated) - **Model bias:** Gemma 4 2B may have systematic extraction patterns that transfer to fine-tuned models - **Proxy coverage:** 150 of 500 S&P 500 companies had DEF 14A filings processed ## Citation ```bibtex @dataset{thetokenfactory2026sp500secextraction, title={S&P 500 SEC Financial Extraction Instructions}, author={TheTokenFactory}, year={2026}, publisher={HuggingFace}, url={https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions} } ``` ## License This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). SEC filings are public domain; this dataset's value is in the structured extraction, quality filtering, and instruction-tuning format.

--- 许可协议:知识共享署名4.0(CC-BY-4.0) 任务类别: - 文本生成 - 令牌分类(Token) 语言: - 英语 标签: - 金融 - 金融自然语言处理(Financial NLP) - SEC文件(SEC filings) - SEC EDGAR - 结构化抽取 - 信息抽取 - 命名实体识别(Named Entity Recognition) - NER - 法律 - 合同 - 债务契约 - 高管薪酬 - 代理声明 - DEF 14A - 信贷协议 - 指令微调 - 微调 - ShareGPT - Alpaca - ChatML - JSON抽取 - 标普500(S&P 500) - 自然语言处理(NLP) dataset_info: - config_name: sharegpt features: - name: conversations list: - name: from dtype: string - name: value dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: alpaca features: - name: instruction dtype: string - name: input dtype: string - name: output dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: openai features: - name: messages list: - name: role dtype: string - name: content dtype: string configs: - config_name: sharegpt default: true data_files: - split: train path: "data/sharegpt_financial_extraction.jsonl" - split: corrective path: "data/sharegpt_corrective.jsonl" - config_name: alpaca data_files: - split: train path: "data/alpaca_financial_extraction.jsonl" - split: corrective path: "data/alpaca_corrective.jsonl" - config_name: openai data_files: - split: train path: "data/openai_financial_extraction.jsonl" - split: corrective path: "data/openai_corrective.jsonl" size_categories: - 1K<n<10K pretty_name: SEC金融抽取指令集(附件10 + DEF 14A) --- # 标普500(S&P 500)SEC金融抽取指令集 ## 数据集概览 **共7683条指令微调示例**,用于训练大语言模型(LLM/Large Language Model)从SEC文件中抽取结构化金融数据,覆盖标普500公司的两类申报文件: | 数据集拆分 | 示例数量 | 申报文件类型 | 说明 | |-------|----------|-------------|-------------| | **训练集(train)** | 3430 | 附件10 + DEF 14A | 经过验证输出的正样本示例 | | **校正集(corrective)** | 4253 | 附件10 + DEF 14A | 校正、挽救及负样本示例 | ### 附件10 — 重大合同(2726条正样本 + 3060条校正样本) | 任务类型 | 正样本数 | 校正样本数 | 说明 | |------|----------|------------|-------------| | 元数据抽取 | 1028 | 1027 | 生效日期与合同方名称 | | 金融术语抽取 | 1434 | 1600 | 美元金额、百分比及13类术语 | | 契约抽取 | 264 | 433 | 债务契约、阈值及7类契约类型 | ### DEF 14A — 代理声明(704条正样本 + 1193条校正样本) | 任务类型 | 正样本数 | 校正样本数 | 说明 | |------|----------|------------|-------------| | 高管元数据抽取 | 150 | 150 | 命名高管、CEO/CFO识别 | | 薪酬抽取 | 293 | 750 | 覆盖9类薪酬类型的高管薪酬数据 | | 治理抽取 | 261 | 293 | 股东表决薪酬意见、追回政策及同行群体 | ### 数据来源 - **申报文件**:SEC EDGAR的EX-10附件(8-K、10-K、10-Q文件)及DEF 14A代理声明 - **覆盖公司**:368家独立的标普500公司 - **文档数量**:1028份重大合同 + 150份代理声明 - **抽取模型**:Gemma 4 2B(Q4_K_M量化版本),温度参数设为0.1 ## 抽取任务 ### 附件10抽取任务 **元数据抽取**:给定合同序言,提取生效日期与合同方名称。 json {"effective_date": "YYYY-MM-DD", "primary_party_1": "Name", "primary_party_2": "Name"} **金融术语抽取**:抽取覆盖13类术语的货币价值:薪酬、奖金、遣散费、退休福利、股权激励、信贷额度、贷款金额、利率、费用、阈值、收购价格、薪酬及其他。 json {"financial_values": [{"value": "$1,500,000", "definition": "Annual base salary for CEO", "term_type": "salary"}]} **契约抽取**:抽取覆盖7类类型的债务契约:杠杆率、利息保障倍数、债务偿付、净资产、流动性、固定费用及其他。 json {"covenants": [{"covenant_type": "leverage_ratio", "threshold_value": "3.50x", "definition": "Maximum Consolidated Leverage Ratio"}]} ### DEF 14A抽取任务 **高管元数据抽取**:识别命名高管及财年。 json {"fiscal_year": "2025", "named_executive_officers": [{"name": "Jane Doe", "title": "CEO", "is_ceo": true, "is_cfo": false}]} **薪酬抽取**:抽取覆盖9类类型的薪酬价值:基本工资、股票奖励、期权奖励、非股权激励、养老金变动、其他薪酬、总薪酬、遣散费及CEO薪酬比率。 json {"compensation_values": [{"executive_name": "Jane Doe", "value": "$1,200,000", "comp_type": "base_salary", "definition": "Annual base salary", "fiscal_year": "2025"}]} **治理抽取**:抽取股东表决薪酬意见、追回政策及同行群体构成。 json {"governance_items": [{"item_type": "say_on_pay", "value": "94.2%", "definition": "Advisory vote approval percentage", "fiscal_year": "2025"}]} ## 校正训练数据 `corrective`拆分用于训练模型规避常见抽取错误: | 示例类型 | 样本数量 | 用途 | |---|---|---| | **校正后正样本** | 1968 | 经过验证清洗后的输出作为真值标签 | | **挽救型校正样本** | 95 | 通过验证修正的错误示例(例如将股份计数前的`$3,205`校正为`3,205 shares`) | | **负样本** | 2190 | 无有效数据的输入示例,用于训练模型输出空结果 | ### 核心校正信号 - **股份前加美元符号**:模型在股份计数前添加了`$`符号,校正为纯数字加“股份(shares)” - **无单位股份计数**:模型遗漏了“shares(股份)”单位标签,需进行校正 - **幻觉性表述**:模型生成虚构的定义(如"does not contain""no specific"),需训练模型输出空结果 - **列标题名称**:模型将"Named Executive Officer"(命名高管)误提取为高管姓名,需丢弃此类样本 ## 数据集格式 三种标准微调格式,示例内容完全一致: | 格式类型 | 对应文件 | 适用场景 | |--------|------|----------| | **ShareGPT** | `sharegpt_*.jsonl` | 适用于Axolotl、Unsloth、LLaMA-Factory等工具 | | **Alpaca** | `alpaca_*.jsonl` | 适用于斯坦福Alpaca格式的相关工具 | | **OpenAI** | `openai_*.jsonl` | 适用于OpenAI微调API、HuggingFace TRL库 | ## 数据字段 ### ShareGPT与Alpaca格式的元数据字段 | 字段名 | 数据类型 | 说明 | |-------|------|-------------| | `source_file` | 字符串 | SEC申报文件文件名 | | `chunk_type` | 字符串 | 块类型,如`metadata`、`financial`、`covenant`、`exec_preamble`、`comp_table`等 | | `task_type` | 字符串 | 任务类型,如`metadata_extraction`、`financial_extraction`、`compensation_extraction`等 | | `pipeline` | 字符串 | 处理流水线,可选`exhibit10`或`proxy` | | `company` | 字符串 | 标普500公司标准名称 | | `ticker` | 字符串 | 股票代码 | | `confidence_min` | 浮点数 | 最小抽取置信度(取值范围0.0-1.0) | | `example_type` | 字符串 | 示例类型,校正拆分下可选`positive_corrected`、`corrective`或`negative` | | `has_dollar_on_shares` | 布尔值 | 若为真,则该校正样本修复了股份计数前加美元符号的问题 | | `has_bare_share_count` | 布尔值 | 若为真,则该校正样本修复了遗漏股份单位标签的问题 | ## 应用的质量过滤规则 ### 附件10过滤规则 - 移除所有`"NONE"`值(共27条样本被排除) - 移除孤立的`$`/`%`符号(共58条样本被排除) - 移除置信度低于0.7的样本(共72条样本被排除) - 移除源文本长度低于50字符的样本(共5条样本被排除) ### DEF 14A代理声明过滤规则 - 将基本工资超过500万美元的样本重新归类为`total_comp`(总薪酬),共35条 - 挽救股份计数前加美元符号的错误,共170条样本被校正 - 挽救遗漏股份单位标签的错误,共24条样本被校正 - 通过定义上下文挽救孤立数字符号,共30条样本被校正 - 丢弃列标题名称类样本,共199条样本被丢弃 - 丢弃空值/无效治理值样本,共249条样本被丢弃 - 移除股东表决赞成率低于50%的样本,共9条样本被丢弃 - 将董事费从`base_salary`(基本工资)重新归类为`other_comp`(其他薪酬),共20条样本被重新分类 ## 数据集构建流程 ### 抽取流水线 本数据集通过6阶段Python流水线处理原始HTML/TXT申报文件: 1. **采集器(Harvester)**:从SEC EDGAR下载目标附件 2. **切块器(Chopper)**:通过章节边界检测提取目标文本块 3. **抽取器(Extractor)**:将文本块路由至对应任务的大语言模型提示词(使用Gemma 4 2B,温度参数0.1) 4. **校验器(Reducer)**:通过14+质量阈值验证、标准化数值、重新分类错误标注的术语 5. **标准化器(Normalizer)**:通过CIK查询将实体名称映射至标普500公司标准名称 6. **训练数据生成器**:将原始输入与验证后的输出结合,应用质量过滤规则生成最终数据集 ### 标签说明 本数据集的抽取结果由20亿参数模型生成,而非人工标注。尽管质量过滤规则过滤了明显的错误,但本数据集的标签为**银标准标签**,适用于模型微调,但不可作为金标准评估基准使用。 ## 预期用途 - **微调小型大语言模型**(10亿-70亿参数),用于结构化金融文档抽取 - **领域自适应微调**,为需要理解SEC申报文件的模型进行适配 - **指令微调**,实现金融文本的JSON结构化输出 - **研究用途**,针对法律/金融文档的信息抽取开展相关研究 ## 数据集局限性 - **时间范围**:仅覆盖6个月的申报文件窗口,不具备历史回测能力 - **覆盖范围**:仅包含标普500公司(美国大盘股) - **语言限制**:仅支持英语 - **标签质量**:银标准标签(模型生成,非人工标注) - **模型偏差**:Gemma 4 2B模型可能存在系统性抽取偏差,会迁移至微调后的模型 - **代理声明覆盖度**:仅150家标普500公司的DEF 14A申报文件被处理 ## 引用 bibtex @dataset{thetokenfactory2026sp500secextraction, title={S&P 500 SEC Financial Extraction Instructions}, author={TheTokenFactory}, year={2026}, publisher={HuggingFace}, url={https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions} } ## 许可协议 本数据集采用知识共享署名4.0(CC-BY-4.0)许可协议发布。SEC申报文件属于公有领域;本数据集的价值在于其结构化抽取结果、质量过滤规则及指令微调格式。
提供机构:
TheTokenFactory
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作