five

TheTokenFactory/sec-contracts-corrective-extraction

收藏
Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/TheTokenFactory/sec-contracts-corrective-extraction
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation - token-classification language: - en tags: - finance - financial-nlp - sec-filings - sec-edgar - structured-extraction - information-extraction - instruction-tuning - fine-tuning - sharegpt - alpaca - chatml - corrective-training - hard-negatives - executive-compensation - proxy-statements - def-14a - json-extraction - sp500 - nlp dataset_info: - config_name: sharegpt features: - name: conversations list: - name: from dtype: string - name: value dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: alpaca features: - name: instruction dtype: string - name: input dtype: string - name: output dtype: string - name: metadata struct: - name: source_file dtype: string - name: chunk_type dtype: string - name: task_type dtype: string - name: company dtype: string - name: ticker dtype: string - name: pipeline dtype: string - name: model_version dtype: string - name: iteration dtype: string - name: confidence_min dtype: float64 - name: example_type dtype: string - name: negative_reason dtype: string - name: drops_count dtype: int64 - name: rescued_count dtype: int64 - name: rescue_gates dtype: string - name: has_noncanonical_term_type dtype: bool - name: has_noncanonical_covenant_type dtype: bool - name: has_noncanonical_comp_type dtype: bool - name: has_dollar_on_shares dtype: bool - name: has_bare_share_count dtype: bool - config_name: openai features: - name: messages list: - name: role dtype: string - name: content dtype: string configs: - config_name: sharegpt default: true data_files: - split: train path: "data/sharegpt_corrective.jsonl" - config_name: alpaca data_files: - split: train path: "data/alpaca_corrective.jsonl" - config_name: openai data_files: - split: train path: "data/openai_corrective.jsonl" size_categories: - 1K<n<10K pretty_name: SEC Financial Extraction - Corrective Training Data --- # S&P 500 SEC Financial Extractions - Corrective Dataset ## Dataset Summary **4,253 corrective instruction-tuning examples** designed to teach LLMs what the base model gets wrong when extracting structured financial data from SEC filings. Covers both Exhibit 10 material contracts and DEF 14A proxy statements from S&P 500 companies. This is a companion to [TheTokenFactory/sec-contracts-financial-extraction-instructions](https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions), which contains the positive training examples. | Pipeline | Examples | Filing Type | |----------|----------|-------------| | Exhibit 10 | 3,060 | Material contracts (8-K, 10-K, 10-Q EX-10 exhibits) | | DEF 14A | 1,193 | Proxy statements (executive compensation, governance) | ## Example Types | Type | Count | Description | |------|-------|-------------| | **Positive (corrected)** | 1,968 | Same input as raw extraction, but output is the post-reducer validated version | | **Corrective (rescued)** | 95 | Extractions where the reducer fixed a specific error - output shows the corrected value | | **Negative** | 2,190 | Inputs where all extractions were invalid - output is empty JSON, teaching the model to say "nothing here" | ## Key Corrective Signals ### Symbol Discipline (Proxy-specific) The model's biggest weakness is symbol handling on compensation tables where dollar amounts and share counts appear side by side: | Error | Count | Example | Correction | |-------|-------|---------|------------| | **Dollar on shares** | 50 | `$3,205` for "Performance Shares Earned" | `3,205 shares` | | **Bare share count** | 11 | `92,028` for "Restricted Stock Units" | `92,028 shares` | | **Missing dollar sign** | 30 | `9,525` for "Annual base salary" | `$9,525` | ### Hallucination Prevention | Error | Count | What it teaches | |-------|-------|-----------------| | **Hallucination phrases** | 23 | Drop when definition says "does not contain", "no specific", "page number" | | **Column headers as names** | 194 | Drop when exec name is "Named Executive Officer", "Total", etc. | | **Empty governance values** | 182 | Drop when governance value is null, "N/A", "not found" | ### Drop Gate Distribution (Negative Examples) | Gate | Count | Description | |------|-------|-------------| | EMPTY_VALUE | 202 | Model returned "NONE" marker | | COLUMN_HEADER_NAME | 194 | Table header used as executive name | | EMPTY_GOV_VALUE | 182 | Null/N/A governance values | | EMPTY_TYPE | 60 | Missing item_type | | BAD_COMP_TYPE | 55 | Non-canonical compensation type | | HALLUCINATION_PHRASE | 23 | Fabricated definitions | ## Formats Three standard fine-tuning formats with identical examples: | Format | File | Best For | |--------|------|----------| | **ShareGPT** | `sharegpt_corrective.jsonl` | Axolotl, Unsloth, LLaMA-Factory | | **Alpaca** | `alpaca_corrective.jsonl` | Stanford Alpaca format tools | | **OpenAI** | `openai_corrective.jsonl` | OpenAI fine-tuning API, HuggingFace TRL | ## Metadata Fields | Field | Type | Description | |-------|------|-------------| | `pipeline` | string | `exhibit10` or `proxy` | | `example_type` | string | `positive_corrected`, `corrective`, or `negative` | | `negative_reason` | string | Primary validation gate for negative examples | | `rescue_gates` | string | Comma-separated gates that triggered rescue | | `has_dollar_on_shares` | bool | True if this example corrects $ on share counts | | `has_bare_share_count` | bool | True if this example corrects missing "shares" label | | `drops_count` | int | Number of extractions dropped by validation | | `rescued_count` | int | Number of extractions rescued by validation | ## Dataset Creation Generated by comparing raw LLM extractions (pre-validation) against post-reducer validated outputs. The gap between raw and validated output defines the corrective signal. See the [extraction pipeline documentation](https://huggingface.co/datasets/TheTokenFactory/sec-contracts-financial-extraction-instructions) for full pipeline details. ### Important Note on Labels These are **silver-standard labels** generated by a 2B parameter model with automated validation. Suitable for fine-tuning but not for gold-standard evaluation. ## License [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
提供机构:
TheTokenFactory
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作