aanshshah/gaap-sec-compliance-dataset
Source: Hugging Face | Updated: 2025-12-02 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/aanshshah/gaap-sec-compliance-dataset
Description:
---
language:
- en
license: apache-2.0
task_categories:
- question-answering
- text-retrieval
- text-generation
- summarization
tags:
- finance
- accounting
- gaap
- sec
- xbrl
- financial-reporting
- compliance
- rag
- retrieval-augmented-generation
pretty_name: GAAP & SEC Compliance Dataset
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: id
    dtype: string
  - name: content
    dtype: string
  - name: metadata
    struct:
    - name: source
      dtype: string
    - name: type
      dtype: string
    - name: category
      dtype: string
    - name: code
      dtype: string
    - name: title
      dtype: string
    - name: date
      dtype: string
    - name: company
      dtype: string
    - name: form
      dtype: string
  splits:
  - name: train
    num_examples: 470151
configs:
- config_name: default
  data_files:
  - split: train
    path: all_documents.jsonl
---
# GAAP & SEC Compliance Dataset
A comprehensive dataset for financial AI applications
## Dataset Overview
This dataset contains **470,151 documents** covering US GAAP (Generally Accepted Accounting Principles) standards and SEC (Securities and Exchange Commission) filing requirements. It's designed for training and evaluating AI systems for financial compliance, accounting Q&A, and regulatory analysis.
### Key Statistics
- **Total Documents**: 470,151
- **Average Length**: 363 characters
- **Unique Companies**: 6,573
- **Date Range**: 2007-01-31 to 2025-12-01
- **Dataset Size**: ~296MB
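The summary figures can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in this card (the total character count comes from the Dataset Statistics table); the variable names are illustrative.

```python
from datetime import date

# Figures copied from this dataset card.
total_documents = 470_151
total_characters = 171_055_320
first_date = date(2007, 1, 31)
last_date = date(2025, 12, 1)

# Mean document length in characters.
average_length = total_characters / total_documents

# Number of calendar days between the earliest and latest document dates.
span_days = (last_date - first_date).days

print(f"Average length: {average_length:.1f} characters")
print(f"Date span: {span_days} days")
```

The reported average of 363 characters corresponds to this mean with the fractional part dropped.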
## Content Distribution
### By Source
- **XBRL**: 445,211 (94.7%)
- **SEC_FILING**: 24,935 (5.3%)
- **GAAP_STANDARD**: 5 (<0.1%)
### By Document Type
- **tag**: 445,211 (94.7%)
- **financial_data**: 24,935 (5.3%)
- **standard**: 5 (<0.1%)
### By Category
- **Other**: 294,775 (62.7%)
- **Expenses**: 55,303 (11.8%)
- **Assets**: 35,592 (7.6%)
- **Liabilities**: 32,958 (7.0%)
- **Income**: 24,658 (5.2%)
- **Equity**: 19,732 (4.2%)
- **Revenue**: 7,133 (1.5%)
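A breakdown like the ones above can be recomputed by counting one metadata field. This is a minimal sketch, not the author's tooling: `field_distribution` and the tiny `sample` list are hypothetical names introduced here, and the records only mimic the documented schema.

```python
from collections import Counter

def field_distribution(records, field):
    """Count values of one metadata field and return (value, count, percent) rows."""
    counts = Counter(record["metadata"][field] for record in records)
    total = sum(counts.values())
    return [
        (value, count, round(100 * count / total, 1))
        for value, count in counts.most_common()
    ]

# Tiny hypothetical sample; real records come from the dataset itself.
sample = [
    {"metadata": {"source": "XBRL", "category": "Assets"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "XBRL", "category": "Other"}},
    {"metadata": {"source": "SEC_FILING", "category": "Revenue"}},
]

for value, count, percent in field_distribution(sample, "source"):
    print(f"{value}: {count} ({percent}%)")
```

Because the function accepts any iterable of records, it should also work on the `train` split loaded with `streaming=True`, at the cost of one full pass over the data.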
## Use Cases
### AI Chatbots
Build intelligent assistants for:
- GAAP compliance questions
- SEC filing analysis
- Accounting standard lookup
- Financial regulation guidance
### Information Retrieval
Power search engines for:
- Financial document discovery
- Regulatory text mining
- Compliance research
- Academic studies
### Machine Learning
Train models for:
- Financial text classification
- Accounting Q&A systems
- Regulatory NLP tasks
- Domain adaptation
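For text classification, one plausible setup is to treat `metadata.category` as the label. The sketch below assumes exactly that; `CATEGORIES`, `LABEL_TO_ID`, and `to_classification_pairs` are hypothetical names, and the category list is copied from the Content Distribution section.

```python
# Category names as listed under "Content Distribution".
CATEGORIES = ["Other", "Expenses", "Assets", "Liabilities", "Income", "Equity", "Revenue"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}

def to_classification_pairs(records):
    """Yield (text, label_id) pairs, skipping records with an unknown category."""
    for record in records:
        label = LABEL_TO_ID.get(record["metadata"].get("category"))
        if label is not None:
            yield record["content"], label

# Hypothetical records that follow the documented schema.
examples = [
    {"content": "Total assets", "metadata": {"category": "Assets"}},
    {"content": "Unlabeled text", "metadata": {"category": None}},
    {"content": "Net sales", "metadata": {"category": "Revenue"}},
]

pairs = list(to_classification_pairs(examples))
print(pairs)
```

Since "Other" covers 62.7% of the documents, class weighting or resampling is probably worth considering before training on these labels.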
## Live Demo
**Interactive Chatbot**: [GAAP & SEC Compliance Chatbot](https://huggingface.co/spaces/aanshshah/gaap-sec-chatbot)
Try the live demonstration powered by this dataset. The chatbot uses quantized Phi-3-Mini with RAG to answer professional questions about:
- US GAAP accounting standards (ASC topics)
- SEC filing requirements and regulations
- Financial reporting compliance
- Accounting treatment guidance
**Note**: The demo runs on CPU-only hardware to keep hosting costs low, so its performance is intentionally constrained.
## Quick Start
### Load Dataset
```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset")

# Or stream it for memory efficiency
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset", streaming=True)

# Inspect the first example
for example in dataset["train"]:
    print(f"Title: {example['metadata']['title']}")
    print(f"Source: {example['metadata']['source']}")
    print(f"Content: {example['content'][:200]}...")
    break
```
### Build RAG System
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# Load dataset
docs = load_dataset("aanshshah/gaap-sec-compliance-dataset")["train"]

# Create embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(list(docs["content"]), convert_to_numpy=True)

# Build a FAISS index sized to the embedding dimension
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

def search_docs(query, k=5):
    """Return the k documents closest to the query by L2 distance."""
    query_vec = encoder.encode([query], convert_to_numpy=True)
    _, indices = index.search(query_vec, k)
    # Cast NumPy integers to int before indexing the dataset
    return [docs[int(i)] for i in indices[0]]

# Example usage
results = search_docs("What is ASC 606?")
```
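Embedding all 470,151 documents in one call holds every vector in memory at once (about 0.7 GB of float32 values at 384 dimensions). One way to bound memory is to process the texts in batches; the helper below is a hypothetical, dependency-free sketch of the batching step only.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Example with a small stand-in list of texts.
texts = ["doc a", "doc b", "doc c", "doc d", "doc e"]
for batch in batched(texts, 2):
    print(len(batch), batch)
```

In the RAG example, each batch of texts could be passed to `encoder.encode(batch)` and the resulting array added to the index with `index.add(...)`, so only one batch of embeddings exists outside the index at a time.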
### Use with LangChain
```python
# Recent LangChain releases provide these integrations in the langchain-community package
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load documents
loader = HuggingFaceDatasetLoader(
    path="aanshshah/gaap-sec-compliance-dataset",
    page_content_column="content",
)
documents = loader.load()

# Create vector store
embeddings = HuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

# Query
results = vectorstore.similarity_search("revenue recognition", k=5)
```
## Dataset Structure
### Document Schema
Each document contains:
- **`id`**: Unique identifier
- **`content`**: Full text content
- **`metadata`**: Structured information including:
  - `source`: Origin (XBRL, SEC_FILING, GAAP_STANDARD)
  - `type`: Document type (tag, financial_data, standard)
  - `category`: Financial category (Assets, Revenue, etc.)
  - `code`: Standard code (e.g., "ASC 606", "us-gaap:Assets")
  - `title`: Human-readable title
  - `date`: Date in YYYY-MM-DD format
  - `company`: Company name (for SEC filings, null for others)
  - `form`: SEC form type (for SEC filings, null for others)
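The schema above can be checked programmatically before records enter a pipeline. The following is a minimal sketch under the field list given here; `schema_problems`, `METADATA_FIELDS`, and `KNOWN_SOURCES` are hypothetical names, not part of the dataset or any library.

```python
METADATA_FIELDS = ("source", "type", "category", "code", "title", "date", "company", "form")
KNOWN_SOURCES = {"XBRL", "SEC_FILING", "GAAP_STANDARD"}

def schema_problems(record):
    """Return a list of human-readable schema problems (empty if none found)."""
    problems = []
    for key in ("id", "content"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} must be a string")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        return problems + ["metadata must be an object"]
    for field in METADATA_FIELDS:
        if field not in metadata:
            problems.append(f"metadata.{field} is missing")
    if metadata.get("source") not in KNOWN_SOURCES:
        problems.append(f"unexpected source: {metadata.get('source')!r}")
    return problems

# Record shaped like the example document in this card.
record = {
    "id": "gaap_standard_67a64e72e3390f7e",
    "content": "# ASC 606: Revenue from Contracts with Customers...",
    "metadata": {
        "source": "GAAP_STANDARD", "type": "standard", "category": "Revenue",
        "code": "ASC 606", "title": "ASC 606: Revenue from Contracts with Customers",
        "date": "2025-01-01", "company": None, "form": None,
    },
}
print(schema_problems(record))
```

`company` and `form` only need to be present, since the card states they are null outside SEC filings.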
### Example Document
```json
{
  "id": "gaap_standard_67a64e72e3390f7e",
  "content": "# ASC 606: Revenue from Contracts with Customers...",
  "metadata": {
    "source": "GAAP_STANDARD",
    "type": "standard",
    "category": "Revenue",
    "code": "ASC 606",
    "title": "ASC 606: Revenue from Contracts with Customers",
    "date": "2025-01-01",
    "company": null,
    "form": null
  }
}
```
## Data Creation Process
### Sources
1. **XBRL US GAAP Taxonomy** (94.7%)
   - Complete standardized accounting tags
   - Hierarchical relationships preserved
2. **SEC EDGAR Database** (5.3%)
   - Real company 10-K/10-Q filings
   - Quarterly data from 2007-2025
3. **FASB Standards** (<0.1%)
   - Core GAAP standards (ASC)
   - Implementation guidance
### Processing Pipeline
1. **Extraction**: Parse XBRL, HTML, PDF sources
2. **Standardization**: Convert to consistent JSON format
3. **Cleaning**: Remove duplicates and invalid entries
4. **Enrichment**: Add metadata and categories
5. **Validation**: Ensure quality and completeness
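The card does not publish the pipeline code, so the following is only an illustrative sketch of what the cleaning step might look like: it drops records with empty content and removes duplicates by hashing normalized text. The function names and the normalization rule (collapse whitespace, lowercase) are assumptions made for this example.

```python
import hashlib

def content_key(text):
    """Hash whitespace-normalized, lowercased text so trivial variants collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean(records):
    """Drop records with empty content and keep the first copy of each duplicate."""
    seen = set()
    kept = []
    for record in records:
        content = record.get("content")
        if not isinstance(content, str) or not content.strip():
            continue  # invalid entry
        key = content_key(content)
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        kept.append(record)
    return kept

# Hypothetical raw records: one duplicate (modulo spacing and case) and one empty entry.
raw = [
    {"id": "a", "content": "Total  Assets"},
    {"id": "b", "content": "total assets"},
    {"id": "c", "content": "   "},
    {"id": "d", "content": "Net income"},
]
print([record["id"] for record in clean(raw)])
```

Hashing keeps memory proportional to the number of distinct documents instead of their total text size, which matters at several hundred thousand records.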
## Applications in Production
### Financial Institutions
- Compliance monitoring systems
- Risk assessment tools
- Regulatory report generation
- Audit automation
### FinTech Companies
- AI-powered accounting assistants
- Automated bookkeeping
- Financial analysis platforms
- Investment research tools
### Education & Training
- Interactive learning platforms
- Professional certification prep
- Academic research
- Student Q&A systems
## Quality & Coverage
### Quality Metrics
- **Deduplicated**: No duplicate documents
- **Validated**: All required fields present
- **Cleaned**: Invalid entries removed
- **Structured**: Consistent schema
- **Current**: Up-to-date as of December 2025
### Coverage Areas
- Complete US GAAP taxonomy
- Major public company filings
- All accounting categories
- Historical and current standards
- Multiple filing types (10-K, 10-Q, 8-K)
## Legal & Ethics
### Data Sources
- All data from public sources
- No proprietary information
- SEC EDGAR publicly available filings
- XBRL taxonomy open standard
### Use Restrictions
- Not for investment advice
- Educational/research purposes
- Verify critical information with official sources
- Comply with applicable regulations
### Privacy
- No personal identifying information
- No material non-public information
- Only public company data
- Anonymized where appropriate
## Updates & Maintenance
### Version History
- **v1.0.0** (December 2025): Initial release with 470K documents
### Update Schedule
- Quarterly updates planned
- New SEC filings added
- GAAP standard updates included
- Community feedback incorporated
## Support & Community
### Getting Help
- [Discussions](https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset/discussions)
- [Issues](https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset/discussions/new)
- Contact via HuggingFace profile
### Contributing
- Report data quality issues
- Suggest additional sources
- Share use cases and applications
- Submit improvements
## Citation
If you use this dataset in your research or applications, please cite:
```bibtex
@dataset{gaap_sec_compliance_2025,
  title={GAAP \& SEC Compliance Dataset},
  author={Shah, Aansh},
  year={2025},
  month={12},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset},
  note={A comprehensive dataset of 470,151 financial documents for AI applications}
}
```
## Acknowledgments
- **XBRL US** for taxonomy data
- **SEC EDGAR** for public filings
- **FASB** for accounting standards
- **HuggingFace** for hosting platform
---
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Documents | 470,151 |
| Characters | 171,055,320 |
| Companies | 6,573 |
| Date Span | 6,879 days (2007-01-31 to 2025-12-01) |
| Storage | ~296MB |
Built for the financial AI community
Ready to build the next generation of financial AI? Start with this dataset!
Provider: aanshshah



