aanshshah/gaap-sec-compliance-dataset

Hugging Face | Updated 2025-12-02 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/aanshshah/gaap-sec-compliance-dataset
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- question-answering
- text-retrieval
- text-generation
- summarization
tags:
- finance
- accounting
- gaap
- sec
- xbrl
- financial-reporting
- compliance
- rag
- retrieval-augmented-generation
pretty_name: GAAP & SEC Compliance Dataset
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: id
    dtype: string
  - name: content
    dtype: string
  - name: metadata
    struct:
    - name: source
      dtype: string
    - name: type
      dtype: string
    - name: category
      dtype: string
    - name: code
      dtype: string
    - name: title
      dtype: string
    - name: date
      dtype: string
    - name: company
      dtype: string
    - name: form
      dtype: string
  splits:
  - name: train
    num_examples: 470151
configs:
- config_name: default
  data_files:
  - split: train
    path: all_documents.jsonl
---

# GAAP & SEC Compliance Dataset

A comprehensive dataset for financial AI applications

## Dataset Overview

This dataset contains **470,151 documents** covering US GAAP (Generally Accepted Accounting Principles) standards and SEC (Securities and Exchange Commission) filing requirements. It's designed for training and evaluating AI systems for financial compliance, accounting Q&A, and regulatory analysis.
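The card's metadata declares each record as an `id`, a `content` string, and a `metadata` struct with eight string fields. As a minimal sketch of how a record could be checked against that declared layout (the helper name `missing_fields` is hypothetical and is not part of the dataset or of the `datasets` library; the sample record is the example document shown in this card):

```python
# Field names are taken from the features declared in the dataset card.
# The helper itself is an illustrative assumption, not an official utility.
TOP_LEVEL_FIELDS = ("id", "content", "metadata")
METADATA_FIELDS = (
    "source", "type", "category", "code",
    "title", "date", "company", "form",
)


def missing_fields(record: dict) -> list[str]:
    """Return the declared field names that are absent from a record."""
    missing = [name for name in TOP_LEVEL_FIELDS if name not in record]
    metadata = record.get("metadata") or {}
    missing += [
        f"metadata.{name}" for name in METADATA_FIELDS if name not in metadata
    ]
    return missing


# The example document from the card; null fields become None in Python.
sample = {
    "id": "gaap_standard_67a64e72e3390f7e",
    "content": "# ASC 606: Revenue from Contracts with Customers...",
    "metadata": {
        "source": "GAAP_STANDARD",
        "type": "standard",
        "category": "Revenue",
        "code": "ASC 606",
        "title": "ASC 606: Revenue from Contracts with Customers",
        "date": "2025-01-01",
        "company": None,
        "form": None,
    },
}

print(missing_fields(sample))
```

A field that is present but null (such as `company` for non-SEC records) counts as present here, which matches the schema description of `company` and `form` being null outside SEC filings.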
### Key Statistics

- **Total Documents**: 470,151
- **Average Length**: 363 characters
- **Unique Companies**: 6,573
- **Date Range**: 2007-01-31 to 2025-12-01
- **Dataset Size**: ~296MB

## Content Distribution

### By Source

- **XBRL**: 445,211 (94.7%)
- **SEC_FILING**: 24,935 (5.3%)
- **GAAP_STANDARD**: 5 (0.0%)

### By Document Type

- **tag**: 445,211 (94.7%)
- **financial_data**: 24,935 (5.3%)
- **standard**: 5 (0.0%)

### By Category (Top 10)

- **Other**: 294,775 (62.7%)
- **Expenses**: 55,303 (11.8%)
- **Assets**: 35,592 (7.6%)
- **Liabilities**: 32,958 (7.0%)
- **Income**: 24,658 (5.2%)
- **Equity**: 19,732 (4.2%)
- **Revenue**: 7,133 (1.5%)

## Use Cases

### AI Chatbots

Build intelligent assistants for:

- GAAP compliance questions
- SEC filing analysis
- Accounting standard lookup
- Financial regulation guidance

### Information Retrieval

Power search engines for:

- Financial document discovery
- Regulatory text mining
- Compliance research
- Academic studies

### Machine Learning

Train models for:

- Financial text classification
- Accounting Q&A systems
- Regulatory NLP tasks
- Domain adaptation

## Live Demo

**Interactive Chatbot**: [GAAP & SEC Compliance Chatbot](https://huggingface.co/spaces/aanshshah/gaap-sec-chatbot)

Try the live demonstration powered by this dataset. The chatbot uses quantized Phi-3-Mini with RAG to answer professional questions about:

- US GAAP accounting standards (ASC topics)
- SEC filing requirements and regulations
- Financial reporting compliance
- Accounting treatment guidance

**Note**: Demo runs on CPU-only hardware with intentional performance constraints for cost efficiency.
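Because roughly 95% of the records are XBRL tags (see the Content Distribution figures), retrieval or Q&A experiments may want to select a subset by `metadata.source` before indexing. The following is a minimal sketch under that assumption; `filter_by_source` and the two sample records are hypothetical and written only for illustration:

```python
from typing import Iterable, Iterator


def filter_by_source(records: Iterable[dict], source: str) -> Iterator[dict]:
    """Yield only the records whose metadata.source equals `source`."""
    for record in records:
        if record["metadata"]["source"] == source:
            yield record


# Two hand-written records for illustration only (not taken from the dataset).
samples = [
    {"id": "a", "content": "tag text", "metadata": {"source": "XBRL"}},
    {"id": "b", "content": "filing text", "metadata": {"source": "SEC_FILING"}},
]

sec_only = list(filter_by_source(samples, "SEC_FILING"))
print([record["id"] for record in sec_only])
```

Since the function only iterates over its input, it should also work on the `train` split of a streamed load, where it avoids holding the full dataset in memory.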
## Quick Start

### Load Dataset

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset")

# Or stream for memory efficiency
dataset = load_dataset("aanshshah/gaap-sec-compliance-dataset", streaming=True)

# Access examples
for example in dataset["train"]:
    print(f"Title: {example['metadata']['title']}")
    print(f"Source: {example['metadata']['source']}")
    print(f"Content: {example['content'][:200]}...")
    break
```

### Build RAG System

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss

# Load dataset
docs = load_dataset("aanshshah/gaap-sec-compliance-dataset")["train"]

# Create embeddings (encoding every document can take a long time on CPU)
encoder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = encoder.encode([doc["content"] for doc in docs])

# Build FAISS index (all-MiniLM-L6-v2 produces 384-dimensional vectors)
index = faiss.IndexFlatL2(384)
index.add(embeddings)

def search_docs(query, k=5):
    query_vec = encoder.encode([query])
    _, indices = index.search(query_vec, k)
    # Cast NumPy integers to plain ints before indexing the dataset
    return [docs[int(i)] for i in indices[0]]

# Example usage
results = search_docs("What is ASC 606?")
```

### Use with LangChain

```python
# Recent LangChain releases ship these integrations in langchain_community
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load documents
loader = HuggingFaceDatasetLoader(
    path="aanshshah/gaap-sec-compliance-dataset",
    page_content_column="content"
)
documents = loader.load()

# Create vector store
embeddings = HuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

# Query
results = vectorstore.similarity_search("revenue recognition", k=5)
```

## Dataset Structure

### Document Schema

Each document contains:

- **`id`**: Unique identifier
- **`content`**: Full text content
- **`metadata`**: Structured information including:
  - `source`: Origin (XBRL, SEC_FILING, GAAP_STANDARD)
  - `type`: Document type (tag, financial_data, standard)
  - `category`: Financial category (Assets, Revenue, etc.)
  - `code`: Standard code (e.g., "ASC 606", "us-gaap:Assets")
  - `title`: Human-readable title
  - `date`: Date in YYYY-MM-DD format
  - `company`: Company name (for SEC filings, null for others)
  - `form`: SEC form type (for SEC filings, null for others)

### Example Document

```json
{
  "id": "gaap_standard_67a64e72e3390f7e",
  "content": "# ASC 606: Revenue from Contracts with Customers...",
  "metadata": {
    "source": "GAAP_STANDARD",
    "type": "standard",
    "category": "Revenue",
    "code": "ASC 606",
    "title": "ASC 606: Revenue from Contracts with Customers",
    "date": "2025-01-01",
    "company": null,
    "form": null
  }
}
```

## Data Creation Process

### Sources

1. **XBRL US GAAP Taxonomy** (94.7%)
   - Complete standardized accounting tags
   - Hierarchical relationships preserved
2. **SEC EDGAR Database** (5.3%)
   - Real company 10-K/10-Q filings
   - Quarterly data from 2007-2025
3. **FASB Standards** (<0.1%)
   - Core GAAP standards (ASC)
   - Implementation guidance

### Processing Pipeline

1. **Extraction**: Parse XBRL, HTML, PDF sources
2. **Standardization**: Convert to consistent JSON format
3. **Cleaning**: Remove duplicates and invalid entries
4. **Enrichment**: Add metadata and categories
5. **Validation**: Ensure quality and completeness

## Applications in Production

### Financial Institutions

- Compliance monitoring systems
- Risk assessment tools
- Regulatory report generation
- Audit automation

### FinTech Companies

- AI-powered accounting assistants
- Automated bookkeeping
- Financial analysis platforms
- Investment research tools

### Education & Training

- Interactive learning platforms
- Professional certification prep
- Academic research
- Student Q&A systems

## Quality & Coverage

### Quality Metrics

- **Deduplicated**: No duplicate documents
- **Validated**: All required fields present
- **Cleaned**: Invalid entries removed
- **Structured**: Consistent schema
- **Current**: Up-to-date as of December 2025

### Coverage Areas

- Complete US GAAP taxonomy
- Major public company filings
- All accounting categories
- Historical and current standards
- Multiple filing types (10-K, 10-Q, 8-K)

## Legal & Ethics

### Data Sources

- All data from public sources
- No proprietary information
- SEC EDGAR publicly available filings
- XBRL taxonomy open standard

### Use Restrictions

- Not for investment advice
- Educational/research purposes
- Verify critical information with official sources
- Comply with applicable regulations

### Privacy

- No personal identifying information
- No material non-public information
- Only public company data
- Anonymized where appropriate

## Updates & Maintenance

### Version History

- **v1.0.0** (December 2025): Initial release with 470K documents

### Update Schedule

- Quarterly updates planned
- New SEC filings added
- GAAP standard updates included
- Community feedback incorporated

## Support & Community

### Getting Help

- [Discussions](https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset/discussions)
- [Issues](https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset/discussions/new)
- Contact via HuggingFace profile

### Contributing

- Report data quality issues
- Suggest additional sources
- Share use cases and applications
- Submit improvements

## Citation

If you use this dataset in your research or applications, please cite:

```bibtex
@dataset{gaap_sec_compliance_2025,
  title={GAAP & SEC Compliance Dataset},
  author={Shah, Aansh},
  year={2025},
  month={12},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/aanshshah/gaap-sec-compliance-dataset},
  note={A comprehensive dataset of 470,151 financial documents for AI applications}
}
```

## Acknowledgments

- **XBRL US** for taxonomy data
- **SEC EDGAR** for public filings
- **FASB** for accounting standards
- **HuggingFace** for hosting platform

---

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Documents | 470,151 |
| Characters | 171,055,320 |
| Companies | 6,573 |
| Date Span | 2007-01-31 to 2025-12-01 |
| Storage | ~296MB |

Built for the financial AI community

Ready to build the next generation of financial AI? Start with this dataset!
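The headline figures reported in this card can be cross-checked with a few lines of arithmetic. This is only a sketch: the constants below are copied from the card's Key Statistics, Content Distribution, and Dataset Statistics sections, and nothing is read from the dataset itself.

```python
from datetime import date

# Figures copied from the card; they are not recomputed from the data here.
TOTAL_DOCUMENTS = 470_151
TOTAL_CHARACTERS = 171_055_320
SOURCE_COUNTS = {"XBRL": 445_211, "SEC_FILING": 24_935, "GAAP_STANDARD": 5}

# Average document length in characters.
avg_chars = TOTAL_CHARACTERS / TOTAL_DOCUMENTS

# Share of each source as a percentage, rounded to one decimal place.
shares = {
    name: round(100 * count / TOTAL_DOCUMENTS, 1)
    for name, count in SOURCE_COUNTS.items()
}

# Number of days covered by the stated date range.
span_days = (date(2025, 12, 1) - date(2007, 1, 31)).days

print(f"average length: {avg_chars:.1f} characters")
print(f"source shares (%): {shares}")
print(f"per-source counts sum to total: {sum(SOURCE_COUNTS.values()) == TOTAL_DOCUMENTS}")
print(f"date span: {span_days} days")
```

The per-source counts add up to the document total, and the computed average (about 363.8 characters) agrees with the reported 363 if that figure was truncated rather than rounded.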
Provided by:
aanshshah