
dylanalloy/swan

Source: Hugging Face. Updated 2023-11-21. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dylanalloy/swan
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- finance
- legal
pretty_name: swan - finance dataset
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: corpus
    path: "corpus.txt"
  - split: corpus_deduped
    path: "corpus_deduped.txt"
  - split: sec_tracker
    path: "all_sec_filings.csv"
  - split: leaked_tracker
    path: "all_leaked_pdfs.csv"
  - split: fed_tracker
    path: "all_fed_filings.csv"
  - split: bls_jolts_tracker
    path: "all_bls_jolts.csv"
  - split: bls_cpi_tracker
    path: "all_bls_cpi.csv"
  - split: bls_ces_tracker
    path: "all_bls_ces.csv"
  - split: bls_historical_tracker
    path: "all_bls_historical.csv"
---

<!-- header start -->
<div style="min-width:100%">
<center>
<img style="max-width:200px" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/swan.png">
<h3>swan</h3>
<small>aggressively updated financial text dataset</small>
<a href="https://github.com/DylanAlloy/swan_scrape" target="_blank">scraping code</a>
</center>
</div>
<!-- header end -->

### usage

```python
from datasets import load_dataset

# Load the full corpus and its line-deduplicated variant as two datasets.
sets = ["corpus", "corpus_deduped"]
swan_data, swan_deduped = [
    load_dataset("dylanalloy/swan", data_files=f"{name}.txt") for name in sets
]
swan_data, swan_deduped
```

### data

<center>

| data | added |
| ----------- | ----------- |
| SEC filings | Wed. Aug 30th, 2023 |
| Federal Reserve transcripts | Wed. Aug 30th, 2023 |
| private wealth management releases | Wed. Aug 30th, 2023 |
| large bank releases | Wed. Aug 30th, 2023 |
| large fund releases | Wed. Aug 30th, 2023 |
| large trading firm releases | Wed. Aug 30th, 2023 |
| BLS JOLTS releases | Wed. Aug 30th, 2023 |
| BLS CPI releases | Wed. Aug 30th, 2023 |
| BLS CES releases | Wed. Aug 30th, 2023 |
| BLS historical reports | Wed. Aug 30th, 2023 |

</center>

### updates

<small>this repo updates daily at 6AM EST</small>

| SEC Filings | Federal Reserve transcripts | releases & reports |
| :--- | :----: | ---: |
| 30 minutes | daily | daily |

<small>🐒 **corpus** ⌨️ updated daily</small>

### stats and delta

<center>
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/words_sizes.png">
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/vocab_sizes_time.png">
</center>

### organization

- *.csv: tracker
- corpus.txt: collated text from all documents across all categories (designed for base model training)
- corpus_deduped.txt: unique lines of corpus
Provided by:
dylanalloy
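Summary of original information

Dataset overview

Basic information

  • License: cc-by-nc-4.0
  • Task category: text-generation
  • Language: en
  • Tags: finance, legal
  • Name: swan - finance dataset
  • Size category: 100K<n<1M

Configuration

  • Config name: default
    • Data files: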
      • split: corpus
        • path: "corpus.txt"
      • split: corpus_deduped
        • path: "corpus_deduped.txt"
      • split: sec_tracker
        • path: "all_sec_filings.csv"
      • split: leaked_tracker
        • path: "all_leaked_pdfs.csv"
      • split: fed_tracker
        • path: "all_fed_filings.csv"
      • split: bls_jolts_tracker
        • path: "all_bls_jolts.csv"
      • split: bls_cpi_tracker
        • path: "all_bls_cpi.csv"
      • split: bls_ces_tracker
        • path: "all_bls_ces.csv"
      • split: bls_historical_tracker
        • path: "all_bls_historical.csv"
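
The split-to-file mapping above can be used to load a single split by name. The following is a minimal sketch, not the maintainer's code: `SPLIT_FILES`, `data_files_for`, and `load_swan_split` are hypothetical helper names introduced here, and the sketch assumes that passing a `{split: path}` mapping as `data_files` to `load_dataset` works for this repository in the same way the single-file call in the usage snippet does. One split is loaded per call because the corpus files are plain text while the trackers are CSV.

```python
# Split name -> file path, copied from the `configs` block of the dataset card.
SPLIT_FILES = {
    "corpus": "corpus.txt",
    "corpus_deduped": "corpus_deduped.txt",
    "sec_tracker": "all_sec_filings.csv",
    "leaked_tracker": "all_leaked_pdfs.csv",
    "fed_tracker": "all_fed_filings.csv",
    "bls_jolts_tracker": "all_bls_jolts.csv",
    "bls_cpi_tracker": "all_bls_cpi.csv",
    "bls_ces_tracker": "all_bls_ces.csv",
    "bls_historical_tracker": "all_bls_historical.csv",
}


def data_files_for(split):
    """Return a one-entry {split: path} mapping (hypothetical helper)."""
    if split not in SPLIT_FILES:
        raise KeyError(f"unknown split: {split!r}")
    return {split: SPLIT_FILES[split]}


def load_swan_split(split):
    """Load one split from the Hub (hypothetical helper; needs network access)."""
    # Imported lazily so the mapping above is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("dylanalloy/swan", data_files=data_files_for(split))
```

For example, `load_swan_split("sec_tracker")` would request only `all_sec_filings.csv` rather than the whole repository.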

Data updates

  • Update frequency: daily at 6AM EST
  • Update interval by source:
    • SEC filings: 30 minutes
    • Federal Reserve transcripts: daily
    • releases & reports: daily

Data organization

  • File types:
    • *.csv: tracker
    • corpus.txt: collated text from all documents across all categories (designed for base model training)
    • corpus_deduped.txt: unique lines of corpus
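
To illustrate what "unique lines of corpus" could mean in practice, here is a small sketch of order-preserving line deduplication. It is an assumption about the behavior, not the dataset's actual pipeline: the source does not say whether the original order is kept or how whitespace is handled, and `unique_lines` is a hypothetical name.

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-seen order (illustrative only)."""
    seen = set()
    for line in lines:
        # Strip only the trailing newline so a final unterminated line compares equal.
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield key


# Small in-memory example; a real run would iterate over an open corpus.txt instead.
sample = ["rates held steady\n", "cpi rose\n", "rates held steady\n", "cpi rose"]
deduped = list(unique_lines(sample))
```

Because the function is a generator that holds only the set of seen lines, it can stream a large file line by line, although the set itself still grows with the number of distinct lines.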