dylanalloy/swan

Name: dylanalloy/swan
Creator: dylanalloy
Published: 2023-11-21 11:00:06
License: cc-by-nc-4.0

Hugging Face · Updated 2023-11-21 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/dylanalloy/swan


Resource description:

---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- finance
- legal
pretty_name: swan - finance dataset
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: corpus
    path: "corpus.txt"
  - split: corpus_deduped
    path: "corpus_deduped.txt"
  - split: sec_tracker
    path: "all_sec_filings.csv"
  - split: leaked_tracker
    path: "all_leaked_pdfs.csv"
  - split: fed_tracker
    path: "all_fed_filings.csv"
  - split: bls_jolts_tracker
    path: "all_bls_jolts.csv"
  - split: bls_cpi_tracker
    path: "all_bls_cpi.csv"
  - split: bls_ces_tracker
    path: "all_bls_ces.csv"
  - split: bls_historical_tracker
    path: "all_bls_historical.csv"
---

<div style="min-width:100%">
<center>
<img style="max-width:200px" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/swan.png">
<h3>swan</h3>
<small>aggressively updated financial text dataset</small>
<a href="https://github.com/DylanAlloy/swan_scrape" target="_blank">scraping code</a>
</center>
</div>

### usage

```python
from datasets import load_dataset

# Load the full corpus and its deduplicated variant as two separate datasets.
sets = ["corpus", "corpus_deduped"]
swan_data, swan_deduped = [
    load_dataset("dylanalloy/swan", data_files=f"{name}.txt") for name in sets
]
swan_data, swan_deduped
```

### data

<center>

| data | added |
| ----------- | ----------- |
| SEC filings | Wed. Aug 30th, 2023 |
| Federal Reserve transcripts | Wed. Aug 30th, 2023 |
| private wealth management releases | Wed. Aug 30th, 2023 |
| large bank releases | Wed. Aug 30th, 2023 |
| large fund releases | Wed. Aug 30th, 2023 |
| large trading firm releases | Wed. Aug 30th, 2023 |
| BLS JOLTS releases | Wed. Aug 30th, 2023 |
| BLS CPI releases | Wed. Aug 30th, 2023 |
| BLS CES releases | Wed. Aug 30th, 2023 |
| BLS historical reports | Wed. Aug 30th, 2023 |

</center>

### updates

<small>this repo updates daily at 6AM EST</small>

| SEC Filings | Federal Reserve transcripts | releases & reports |
| :--- | :----: | ---: |
| 30 minutes | daily | daily |

<small>🐒 **corpus** ⌨️ updated daily</small>

### stats and delta

<center>
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/words_sizes.png">
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/vocab_sizes_time.png">
</center>

### organization

- *.csv: tracker
- corpus.txt: collated text from all documents across all categories (designed for base model training)
- corpus_deduped.txt: unique lines of corpus

Provider:

dylanalloy

Summary of original information

Dataset overview

Basic information

License: cc-by-nc-4.0
Task category: text-generation
Language: en
Tags: finance, legal
Name: swan - finance dataset
Size category: 100K<n<1M

Configuration

Config name: default
- Data files:
  - split: corpus
    - path: "corpus.txt"
  - split: corpus_deduped
    - path: "corpus_deduped.txt"
  - split: sec_tracker
    - path: "all_sec_filings.csv"
  - split: leaked_tracker
    - path: "all_leaked_pdfs.csv"
  - split: fed_tracker
    - path: "all_fed_filings.csv"
  - split: bls_jolts_tracker
    - path: "all_bls_jolts.csv"
  - split: bls_cpi_tracker
    - path: "all_bls_cpi.csv"
  - split: bls_ces_tracker
    - path: "all_bls_ces.csv"
  - split: bls_historical_tracker
    - path: "all_bls_historical.csv"
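
The split-to-file mapping above can be used to load one file at a time instead of the whole repository. The following is a minimal sketch, assuming the `datasets` library is installed and the file names on the Hub still match this listing. `SPLIT_FILES` and `load_swan_split` are hypothetical helper names introduced for illustration and are not part of the dataset; the column layout of the tracker CSVs is not documented here, so inspect the result before relying on it.

```python
# Split names and paths copied from the configuration listing above.
# SPLIT_FILES and load_swan_split are illustrative names, not part of the dataset.
SPLIT_FILES = {
    "corpus": "corpus.txt",
    "corpus_deduped": "corpus_deduped.txt",
    "sec_tracker": "all_sec_filings.csv",
    "leaked_tracker": "all_leaked_pdfs.csv",
    "fed_tracker": "all_fed_filings.csv",
    "bls_jolts_tracker": "all_bls_jolts.csv",
    "bls_cpi_tracker": "all_bls_cpi.csv",
    "bls_ces_tracker": "all_bls_ces.csv",
    "bls_historical_tracker": "all_bls_historical.csv",
}


def load_swan_split(split_name):
    """Load a single file from dylanalloy/swan by split name (needs network access)."""
    # Imported lazily so the mapping itself can be used without the library.
    from datasets import load_dataset

    path = SPLIT_FILES[split_name]
    # When data_files is a single path, the rows are exposed as the "train" split.
    return load_dataset("dylanalloy/swan", data_files=path, split="train")
```

For example, `load_swan_split("sec_tracker")` would fetch only `all_sec_filings.csv`, which avoids downloading the much larger corpus files when only a tracker is needed.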

Data updates

Update frequency: daily at 6AM EST
Update interval by source:
- SEC Filings: 30 minutes
- Federal Reserve transcripts: daily
- releases & reports: daily

Data organization

File types:
- *.csv: tracker
- corpus.txt: collated text from all documents across all categories (designed for base model training)
- corpus_deduped.txt: unique lines of corpus
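
To make "unique lines of corpus" concrete, here is a small sketch of line-level deduplication. It is only an illustration: the source does not say how `corpus_deduped.txt` is actually produced, so keeping first-seen order and comparing lines without their trailing newline are both assumptions, and `unique_lines` is a hypothetical helper name.

```python
def unique_lines(lines):
    """Yield each distinct line once, in the order it first appears.

    Assumption: lines are compared after stripping the trailing newline only;
    the dataset's real deduplication rule is not documented.
    """
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield key


# Usage on a small in-memory sample; a local copy of corpus.txt could be
# passed as an open file handle in the same way.
sample = ["rates held steady\n", "cpi rose\n", "rates held steady\n", "jobs report"]
deduped = list(unique_lines(sample))
```

Because the function is a generator that only stores the set of lines already seen, it can stream a large text file without holding the whole output in memory.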
