dylanalloy/swan
Hugging Face | updated 2023-11-21 | indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/dylanalloy/swan
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- finance
- legal
pretty_name: swan - finance dataset
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: corpus
    path: "corpus.txt"
  - split: corpus_deduped
    path: "corpus_deduped.txt"
  - split: sec_tracker
    path: "all_sec_filings.csv"
  - split: leaked_tracker
    path: "all_leaked_pdfs.csv"
  - split: fed_tracker
    path: "all_fed_filings.csv"
  - split: bls_jolts_tracker
    path: "all_bls_jolts.csv"
  - split: bls_cpi_tracker
    path: "all_bls_cpi.csv"
  - split: bls_ces_tracker
    path: "all_bls_ces.csv"
  - split: bls_historical_tracker
    path: "all_bls_historical.csv"
---
<!-- header start -->
<div style="min-width:100%">
<center>
<img style="max-width:200px" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/swan.png">
<h3>swan</h3>
<small>aggressively updated financial text dataset</small>
<a href="https://github.com/DylanAlloy/swan_scrape" target="_blank">scraping code</a>
</center>
</div>
<!-- header end -->
### usage
```python
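from datasets import load_dataset

# Load each text file as its own dataset; a single data file lands in the "train" split.
sets = ["corpus", "corpus_deduped"]
swan_data, swan_deduped = [
    load_dataset("dylanalloy/swan", data_files=f"{name}.txt") for name in sets
]
swan_data, swan_deduped  # in a notebook, this displays both DatasetDict objects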
```
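The tracker CSVs listed in the `configs` block can probably be loaded the same way, by file name. The sketch below is an illustration, not something the card documents: `TRACKER_FILES` and `load_tracker` are hypothetical names, the mapping is copied from the front matter, and the columns of each CSV are not described here, so inspect them after loading.

```python
# Split name -> file name, copied from the `configs` block of this card.
TRACKER_FILES = {
    "sec_tracker": "all_sec_filings.csv",
    "leaked_tracker": "all_leaked_pdfs.csv",
    "fed_tracker": "all_fed_filings.csv",
    "bls_jolts_tracker": "all_bls_jolts.csv",
    "bls_cpi_tracker": "all_bls_cpi.csv",
    "bls_ces_tracker": "all_bls_ces.csv",
    "bls_historical_tracker": "all_bls_historical.csv",
}


def load_tracker(name):
    """Load one tracker CSV from the Hub (hypothetical helper; needs network access)."""
    if name not in TRACKER_FILES:
        raise KeyError(f"unknown tracker {name!r}; expected one of {sorted(TRACKER_FILES)}")
    # Imported lazily so the mapping above is usable without `datasets` installed.
    from datasets import load_dataset

    # A single file passed through data_files is exposed as the "train" split.
    return load_dataset("dylanalloy/swan", data_files=TRACKER_FILES[name], split="train")
```

For example, `sec = load_tracker("sec_tracker")` followed by `sec.column_names` shows which fields the SEC tracker actually carries.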
### data
<center>
| data | added |
| ----------- | ----------- |
| SEC filings | Wed. Aug 30th, 2023 |
| Federal Reserve transcripts | Wed. Aug 30th, 2023 |
| private wealth management releases | Wed. Aug 30th, 2023 |
| large bank releases | Wed. Aug 30th, 2023 |
| large fund releases | Wed. Aug 30th, 2023 |
| large trading firm releases | Wed. Aug 30th, 2023 |
| BLS JOLTS releases | Wed. Aug 30th, 2023 |
| BLS CPI releases | Wed. Aug 30th, 2023 |
| BLS CES releases | Wed. Aug 30th, 2023 |
| BLS historical reports | Wed. Aug 30th, 2023 |
</center>
### updates
<small>this repo updates daily at 6AM EST</small>
| SEC filings | Federal Reserve transcripts | releases & reports |
| :--- | :----: | ---: |
| every 30 minutes | daily | daily |
<small>🐒 **corpus** ⌨️ updated daily</small>
### stats and delta
<center>
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/words_sizes.png">
<img style="max-width:100%" src="https://huggingface.co/datasets/dylanalloy/swan/resolve/main/vocab_sizes_time.png">
</center>
### organization
- *.csv: tracker files, one per source (for example all_sec_filings.csv)
- corpus.txt: collated text from all documents across all categories (designed for base model training)
- corpus_deduped.txt: unique lines of corpus
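
To make "unique lines" concrete, here is a minimal sketch of line-level deduplication. It is an assumption about what the step looks like, not the author's actual code (that lives in the scraping repository): `unique_lines` and `dedupe_file` are hypothetical names, and keeping first occurrences in their original order is a choice the card does not confirm.

```python
def unique_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        text = line.rstrip("\n")  # compare lines without their trailing newline
        if text not in seen:
            seen.add(text)
            yield text


def dedupe_file(src_path, dst_path):
    """Write the unique lines of src_path to dst_path, one per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(f"{text}\n" for text in unique_lines(src))
```

Calling `dedupe_file("corpus.txt", "my_corpus_deduped.txt")` on a local copy would produce a file comparable in spirit to corpus_deduped.txt, though the result may differ from the published file if the author normalizes lines differently.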
Provider:
dylanalloy
Summary of original information
Dataset overview
Basic information
- License: cc-by-nc-4.0
- Task category: text-generation
- Language: en
- Tags: finance, legal
- Name: swan - finance dataset
- Size category: 100K<n<1M
Configuration
- Config name: default
- Data files:
  - split: corpus
    - path: "corpus.txt"
  - split: corpus_deduped
    - path: "corpus_deduped.txt"
  - split: sec_tracker
    - path: "all_sec_filings.csv"
  - split: leaked_tracker
    - path: "all_leaked_pdfs.csv"
  - split: fed_tracker
    - path: "all_fed_filings.csv"
  - split: bls_jolts_tracker
    - path: "all_bls_jolts.csv"
  - split: bls_cpi_tracker
    - path: "all_bls_cpi.csv"
  - split: bls_ces_tracker
    - path: "all_bls_ces.csv"
  - split: bls_historical_tracker
    - path: "all_bls_historical.csv"
Data updates
- Update frequency: daily at 6AM EST
- Update interval by source:
  - SEC filings: every 30 minutes
  - Federal Reserve transcripts: daily
  - releases & reports: daily
Data organization
- File types:
  - *.csv: tracker
  - corpus.txt: collated text from all documents across all categories (for base model training)
  - corpus_deduped.txt: unique lines of corpus



