
eloukas/edgar-corpus

Hugging Face | Updated 2023-07-14 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/eloukas/edgar-corpus
Resource description:
--- dataset_info: - config_name: . features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 40306320885 num_examples: 220375 download_size: 10734208660 dataset_size: 40306320885 - config_name: full features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 32237457024 num_examples: 176289 - name: validation num_bytes: 4023129683 num_examples: 22050 - name: test num_bytes: 4045734178 num_examples: 22036 download_size: 40699852536 dataset_size: 40306320885 - config_name: year_1993 features: - name: filename dtype: string - name: cik dtype: string - name: 
year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 112714537 num_examples: 1060 - name: validation num_bytes: 13584432 num_examples: 133 - name: test num_bytes: 14520566 num_examples: 133 download_size: 141862572 dataset_size: 140819535 - config_name: year_1994 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 198955093 num_examples: 2083 - name: validation num_bytes: 23432307 num_examples: 261 - name: test num_bytes: 26115768 num_examples: 260 download_size: 250411041 dataset_size: 248503168 - config_name: year_1995 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - 
name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 356959049 num_examples: 4110 - name: validation num_bytes: 42781161 num_examples: 514 - name: test num_bytes: 45275568 num_examples: 514 download_size: 448617549 dataset_size: 445015778 - config_name: year_1996 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 738506135 num_examples: 7589 - name: validation num_bytes: 89873905 num_examples: 949 - name: test num_bytes: 91248882 num_examples: 949 download_size: 926536700 dataset_size: 919628922 - config_name: year_1997 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 
dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 854201733 num_examples: 8084 - name: validation num_bytes: 103167272 num_examples: 1011 - name: test num_bytes: 106843950 num_examples: 1011 download_size: 1071898139 dataset_size: 1064212955 - config_name: year_1998 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 904075497 num_examples: 8040 - name: validation num_bytes: 112630658 num_examples: 1006 - name: test num_bytes: 113308750 num_examples: 1005 download_size: 1137887615 dataset_size: 1130014905 - config_name: year_1999 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: 
string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 911374885 num_examples: 7864 - name: validation num_bytes: 118614261 num_examples: 984 - name: test num_bytes: 116706581 num_examples: 983 download_size: 1154736765 dataset_size: 1146695727 - config_name: year_2000 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 926444625 num_examples: 7589 - name: validation num_bytes: 113264749 num_examples: 949 - name: test num_bytes: 114605470 num_examples: 949 download_size: 1162526814 dataset_size: 1154314844 - config_name: year_2001 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - 
name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 964631161 num_examples: 7181 - name: validation num_bytes: 117509010 num_examples: 898 - name: test num_bytes: 116141097 num_examples: 898 download_size: 1207790205 dataset_size: 1198281268 - config_name: year_2002 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1049271720 num_examples: 6636 - name: validation num_bytes: 128339491 num_examples: 830 - name: test num_bytes: 128444184 num_examples: 829 download_size: 1317817728 dataset_size: 1306055395 - config_name: year_2003 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: 
section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1027557690 num_examples: 6672 - name: validation num_bytes: 126684704 num_examples: 834 - name: test num_bytes: 130672979 num_examples: 834 download_size: 1297227566 dataset_size: 1284915373 - config_name: year_2004 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1129657843 num_examples: 7111 - name: validation num_bytes: 147499772 num_examples: 889 - name: test num_bytes: 147890092 num_examples: 889 download_size: 1439663100 dataset_size: 1425047707 - config_name: year_2005 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A 
dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1200714441 num_examples: 7113 - name: validation num_bytes: 161003977 num_examples: 890 - name: test num_bytes: 160727195 num_examples: 889 download_size: 1538876195 dataset_size: 1522445613 - config_name: year_2006 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1286566049 num_examples: 7064 - name: validation num_bytes: 160843494 num_examples: 883 - name: test num_bytes: 163270601 num_examples: 883 download_size: 1628452618 dataset_size: 1610680144 - config_name: year_2007 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: 
string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1296737173 num_examples: 6683 - name: validation num_bytes: 166735560 num_examples: 836 - name: test num_bytes: 156399535 num_examples: 835 download_size: 1637502176 dataset_size: 1619872268 - config_name: year_2008 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1525698198 num_examples: 7408 - name: validation num_bytes: 190034435 num_examples: 927 - name: test num_bytes: 187659976 num_examples: 926 download_size: 1924164839 dataset_size: 1903392609 - config_name: year_2009 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - 
name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1547816260 num_examples: 7336 - name: validation num_bytes: 188897783 num_examples: 917 - name: test num_bytes: 196463897 num_examples: 917 download_size: 1954076983 dataset_size: 1933177940 - config_name: year_2010 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1493505900 num_examples: 7013 - name: validation num_bytes: 192695567 num_examples: 877 - name: test num_bytes: 191482640 num_examples: 877 download_size: 1897687327 dataset_size: 1877684107 - config_name: year_2011 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: 
section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1481486551 num_examples: 6724 - name: validation num_bytes: 190781558 num_examples: 841 - name: test num_bytes: 185869151 num_examples: 840 download_size: 1877396421 dataset_size: 1858137260 - config_name: year_2012 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1463496224 num_examples: 6479 - name: validation num_bytes: 186247306 num_examples: 810 - name: test num_bytes: 185923601 num_examples: 810 download_size: 1854377191 dataset_size: 1835667131 - config_name: year_2013 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B 
dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1468172419 num_examples: 6372 - name: validation num_bytes: 183570866 num_examples: 797 - name: test num_bytes: 182495750 num_examples: 796 download_size: 1852839009 dataset_size: 1834239035 - config_name: year_2014 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1499451593 num_examples: 6261 - name: validation num_bytes: 181568907 num_examples: 783 - name: test num_bytes: 181046535 num_examples: 783 download_size: 1880963095 dataset_size: 1862067035 - config_name: year_2015 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: 
string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1472346721 num_examples: 6028 - name: validation num_bytes: 180128910 num_examples: 754 - name: test num_bytes: 189210252 num_examples: 753 download_size: 1860303134 dataset_size: 1841685883 - config_name: year_2016 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1471605426 num_examples: 5812 - name: validation num_bytes: 178310005 num_examples: 727 - name: test num_bytes: 177481471 num_examples: 727 download_size: 1845967492 dataset_size: 1827396902 - config_name: year_2017 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - 
name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1459021126 num_examples: 5635 - name: validation num_bytes: 174360913 num_examples: 705 - name: test num_bytes: 184398250 num_examples: 704 download_size: 1836306408 dataset_size: 1817780289 - config_name: year_2018 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1433409319 num_examples: 5508 - name: validation num_bytes: 181466460 num_examples: 689 - name: test num_bytes: 182594965 num_examples: 688 download_size: 1815810567 dataset_size: 1797470744 - config_name: year_2019 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: 
section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1421232269 num_examples: 5354 - name: validation num_bytes: 175603562 num_examples: 670 - name: test num_bytes: 176336174 num_examples: 669 download_size: 1791237155 dataset_size: 1773172005 - config_name: year_2020 features: - name: filename dtype: string - name: cik dtype: string - name: year dtype: string - name: section_1 dtype: string - name: section_1A dtype: string - name: section_1B dtype: string - name: section_2 dtype: string - name: section_3 dtype: string - name: section_4 dtype: string - name: section_5 dtype: string - name: section_6 dtype: string - name: section_7 dtype: string - name: section_7A dtype: string - name: section_8 dtype: string - name: section_9 dtype: string - name: section_9A dtype: string - name: section_9B dtype: string - name: section_10 dtype: string - name: section_11 dtype: string - name: section_12 dtype: string - name: section_13 dtype: string - name: section_14 dtype: string - name: section_15 dtype: string splits: - name: train num_bytes: 1541847387 num_examples: 5480 - name: validation num_bytes: 193498658 num_examples: 686 - name: test num_bytes: 192600298 num_examples: 685 download_size: 1946916132 dataset_size: 1927946343 annotations_creators: - no-annotation language: - en language_creators: - other license: - apache-2.0 multilinguality: - monolingual pretty_name: EDGAR-CORPUS (10-K Filings from 1999 to 2020) size_categories: - 100K<n<1M 
source_datasets: - extended|other tags: - research papers - edgar - sec - finance - financial - filings - 10K - 10-K - nlp - research - econlp - economics - business task_categories: - other task_ids: [] ---

# Dataset Card for [EDGAR-CORPUS]

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [References](#references)
  - [Contributions](#contributions)

## Dataset Description

- **Point of Contact: Lefteris Loukas**

### Dataset Summary

This dataset card is based on the paper **EDGAR-CORPUS: Billions of Tokens Make The World Go Round** by _Lefteris Loukas et al._, published at the _ECONLP 2021_ workshop.

The dataset contains the annual reports of public companies for 1993-2020, taken from SEC EDGAR filings.

Each year can also be loaded on its own as a separate configuration.

Note: because this is a corpus, the `train/val/test` splits carry no special meaning. They exist only because the default HF card format expects them.

If you wish to load specific year(s) of specific companies, you probably want to use the open-source software that generated this dataset, EDGAR-CRAWLER: https://github.com/nlpaueb/edgar-crawler.
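
The per-year configurations follow the `year_YYYY` naming used in the metadata above. The following is a minimal sketch of loading one year and merging its splits, assuming the Hugging Face `datasets` library is installed. `year_config` and `load_year_as_corpus` are hypothetical helper names introduced here for illustration; they are not part of the dataset or the library, and depending on the installed `datasets` version the loading call may need extra arguments.

```python
def year_config(year: int) -> str:
    """Map a filing year to its configuration name, e.g. 1993 -> 'year_1993'."""
    if not 1993 <= year <= 2020:
        raise ValueError(f"EDGAR-CORPUS covers 1993-2020, got {year}")
    return f"year_{year}"


def load_year_as_corpus(year: int):
    """Load every split of one year and merge them into a single Dataset.

    Hypothetical helper: merging is an illustrative choice, motivated by the
    note that the train/validation/test boundaries carry no special meaning.
    """
    # Imported lazily so year_config stays usable without the library.
    import datasets

    splits = datasets.load_dataset("eloukas/edgar-corpus", year_config(year))
    return datasets.concatenate_datasets(list(splits.values()))
```

Calling `load_year_as_corpus(1993)` downloads that year's files, so it needs network access and roughly the disk space listed for the configuration.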
## Citation

If this work helps or inspires you in any way, please consider citing the relevant paper published at the [3rd Economics and Natural Language Processing (ECONLP) workshop](https://lt3.ugent.be/econlp/) at EMNLP 2021 (Punta Cana, Dominican Republic):

```
@inproceedings{loukas-etal-2021-edgar,
    title = "{EDGAR}-{CORPUS}: Billions of Tokens Make The World Go Round",
    author = "Loukas, Lefteris and
      Fergadiotis, Manos and
      Androutsopoulos, Ion and
      Malakasiotis, Prodromos",
    booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.econlp-1.2",
    pages = "13--18",
}
```

### Supported Tasks

This is a raw dataset/corpus for financial NLP. As such, there are no annotations or labels.

### Languages

The EDGAR Filings are in English.

## Dataset Structure

### Data Instances

Refer to the dataset preview.

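For orientation, each instance can be read as a flat dictionary of strings keyed by the features declared in the card metadata (`filename`, `cik`, `year`, and the `section_*` items). The sketch below uses a made-up record with placeholder values, not real filing text. The helper name `non_empty_sections` is hypothetical, and treating an absent section as an empty string is an assumption that the card does not state.

```python
from typing import Dict, List

# Hypothetical record: field names follow the card metadata,
# all values are placeholders rather than real EDGAR content.
example_record: Dict[str, str] = {
    "filename": "PLACEHOLDER_FILENAME",
    "cik": "PLACEHOLDER_CIK",
    "year": "2020",
    "section_1": "Placeholder text for Item 1.",
    "section_1A": "",  # assumption: a missing section is an empty string
    "section_7": "Placeholder text for Item 7.",
}


def non_empty_sections(record: Dict[str, str]) -> List[str]:
    """Return the names of section_* fields that contain non-whitespace text."""
    return [
        name
        for name, text in record.items()
        if name.startswith("section_") and text.strip()
    ]


# `year` is declared as a string feature, so convert it before numeric comparisons.
report_year = int(example_record["year"])

print(report_year, non_empty_sections(example_record))
```

The same helper should work on rows returned by the loading calls shown in this card, provided the rows are plain dictionaries of strings.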
### Data Fields

**filename**: Name of file on EDGAR from which the report was extracted.<br>
**cik**: EDGAR identifier for a firm.<br>
**year**: Year of report.<br>
**section_1**: Corresponding section of the Annual Report.<br>
**section_1A**: Corresponding section of the Annual Report.<br>
**section_1B**: Corresponding section of the Annual Report.<br>
**section_2**: Corresponding section of the Annual Report.<br>
**section_3**: Corresponding section of the Annual Report.<br>
**section_4**: Corresponding section of the Annual Report.<br>
**section_5**: Corresponding section of the Annual Report.<br>
**section_6**: Corresponding section of the Annual Report.<br>
**section_7**: Corresponding section of the Annual Report.<br>
**section_7A**: Corresponding section of the Annual Report.<br>
**section_8**: Corresponding section of the Annual Report.<br>
**section_9**: Corresponding section of the Annual Report.<br>
**section_9A**: Corresponding section of the Annual Report.<br>
**section_9B**: Corresponding section of the Annual Report.<br>
**section_10**: Corresponding section of the Annual Report.<br>
**section_11**: Corresponding section of the Annual Report.<br>
**section_12**: Corresponding section of the Annual Report.<br>
**section_13**: Corresponding section of the Annual Report.<br>
**section_14**: Corresponding section of the Annual Report.<br>
**section_15**: Corresponding section of the Annual Report.<br>

```python
import datasets

# Load the entire dataset
raw_dataset = datasets.load_dataset("eloukas/edgar-corpus", "full")

# Load a specific year and split
year_1993_training_dataset = datasets.load_dataset("eloukas/edgar-corpus", "year_1993", split="train")
```

### Data Splits

| Config    | Training | Validation | Test   |
| --------- | -------- | ---------- | ------ |
| full      | 176,289  | 22,050     | 22,036 |
| year_1993 | 1,060    | 133        | 133    |
| year_1994 | 2,083    | 261        | 260    |
| year_1995 | 4,110    | 514        | 514    |
| year_1996 | 7,589    | 949        | 949    |
| year_1997 | 8,084    | 1,011      | 1,011  |
| year_1998 | 8,040    | 1,006      | 1,005  |
| year_1999 | 7,864    | 984        | 983    |
| year_2000 | 7,589    | 949        | 949    |
| year_2001 | 7,181    | 898        | 898    |
| year_2002 | 6,636    | 830        | 829    |
| year_2003 | 6,672    | 834        | 834    |
| year_2004 | 7,111    | 889        | 889    |
| year_2005 | 7,113    | 890        | 889    |
| year_2006 | 7,064    | 883        | 883    |
| year_2007 | 6,683    | 836        | 835    |
| year_2008 | 7,408    | 927        | 926    |
| year_2009 | 7,336    | 917        | 917    |
| year_2010 | 7,013    | 877        | 877    |
| year_2011 | 6,724    | 841        | 840    |
| year_2012 | 6,479    | 810        | 810    |
| year_2013 | 6,372    | 797        | 796    |
| year_2014 | 6,261    | 783        | 783    |
| year_2015 | 6,028    | 754        | 753    |
| year_2016 | 5,812    | 727        | 727    |
| year_2017 | 5,635    | 705        | 704    |
| year_2018 | 5,508    | 689        | 688    |
| year_2019 | 5,354    | 670        | 669    |
| year_2020 | 5,480    | 686        | 685    |

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

Initial data was collected and processed by the authors of the research paper **EDGAR-CORPUS: Billions of Tokens Make The World Go Round**.

#### Who are the source language producers?

Public firms filing with the SEC.

### Annotations

#### Annotation process

NA

#### Who are the annotators?

NA

### Personal and Sensitive Information

The dataset contains public filings data from the SEC.

## Considerations for Using the Data

### Social Impact of Dataset

Low to none.

### Discussion of Biases

The dataset is about financial information of public companies, and as such the tone and style of the text are in line with financial literature.

### Other Known Limitations

The dataset needs further cleaning for improved performance.

## Additional Information

### Licensing Information

EDGAR data is publicly available.

### Shoutout

Huge shoutout to [@JanosAudran](https://huggingface.co/JanosAudran) for the HF Card setup!

### References

- [Research Paper] Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis. EDGAR-CORPUS: Billions of Tokens Make The World Go Round. Third Workshop on Economics and Natural Language Processing (ECONLP). https://arxiv.org/abs/2109.14394 - Punta Cana, Dominican Republic, November 2021.
- [Software] Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis. EDGAR-CRAWLER. https://github.com/nlpaueb/edgar-crawler (2021)
- [EDGAR CORPUS, but in zip files] EDGAR CORPUS: A corpus for financial NLP research, built from SEC's EDGAR. https://zenodo.org/record/5528490 (2021)
- [Word Embeddings] EDGAR-W2V: Word2vec Embeddings trained on EDGAR-CORPUS. https://zenodo.org/record/5524358 (2021)
- [Applied Research paper where EDGAR-CORPUS is used] Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and George Paliouras. FiNER: Financial Numeric Entity Recognition for XBRL Tagging. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.303 (2022)
Provider:
eloukas
Summary of Original Information

Dataset Overview

Dataset Configuration Information

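  1. Config names:

    • .: the default config
    • full: the complete dataset
    • year_1993 through year_2020: one config per filing year
  2. Dataset features:

    • All configs share the same feature structure:
      • filename: string
      • cik: string
      • year: string
      • section_1 through section_15 (including section_1A, section_1B, section_7A, section_9A, and section_9B): all strings
  3. Dataset splits:

    • Each config contains the following splits:
      • train: training set
      • validation: validation set
      • test: test set
    • Each split reports:
      • num_bytes: data size in bytes
      • num_examples: number of examples
  4. Dataset size:

    • Each config's dataset_size gives the total size of the dataset.
    • For example, the full config has a dataset_size of 40306320885 bytes.
  5. Download size:

    • Each config's download_size gives the amount of data that must be downloaded.
    • For example, the full config has a download_size of 40699852536 bytes.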

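Example Config Details

  • Config name: full

    • Features: as above
    • Splits:
      • train: 32237457024 bytes, 176289 examples
      • validation: 4023129683 bytes, 22050 examples
      • test: 4045734178 bytes, 22036 examples
    • Dataset size: 40306320885 bytes
    • Download size: 40699852536 bytes
  • Config name: year_1993

    • Features: as above
    • Splits:
      • train: 112714537 bytes, 1060 examples
      • validation: 13584432 bytes, 133 examples
      • test: 14520566 bytes, 133 examples
    • Dataset size: 140819535 bytes
    • Download size: 141862572 bytes
  • Config name: year_2008

    • Features: as above
    • Splits:
      • train: not provided
      • validation: not provided
      • test: not provided
    • Dataset size: not provided
    • Download size: not provided

The information above is based on the contents of the provided README file; data that was not provided could not be summarized.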
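
The per-split byte counts quoted above can be cross-checked against the reported dataset sizes with simple arithmetic. The following is a minimal sketch that uses only the figures listed in this summary; the dictionary layout and the helper name `splits_match_total` are illustrative choices, not part of the dataset or of any library.

```python
from typing import Dict

# Figures copied from the summary above (bytes per split and total dataset size).
reported: Dict[str, dict] = {
    "full": {
        "splits": {"train": 32237457024, "validation": 4023129683, "test": 4045734178},
        "dataset_size": 40306320885,
    },
    "year_1993": {
        "splits": {"train": 112714537, "validation": 13584432, "test": 14520566},
        "dataset_size": 140819535,
    },
}


def splits_match_total(config: dict) -> bool:
    """Check that the split byte counts add up to the reported dataset_size."""
    return sum(config["splits"].values()) == config["dataset_size"]


for name, config in reported.items():
    print(name, splits_match_total(config))
```

For both configs listed here the split sizes add up exactly to `dataset_size`. The `download_size` values are separate figures and are not expected to equal this sum.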
