JonusNattapong/Ratchakitcha
Source: Hugging Face · Updated 2025-12-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JonusNattapong/Ratchakitcha
Resource description:
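The card metadata that follows groups files into decade configs by matching the year directory against glob patterns such as `196*` or `19[0-5]*`. As a rough sketch of how those patterns bucket a given year, the snippet below copies the year patterns from the decade configs and tests them with Python's `fnmatch`. This is only an illustration: `fnmatch` stands in for whatever glob matching the hub loader actually performs, only the year path component is checked, and `DECADE_PATTERNS` and `configs_for_year` are hypothetical names introduced here.

```python
from fnmatch import fnmatchcase

# Year-directory patterns copied from the decade configs in the card metadata.
# The single-year configs (subset_2025, subset_1885, ...) are left out.
DECADE_PATTERNS = {
    "subset_2020s": ["202*"],
    "subset_2010s": ["201*"],
    "subset_2000s": ["200*"],
    "subset_1990s": ["199*"],
    "subset_1980s": ["198*"],
    "subset_1970s": ["197*"],
    "subset_1960s": ["196*"],
    "subset_pre_1960": ["18*", "19[0-5]*"],
}


def configs_for_year(year: int) -> list[str]:
    """Return the decade configs whose year pattern matches `year`."""
    name = str(year)
    return [
        config
        for config, patterns in DECADE_PATTERNS.items()
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    for year in (1885, 1957, 1960, 2025):
        print(year, configs_for_year(year))
```

Under these assumptions each year falls into exactly one decade config, with everything from the 1800s through 1959 collected under `subset_pre_1960`.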
---
license: cc-by-4.0
task_categories:
- text-retrieval
- text-generation
- question-answering
language:
- th
pretty_name: Royal Gazette Thailand (Ratchakitcha)
size_categories:
- 1M<n<10M
tags:
- law
- thailand
- government
- open-data
dataset_info:
- config_name: ocr
configs:
- config_name: ocr
  data_files:
  - split: train
    path: ocr/**/*.jsonl
- config_name: meta
  data_files:
  - split: train
    path: meta/**/*.jsonl
- config_name: pdf_recent
  data_files:
  - split: train
    path: pdf/**/*.pdf
- config_name: pdf_archive
  data_files:
  - split: train
    path: zip/**/*.zip
- config_name: subset_2020s
  data_files:
  - split: train
    path:
    - ocr/*/202*/*.jsonl
    - meta/202*/*.jsonl
- config_name: subset_2025
  data_files:
  - split: train
    path:
    - ocr/*/2025/*.jsonl
    - meta/2025/*.jsonl
- config_name: subset_2010s
  data_files:
  - split: train
    path:
    - ocr/*/201*/*.jsonl
    - meta/201*/*.jsonl
- config_name: subset_2000s
  data_files:
  - split: train
    path:
    - ocr/*/200*/*.jsonl
    - meta/200*/*.jsonl
- config_name: subset_1990s
  data_files:
  - split: train
    path:
    - ocr/*/199*/*.jsonl
    - meta/199*/*.jsonl
- config_name: subset_1980s
  data_files:
  - split: train
    path:
    - ocr/*/198*/*.jsonl
    - meta/198*/*.jsonl
- config_name: subset_1970s
  data_files:
  - split: train
    path:
    - ocr/*/197*/*.jsonl
    - meta/197*/*.jsonl
- config_name: subset_1960s
  data_files:
  - split: train
    path:
    - ocr/*/196*/*.jsonl
    - meta/196*/*.jsonl
- config_name: subset_pre_1960
  data_files:
  - split: train
    path:
    - ocr/*/18*/*.jsonl
    - ocr/*/19[0-5]*/*.jsonl
    - meta/18*/*.jsonl
    - meta/19[0-5]*/*.jsonl
- config_name: subset_1885
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 18469
num_examples: 127
download_size: 8481
dataset_size: 18469
- config_name: subset_1886
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 57279
num_examples: 381
download_size: 17352
dataset_size: 57279
- config_name: subset_1887
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 108634
num_examples: 460
download_size: 36105
dataset_size: 108634
- config_name: subset_1888
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 113648
num_examples: 482
download_size: 39047
dataset_size: 113648
- config_name: subset_1889
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 103021
num_examples: 503
download_size: 31583
dataset_size: 103021
- config_name: subset_1890
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 86336
num_examples: 531
download_size: 24582
dataset_size: 86336
- config_name: subset_1891
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 80925
num_examples: 509
download_size: 24518
dataset_size: 80925
- config_name: subset_1892
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 102398
num_examples: 523
download_size: 30350
dataset_size: 102398
- config_name: subset_1893
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 90916
num_examples: 488
download_size: 25213
dataset_size: 90916
- config_name: subset_1894
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 109013
num_examples: 535
download_size: 29815
dataset_size: 109013
- config_name: subset_1895
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 133217
num_examples: 612
download_size: 40287
dataset_size: 133217
- config_name: subset_1896
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 235176
num_examples: 1262
download_size: 40315
dataset_size: 235176
- config_name: subset_1897
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 143461
num_examples: 752
download_size: 36350
dataset_size: 143461
- config_name: subset_1898
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 72538
num_examples: 411
download_size: 24338
dataset_size: 72538
- config_name: subset_1899
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 200856
num_examples: 856
download_size: 54377
dataset_size: 200856
- config_name: subset_1900
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 183554
num_examples: 812
download_size: 55269
dataset_size: 183554
- config_name: subset_1901
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 209093
num_examples: 952
download_size: 50910
dataset_size: 209093
- config_name: subset_1902
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 324905
num_examples: 1029
download_size: 102728
dataset_size: 324905
- config_name: subset_1903
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 410311
num_examples: 1362
download_size: 107352
dataset_size: 410311
- config_name: subset_1904
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 402340
num_examples: 1355
download_size: 100311
dataset_size: 402340
- config_name: subset_1905
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 364887
num_examples: 1345
download_size: 94329
dataset_size: 364887
- config_name: subset_1906
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 512303
num_examples: 1588
download_size: 121881
dataset_size: 512303
- config_name: subset_1907
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 517162
num_examples: 1524
download_size: 126860
dataset_size: 517162
- config_name: subset_1908
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 552090
num_examples: 1530
download_size: 157456
dataset_size: 552090
- config_name: subset_1909
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 562627
num_examples: 1875
download_size: 143952
dataset_size: 562627
- config_name: subset_1910
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 507162
num_examples: 1679
download_size: 136302
dataset_size: 507162
- config_name: subset_1911
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 351182
num_examples: 1455
download_size: 91096
dataset_size: 351182
- config_name: subset_1912
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 464765
num_examples: 1757
download_size: 116329
dataset_size: 464765
- config_name: subset_1913
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 473570
num_examples: 1896
download_size: 112656
dataset_size: 473570
- config_name: subset_1914
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 385095
num_examples: 1714
download_size: 90016
dataset_size: 385095
- config_name: subset_1915
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 425966
num_examples: 1895
download_size: 96006
dataset_size: 425966
- config_name: subset_1916
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 426814
num_examples: 1983
download_size: 91024
dataset_size: 426814
- config_name: subset_1917
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 480079
num_examples: 2112
download_size: 105839
dataset_size: 480079
- config_name: subset_1918
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 549481
num_examples: 2109
download_size: 120434
dataset_size: 549481
- config_name: subset_1919
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 496185
num_examples: 2304
download_size: 90112
dataset_size: 496185
- config_name: subset_1920
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 454606
num_examples: 2075
download_size: 85682
dataset_size: 454606
- config_name: subset_1921
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 477793
num_examples: 2090
download_size: 96782
dataset_size: 477793
- config_name: subset_1922
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 609690
num_examples: 2445
download_size: 127739
dataset_size: 609690
- config_name: subset_1923
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 788414
num_examples: 2628
download_size: 162534
dataset_size: 788414
- config_name: subset_1924
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 591168
num_examples: 2312
download_size: 126117
dataset_size: 591168
- config_name: subset_1925
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 579246
num_examples: 2265
download_size: 119488
dataset_size: 579246
- config_name: subset_1926
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 762313
num_examples: 2721
download_size: 163855
dataset_size: 762313
- config_name: subset_1927
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 726377
num_examples: 2603
download_size: 146041
dataset_size: 726377
- config_name: subset_1928
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 766769
num_examples: 2668
download_size: 154307
dataset_size: 766769
- config_name: subset_1929
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 777029
num_examples: 2713
download_size: 148729
dataset_size: 777029
- config_name: subset_1930
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 687106
num_examples: 2389
download_size: 136124
dataset_size: 687106
- config_name: subset_1931
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 759158
num_examples: 2677
download_size: 140340
dataset_size: 759158
- config_name: subset_1932
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 777098
num_examples: 2613
download_size: 145183
dataset_size: 777098
- config_name: subset_1933
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 689283
num_examples: 2311
download_size: 131107
dataset_size: 689283
- config_name: subset_1934
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 602079
num_examples: 1958
download_size: 124224
dataset_size: 602079
- config_name: subset_1935
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 546643
num_examples: 1754
download_size: 115395
dataset_size: 546643
- config_name: subset_1936
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 545410
num_examples: 1695
download_size: 117835
dataset_size: 545410
- config_name: subset_1937
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 540084
num_examples: 1703
download_size: 113180
dataset_size: 540084
- config_name: subset_1938
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 632906
num_examples: 2074
download_size: 139303
dataset_size: 632906
- config_name: subset_1939
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 679568
num_examples: 2168
download_size: 146973
dataset_size: 679568
- config_name: subset_1940
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 549770
num_examples: 1723
download_size: 119669
dataset_size: 549770
- config_name: subset_1941
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 793410
num_examples: 2516
download_size: 167757
dataset_size: 793410
- config_name: subset_1942
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 632887
num_examples: 2020
download_size: 137020
dataset_size: 632887
- config_name: subset_1943
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 675841
num_examples: 2150
download_size: 145797
dataset_size: 675841
- config_name: subset_1944
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 604697
num_examples: 1786
download_size: 122802
dataset_size: 604697
- config_name: subset_1945
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 516487
num_examples: 1585
download_size: 105871
dataset_size: 516487
- config_name: subset_1946
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 578206
num_examples: 1745
download_size: 118754
dataset_size: 578206
- config_name: subset_1947
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 797086
num_examples: 2405
download_size: 161749
dataset_size: 797086
- config_name: subset_1948
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1651439
num_examples: 5128
download_size: 283370
dataset_size: 1651439
- config_name: subset_1949
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1317543
num_examples: 3971
download_size: 253970
dataset_size: 1317543
- config_name: subset_1950
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1398930
num_examples: 4364
download_size: 270848
dataset_size: 1398930
- config_name: subset_1951
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1382228
num_examples: 4333
download_size: 264855
dataset_size: 1382228
- config_name: subset_1952
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1208085
num_examples: 3775
download_size: 236305
dataset_size: 1208085
- config_name: subset_1953
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1965820
num_examples: 6612
download_size: 341087
dataset_size: 1965820
- config_name: subset_1954
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1707445
num_examples: 5735
download_size: 290583
dataset_size: 1707445
- config_name: subset_1955
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1663640
num_examples: 5502
download_size: 289573
dataset_size: 1663640
- config_name: subset_1956
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2085548
num_examples: 6858
download_size: 357862
dataset_size: 2085548
- config_name: subset_1957
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2032705
num_examples: 6438
download_size: 340481
dataset_size: 2032705
- config_name: subset_1958
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1741448
num_examples: 5411
download_size: 294439
dataset_size: 1741448
- config_name: subset_1959
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2224867
num_examples: 6362
download_size: 324903
dataset_size: 2224867
- config_name: subset_1960
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1867181
num_examples: 5612
download_size: 295113
dataset_size: 1867181
- config_name: subset_1961
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2033098
num_examples: 5746
download_size: 307454
dataset_size: 2033098
- config_name: subset_1962
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2267474
num_examples: 6380
download_size: 335049
dataset_size: 2267474
- config_name: subset_1963
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2554831
num_examples: 7296
download_size: 366084
dataset_size: 2554831
- config_name: subset_1964
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1036564
num_examples: 2467
download_size: 189301
dataset_size: 1036564
- config_name: subset_1965
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1154704
num_examples: 2686
download_size: 208741
dataset_size: 1154704
- config_name: subset_1966
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1271265
num_examples: 2955
download_size: 230876
dataset_size: 1271265
- config_name: subset_1967
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 1476882
num_examples: 3613
download_size: 269897
dataset_size: 1476882
- config_name: subset_1968
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3012592
num_examples: 9399
download_size: 468049
dataset_size: 3012592
- config_name: subset_1969
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4070251
num_examples: 13363
download_size: 572374
dataset_size: 4070251
- config_name: subset_1970
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4202209
num_examples: 13729
download_size: 591266
dataset_size: 4202209
- config_name: subset_1971
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3170103
num_examples: 9846
download_size: 477912
dataset_size: 3170103
- config_name: subset_1972
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2050856
num_examples: 4941
download_size: 371366
dataset_size: 2050856
- config_name: subset_1973
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3541335
num_examples: 10270
download_size: 553169
dataset_size: 3541335
- config_name: subset_1974
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2551251
num_examples: 5321
download_size: 444621
dataset_size: 2551251
- config_name: subset_1975
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2231811
num_examples: 4871
download_size: 393133
dataset_size: 2231811
- config_name: subset_1976
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 10572256
num_examples: 34924
download_size: 1344469
dataset_size: 10572256
- config_name: subset_1977
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 7831947
num_examples: 24897
download_size: 999551
dataset_size: 7831947
- config_name: subset_1978
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 9260121
num_examples: 30375
download_size: 1156082
dataset_size: 9260121
- config_name: subset_1979
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 8481149
num_examples: 26862
download_size: 1100808
dataset_size: 8481149
- config_name: subset_1980
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2412078
num_examples: 5115
download_size: 423082
dataset_size: 2412078
- config_name: subset_1981
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2724043
num_examples: 5615
download_size: 507198
dataset_size: 2724043
- config_name: subset_1982
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2592421
num_examples: 5541
download_size: 466262
dataset_size: 2592421
- config_name: subset_1983
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2442938
num_examples: 4933
download_size: 433519
dataset_size: 2442938
- config_name: subset_1984
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2548665
num_examples: 5101
download_size: 449883
dataset_size: 2548665
- config_name: subset_1985
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 2968620
num_examples: 6247
download_size: 567367
dataset_size: 2968620
- config_name: subset_1986
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3264866
num_examples: 7076
download_size: 596752
dataset_size: 3264866
- config_name: subset_1987
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3624486
num_examples: 7722
download_size: 677504
dataset_size: 3624486
- config_name: subset_1988
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3757236
num_examples: 8741
download_size: 686791
dataset_size: 3757236
- config_name: subset_1989
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3549160
num_examples: 7961
download_size: 641539
dataset_size: 3549160
- config_name: subset_1990
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4080286
num_examples: 9735
download_size: 705163
dataset_size: 4080286
- config_name: subset_1991
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 5007383
num_examples: 11599
download_size: 897578
dataset_size: 5007383
- config_name: subset_1992
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4717907
num_examples: 10565
download_size: 1091000
dataset_size: 4717907
- config_name: subset_1993
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4176604
num_examples: 9274
download_size: 938150
dataset_size: 4176604
- config_name: subset_1994
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4595367
num_examples: 10177
download_size: 1027817
dataset_size: 4595367
- config_name: subset_1995
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4233938
num_examples: 9417
download_size: 961467
dataset_size: 4233938
- config_name: subset_1996
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 8295192
num_examples: 18274
download_size: 1360513
dataset_size: 8295192
- config_name: subset_1997
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3502120
num_examples: 7432
download_size: 692666
dataset_size: 3502120
- config_name: subset_1998
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 7461269
num_examples: 16365
download_size: 1420879
dataset_size: 7461269
- config_name: subset_1999
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 8490147
num_examples: 18473
download_size: 1482647
dataset_size: 8490147
- config_name: subset_2000
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4906475
num_examples: 10537
download_size: 1109040
dataset_size: 4906475
- config_name: subset_2001
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 4728867
num_examples: 10097
download_size: 1064204
dataset_size: 4728867
- config_name: subset_2002
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 3238608
num_examples: 7163
download_size: 723074
dataset_size: 3238608
- config_name: subset_2003
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 8077505
num_examples: 18319
download_size: 1589318
dataset_size: 8077505
- config_name: subset_2004
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 9928020
num_examples: 21210
download_size: 1976352
dataset_size: 9928020
- config_name: subset_2005
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 9604924
num_examples: 20182
download_size: 1933897
dataset_size: 9604924
- config_name: subset_2006
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 12119194
num_examples: 25403
download_size: 2350890
dataset_size: 12119194
- config_name: subset_2007
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 13855094
num_examples: 28325
download_size: 2646104
dataset_size: 13855094
- config_name: subset_2008
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 13073542
num_examples: 25972
download_size: 2488286
dataset_size: 13073542
- config_name: subset_2009
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 21295325
num_examples: 42172
download_size: 3935718
dataset_size: 21295325
- config_name: subset_2010
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 24862822
num_examples: 47593
download_size: 4625114
dataset_size: 24862822
- config_name: subset_2011
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 24543887
num_examples: 46663
download_size: 4541784
dataset_size: 24543887
- config_name: subset_2012
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 22003412
num_examples: 42031
download_size: 4148097
dataset_size: 22003412
- config_name: subset_2013
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 19215182
num_examples: 37150
download_size: 3617787
dataset_size: 19215182
- config_name: subset_2014
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 17804686
num_examples: 34392
download_size: 3317788
dataset_size: 17804686
- config_name: subset_2015
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 16031162
num_examples: 31451
download_size: 3079481
dataset_size: 16031162
- config_name: subset_2016
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 16192602
num_examples: 31485
download_size: 3050143
dataset_size: 16192602
- config_name: subset_2017
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 15634436
num_examples: 30502
download_size: 2968209
dataset_size: 15634436
- config_name: subset_2018
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 13053086
num_examples: 26039
download_size: 2627728
dataset_size: 13053086
- config_name: subset_2019
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 13326104
num_examples: 27188
download_size: 2753786
dataset_size: 13326104
- config_name: subset_2020
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 15110444
num_examples: 30471
download_size: 3069934
dataset_size: 15110444
- config_name: subset_2021
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 15679744
num_examples: 31356
download_size: 2985298
dataset_size: 15679744
- config_name: subset_2022
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 15808618
num_examples: 31663
download_size: 3129519
dataset_size: 15808618
- config_name: subset_2023
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 17383356
num_examples: 35303
download_size: 3405665
dataset_size: 17383356
- config_name: subset_2024
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
splits:
- name: train
num_bytes: 18636934
num_examples: 38864
download_size: 3760771
dataset_size: 18636934
- config_name: subset_2025
features:
- name: 'no'
dtype: string
- name: doctitle
dtype: string
- name: bookNo
dtype: string
- name: section
dtype: string
- name: category
dtype: string
- name: publishDate
dtype: string
- name: pageNo
dtype: string
- name: id
dtype: string
- name: pdf_file
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 73817047
num_examples: 48833
download_size: 15604903
dataset_size: 73817047
configs:
- config_name: subset_1885
data_files:
- split: train
path: subset_1885/train-*
- config_name: subset_1886
data_files:
- split: train
path: subset_1886/train-*
- config_name: subset_1887
data_files:
- split: train
path: subset_1887/train-*
- config_name: subset_1888
data_files:
- split: train
path: subset_1888/train-*
- config_name: subset_1889
data_files:
- split: train
path: subset_1889/train-*
- config_name: subset_1890
data_files:
- split: train
path: subset_1890/train-*
- config_name: subset_1891
data_files:
- split: train
path: subset_1891/train-*
- config_name: subset_1892
data_files:
- split: train
path: subset_1892/train-*
- config_name: subset_1893
data_files:
- split: train
path: subset_1893/train-*
- config_name: subset_1894
data_files:
- split: train
path: subset_1894/train-*
- config_name: subset_1895
data_files:
- split: train
path: subset_1895/train-*
- config_name: subset_1896
data_files:
- split: train
path: subset_1896/train-*
- config_name: subset_1897
data_files:
- split: train
path: subset_1897/train-*
- config_name: subset_1898
data_files:
- split: train
path: subset_1898/train-*
- config_name: subset_1899
data_files:
- split: train
path: subset_1899/train-*
- config_name: subset_1900
data_files:
- split: train
path: subset_1900/train-*
- config_name: subset_1901
data_files:
- split: train
path: subset_1901/train-*
- config_name: subset_1902
data_files:
- split: train
path: subset_1902/train-*
- config_name: subset_1903
data_files:
- split: train
path: subset_1903/train-*
- config_name: subset_1904
data_files:
- split: train
path: subset_1904/train-*
- config_name: subset_1905
data_files:
- split: train
path: subset_1905/train-*
- config_name: subset_1906
data_files:
- split: train
path: subset_1906/train-*
- config_name: subset_1907
data_files:
- split: train
path: subset_1907/train-*
- config_name: subset_1908
data_files:
- split: train
path: subset_1908/train-*
- config_name: subset_1909
data_files:
- split: train
path: subset_1909/train-*
- config_name: subset_1910
data_files:
- split: train
path: subset_1910/train-*
- config_name: subset_1911
data_files:
- split: train
path: subset_1911/train-*
- config_name: subset_1912
data_files:
- split: train
path: subset_1912/train-*
- config_name: subset_1913
data_files:
- split: train
path: subset_1913/train-*
- config_name: subset_1914
data_files:
- split: train
path: subset_1914/train-*
- config_name: subset_1915
data_files:
- split: train
path: subset_1915/train-*
- config_name: subset_1916
data_files:
- split: train
path: subset_1916/train-*
- config_name: subset_1917
data_files:
- split: train
path: subset_1917/train-*
- config_name: subset_1918
data_files:
- split: train
path: subset_1918/train-*
- config_name: subset_1919
data_files:
- split: train
path: subset_1919/train-*
- config_name: subset_1920
data_files:
- split: train
path: subset_1920/train-*
- config_name: subset_1921
data_files:
- split: train
path: subset_1921/train-*
- config_name: subset_1922
data_files:
- split: train
path: subset_1922/train-*
- config_name: subset_1923
data_files:
- split: train
path: subset_1923/train-*
- config_name: subset_1924
data_files:
- split: train
path: subset_1924/train-*
- config_name: subset_1925
data_files:
- split: train
path: subset_1925/train-*
- config_name: subset_1926
data_files:
- split: train
path: subset_1926/train-*
- config_name: subset_1927
data_files:
- split: train
path: subset_1927/train-*
- config_name: subset_1928
data_files:
- split: train
path: subset_1928/train-*
- config_name: subset_1929
data_files:
- split: train
path: subset_1929/train-*
- config_name: subset_1930
data_files:
- split: train
path: subset_1930/train-*
- config_name: subset_1931
data_files:
- split: train
path: subset_1931/train-*
- config_name: subset_1932
data_files:
- split: train
path: subset_1932/train-*
- config_name: subset_1933
data_files:
- split: train
path: subset_1933/train-*
- config_name: subset_1934
data_files:
- split: train
path: subset_1934/train-*
- config_name: subset_1935
data_files:
- split: train
path: subset_1935/train-*
- config_name: subset_1936
data_files:
- split: train
path: subset_1936/train-*
- config_name: subset_1937
data_files:
- split: train
path: subset_1937/train-*
- config_name: subset_1938
data_files:
- split: train
path: subset_1938/train-*
- config_name: subset_1939
data_files:
- split: train
path: subset_1939/train-*
- config_name: subset_1940
data_files:
- split: train
path: subset_1940/train-*
- config_name: subset_1941
data_files:
- split: train
path: subset_1941/train-*
- config_name: subset_1942
data_files:
- split: train
path: subset_1942/train-*
- config_name: subset_1943
data_files:
- split: train
path: subset_1943/train-*
- config_name: subset_1944
data_files:
- split: train
path: subset_1944/train-*
- config_name: subset_1945
data_files:
- split: train
path: subset_1945/train-*
- config_name: subset_1946
data_files:
- split: train
path: subset_1946/train-*
- config_name: subset_1947
data_files:
- split: train
path: subset_1947/train-*
- config_name: subset_1948
data_files:
- split: train
path: subset_1948/train-*
- config_name: subset_1949
data_files:
- split: train
path: subset_1949/train-*
- config_name: subset_1950
data_files:
- split: train
path: subset_1950/train-*
- config_name: subset_1951
data_files:
- split: train
path: subset_1951/train-*
- config_name: subset_1952
data_files:
- split: train
path: subset_1952/train-*
- config_name: subset_1953
data_files:
- split: train
path: subset_1953/train-*
- config_name: subset_1954
data_files:
- split: train
path: subset_1954/train-*
- config_name: subset_1955
data_files:
- split: train
path: subset_1955/train-*
- config_name: subset_1956
data_files:
- split: train
path: subset_1956/train-*
- config_name: subset_1957
data_files:
- split: train
path: subset_1957/train-*
- config_name: subset_1958
data_files:
- split: train
path: subset_1958/train-*
- config_name: subset_1959
data_files:
- split: train
path: subset_1959/train-*
- config_name: subset_1960
data_files:
- split: train
path: subset_1960/train-*
- config_name: subset_1961
data_files:
- split: train
path: subset_1961/train-*
- config_name: subset_1962
data_files:
- split: train
path: subset_1962/train-*
- config_name: subset_1963
data_files:
- split: train
path: subset_1963/train-*
- config_name: subset_1964
data_files:
- split: train
path: subset_1964/train-*
- config_name: subset_1965
data_files:
- split: train
path: subset_1965/train-*
- config_name: subset_1966
data_files:
- split: train
path: subset_1966/train-*
- config_name: subset_1967
data_files:
- split: train
path: subset_1967/train-*
- config_name: subset_1968
data_files:
- split: train
path: subset_1968/train-*
- config_name: subset_1969
data_files:
- split: train
path: subset_1969/train-*
- config_name: subset_1970
data_files:
- split: train
path: subset_1970/train-*
- config_name: subset_1971
data_files:
- split: train
path: subset_1971/train-*
- config_name: subset_1972
data_files:
- split: train
path: subset_1972/train-*
- config_name: subset_1973
data_files:
- split: train
path: subset_1973/train-*
- config_name: subset_1974
data_files:
- split: train
path: subset_1974/train-*
- config_name: subset_1975
data_files:
- split: train
path: subset_1975/train-*
- config_name: subset_1976
data_files:
- split: train
path: subset_1976/train-*
- config_name: subset_1977
data_files:
- split: train
path: subset_1977/train-*
- config_name: subset_1978
data_files:
- split: train
path: subset_1978/train-*
- config_name: subset_1979
data_files:
- split: train
path: subset_1979/train-*
- config_name: subset_1980
data_files:
- split: train
path: subset_1980/train-*
- config_name: subset_1981
data_files:
- split: train
path: subset_1981/train-*
- config_name: subset_1982
data_files:
- split: train
path: subset_1982/train-*
- config_name: subset_1983
data_files:
- split: train
path: subset_1983/train-*
- config_name: subset_1984
data_files:
- split: train
path: subset_1984/train-*
- config_name: subset_1985
data_files:
- split: train
path: subset_1985/train-*
- config_name: subset_1986
data_files:
- split: train
path: subset_1986/train-*
- config_name: subset_1987
data_files:
- split: train
path: subset_1987/train-*
- config_name: subset_1988
data_files:
- split: train
path: subset_1988/train-*
- config_name: subset_1989
data_files:
- split: train
path: subset_1989/train-*
- config_name: subset_1990
data_files:
- split: train
path: subset_1990/train-*
- config_name: subset_1991
data_files:
- split: train
path: subset_1991/train-*
- config_name: subset_1992
data_files:
- split: train
path: subset_1992/train-*
- config_name: subset_1993
data_files:
- split: train
path: subset_1993/train-*
- config_name: subset_1994
data_files:
- split: train
path: subset_1994/train-*
- config_name: subset_1995
data_files:
- split: train
path: subset_1995/train-*
- config_name: subset_1996
data_files:
- split: train
path: subset_1996/train-*
- config_name: subset_1997
data_files:
- split: train
path: subset_1997/train-*
- config_name: subset_1998
data_files:
- split: train
path: subset_1998/train-*
- config_name: subset_1999
data_files:
- split: train
path: subset_1999/train-*
- config_name: subset_2000
data_files:
- split: train
path: subset_2000/train-*
- config_name: subset_2001
data_files:
- split: train
path: subset_2001/train-*
- config_name: subset_2002
data_files:
- split: train
path: subset_2002/train-*
- config_name: subset_2003
data_files:
- split: train
path: subset_2003/train-*
- config_name: subset_2004
data_files:
- split: train
path: subset_2004/train-*
- config_name: subset_2005
data_files:
- split: train
path: subset_2005/train-*
- config_name: subset_2006
data_files:
- split: train
path: subset_2006/train-*
- config_name: subset_2007
data_files:
- split: train
path: subset_2007/train-*
- config_name: subset_2008
data_files:
- split: train
path: subset_2008/train-*
- config_name: subset_2009
data_files:
- split: train
path: subset_2009/train-*
- config_name: subset_2010
data_files:
- split: train
path: subset_2010/train-*
- config_name: subset_2011
data_files:
- split: train
path: subset_2011/train-*
- config_name: subset_2012
data_files:
- split: train
path: subset_2012/train-*
- config_name: subset_2013
data_files:
- split: train
path: subset_2013/train-*
- config_name: subset_2014
data_files:
- split: train
path: subset_2014/train-*
- config_name: subset_2015
data_files:
- split: train
path: subset_2015/train-*
- config_name: subset_2016
data_files:
- split: train
path: subset_2016/train-*
- config_name: subset_2017
data_files:
- split: train
path: subset_2017/train-*
- config_name: subset_2018
data_files:
- split: train
path: subset_2018/train-*
- config_name: subset_2019
data_files:
- split: train
path: subset_2019/train-*
- config_name: subset_2020
data_files:
- split: train
path: subset_2020/train-*
- config_name: subset_2021
data_files:
- split: train
path: subset_2021/train-*
- config_name: subset_2022
data_files:
- split: train
path: subset_2022/train-*
- config_name: subset_2023
data_files:
- split: train
path: subset_2023/train-*
- config_name: subset_2024
data_files:
- split: train
path: subset_2024/train-*
- config_name: subset_2025
data_files:
- split: train
path: subset_2025/train-*
---
# Royal Gazette Thailand (Ratchakitcha) Dataset
**Royal Gazette Dataset (Machine-Readable)**
The **Open Law Data Thailand** project, together with the Senate Committee on Commerce and Industry, received data courtesy of **the Secretariat of the Cabinet (SOC)** in order to publish Thai legal data to the public in a machine-readable format and to promote Legal Tech and AI innovation in Thailand.
## Dataset Description
This dataset compiles the notices published in the Royal Gazette, including the title, book (volume), section, publication date, and a link to the original PDF. It is suited to RAG (Retrieval-Augmented Generation), legal search, and public-sector data analysis.
- **Source:** The Secretariat of the Cabinet (สำนักเลขาธิการคณะรัฐมนตรี)
- **Official Collaboration Reference:** Most Urgent letter no. นร ๐๕๐๓/๘๗๓๙ (29 July B.E. 2568 / 2025)
- **Homepage:** [Open Law Data Thailand](https://www.openlawdatathailand.org/)
- **Original:** [Open Law Data Thailand](https://huggingface.co/datasets/open-law-data-thailand/soc-ratchakitcha)
## Usage Instruction
You can download the data in several forms through the Hugging Face `datasets` library by passing a config name in the `name` parameter.
### 1. For AI / NLP work (recommended) ⭐
If you need text for training models or building a search engine, you can load the data by **decade** (decade subsets), which provides both the OCR files and the metadata together.
```python
from datasets import load_dataset

# Example: load the data for 2025 (B.E. 2568)
# This yields both the text (OCR) and the metadata
ds = load_dataset("JonusNattapong/Ratchakitcha", name="subset_2025")
print(ds['train'][0])
```
**Supported subsets:**
* `subset_2025`, `subset_2020s` (current)
* `subset_2010s`, `subset_2000s`, `subset_1990s`, ... down to `subset_1960s`
* `subset_pre_1960` (historical data before 1960 / B.E. 2503)
### 2. For data analysis (metadata only)
Use this option to analyze statistics, such as the number of laws per year, or to search titles without loading the text content.
```python
from datasets import load_dataset

# Load all metadata only (small files, fast to load)
ds_meta = load_dataset("JonusNattapong/Ratchakitcha", name="meta")
```
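As a rough sketch of the per-year statistics mentioned above, the snippet below counts records by year. It assumes that `publishDate` is a string beginning with a four-digit year, which this card does not confirm. The helper `count_by_year` and the sample rows are hypothetical and only stand in for rows from the `meta` config.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_year(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count records per year, using the first four characters of publishDate.

    Assumption: publishDate starts with a four-digit year. Rows with a
    missing or differently shaped date are grouped under "unknown".
    """
    counts: Counter = Counter()
    for row in rows:
        date = row.get("publishDate") or ""
        year = date[:4]
        key = year if len(year) == 4 and year.isdigit() else "unknown"
        counts[key] += 1
    return counts


# Hypothetical rows for illustration only; real rows would come from ds_meta["train"].
sample_rows = [
    {"publishDate": "2025-01-15"},
    {"publishDate": "2025-03-02"},
    {"publishDate": "2024-12-30"},
    {"publishDate": ""},
]
print(count_by_year(sample_rows))
```

With the real data, passing `ds_meta["train"]` instead of `sample_rows` should work the same way, provided the date format matches the assumption; inspect a few rows first to check.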
### 3. Selecting a specific year
To load a single year, you can load the decade subset and then filter on `publishDate`.
```python
from datasets import load_dataset

year = "2025"  # the year to select (e.g. "2025")

# Pick the subset for the year's decade
decade = str(int(year) // 10 * 10) + 's'
subset_name = f'subset_{decade}'
ds = load_dataset("JonusNattapong/Ratchakitcha", name=subset_name)

# Keep only rows whose publishDate starts with the chosen year
ds_filtered = ds.filter(lambda x: x['publishDate'].startswith(year))
print(ds_filtered['train'][0])
```
**Note:** If the selected year is not covered by a decade subset (for example, a year before 1960), use `subset_pre_1960` and adjust the filter condition accordingly.
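The decade rule and the pre-1960 fallback can be folded into one small helper. This is a minimal sketch: `subset_for_year` is a hypothetical name, it takes a Gregorian year, and it only builds the config name without checking that the config actually exists in the repository.

```python
def subset_for_year(year: int) -> str:
    """Return the decade config name for a Gregorian year.

    Years before 1960 map to "subset_pre_1960"; later years map to
    "subset_<decade>s". No check is made that the config exists.
    """
    if year < 1960:
        return "subset_pre_1960"
    decade = year // 10 * 10  # e.g. 2025 -> 2020
    return f"subset_{decade}s"


print(subset_for_year(2025))
print(subset_for_year(1955))
```

The returned string can then be passed as `name=` to `load_dataset`, followed by the same `publishDate` filter as in the example above.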
## Data Fields
| Field Name | Description |
| :--- | :--- |
| `no` | Document sequence number / ID |
| `doctitle` | Title or topic of the document |
| `bookNo` | Book (volume) number of the Royal Gazette |
| `section` | Section number of the Royal Gazette |
| `category` | Category (e.g., ก, ข, ง, i.e. A, B, D) |
| `publishDate` | Date of publication in the Royal Gazette |
| `pageNo` | Page number |
| `pdf_file` | Filename of the source PDF |
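To make the record shape concrete, here is a small sketch that types the fields from the table and builds a short human-readable reference from one record. The string types follow the features declared in the card's front matter; `GazetteRecord`, `describe`, and the sample values are hypothetical and do not come from the dataset.

```python
from typing import TypedDict


class GazetteRecord(TypedDict):
    """Fields from the table above, all typed as strings."""
    no: str
    doctitle: str
    bookNo: str
    section: str
    category: str
    publishDate: str
    pageNo: str
    pdf_file: str


def describe(record: GazetteRecord) -> str:
    """Build a one-line reference string from a record's fields."""
    return (
        f"{record['doctitle']} (book {record['bookNo']}, "
        f"section {record['section']} {record['category']}, "
        f"page {record['pageNo']}, published {record['publishDate']})"
    )


# Made-up record for illustration only; values are not taken from the dataset.
example: GazetteRecord = {
    "no": "1",
    "doctitle": "Example notice",
    "bookNo": "142",
    "section": "1",
    "category": "ก",
    "publishDate": "2025-01-01",
    "pageNo": "1",
    "pdf_file": "example.pdf",
}
print(describe(example))
```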
## Legal & License
This data is provided with the support of **the Secretariat of the Cabinet**, per its reply to a consultation request, letter "Most Urgent, no. นร ๐๕๐๓/๘๗๓๙" dated 29 July B.E. 2568 (2025), for the public benefit and the development of artificial intelligence (AI) technology.
**Disclaimer:** This data is prepared solely for convenience of access and analysis. For official legal citation, verify against the original PDF directly from [ratchakitcha.soc.go.th](https://ratchakitcha.soc.go.th/).
## Contact
- **Project:** Open Law Data Thailand
- **Website:** https://www.openlawdatathailand.org/
Provider: JonusNattapong



