
kilian-group/arxiv-classifier

Source: Hugging Face. Updated 2024-09-17. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/kilian-group/arxiv-classifier
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "minor/train.json"
  - split: test
    path: "minor/test.json"
- config_name: major
  data_files:
  - split: train
    path: "major/train.json"
  - split: test
    path: "major/test.json"
- config_name: all2023
  data_files:
  - split: val
    path: "all2023/val.json"
- config_name: all2023_v2
  data_files:
  - split: train
    path: "all2023_v2/train.json"
  - split: test
    path: "all2023_v2/test.json"
---

# arXiv Classifier Data

Usage:

```
from datasets import load_dataset, DownloadMode

# download from HuggingFace
dataset = load_dataset('mlcore/arxiv-classifier', name=<CONFIG NAME>)

# load from G2
dataset = load_dataset('/share/nikola/arxiv_classifier/data/arxiv-classifier', name=<CONFIG NAME>)
```

To force the dataset to be re-generated:

```
dataset = load_dataset('/share/nikola/arxiv_classifier/data/arxiv-classifier', name=<CONFIG NAME>, download_mode=DownloadMode.FORCE_REDOWNLOAD)
```

See: https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/builder_classes#datasets.DownloadMode

Standardized terminology:
- Field: Bio/cs/physics
- Subfield: Subcategories within each
- Primary subfield (bio., cs.LG): Given primary subfield, you can infer the field
- Secondary subfields: Includes primary subfield, but also includes any subfields that were tagged in the paper (1-5)

Old terminology to standardized terminology translation:
- Prime category = Primary subfield
- Abstract category = secondary subfield
- Major category = field

Original data: https://www.dropbox.com/scl/fo/wwu0ifghw4sco09g67frb/h?rlkey=6ddg3yab9la3zeddvmnsfktxq&e=1&dl=0

**Minor (default)**: Dataset of papers between 2010 and 2020 (with some pre-2010 papers) with balanced primary subfields.

**Major**: Dataset of papers between 2010 and 2020 (with some pre-2010 papers) with unbalanced primary subfields to better represent the true distribution of primary categories, which is dominated by a few subfields. Note that the distribution of major subfields is still truncated.

**All 2023**: All papers published in 2023.
- Train: papers with date between January and June (inclusive)
- Test: papers with date between July and December (inclusive)

![Primary subfield distribution](figures/Primary_subfield_distribution.png)

![Secondary subfield distribution](figures/Secondary_subfield_distribution.png)

To generate subfield distribution plots:

```
python plot_subfield_distributions.py --output_path <path_to_save_plots>
```

## Set up dependencies

Get subfields to ignore and subfield aliases:

```
# in your conda env
git clone https://github.com/ag2435/arxiv-classifier-next
cd arxiv-classifier-next
conda develop .
```

## Preprocessing

Transform raw data into JSON format:

```
# major/minor cats data
python preprocess_major_minor.py -d <DATASET NAME> -s <SPLIT> -op <PATH TO SAVE PREPROCESSED DATA>

# all 2023 corpus (full text)
python preprocess_all2023_v2.py
```

Additional data checks:

```
python test_paper_id.py
```

:white_check_mark: Checked that there is no data leakage between train and test splits for each dataset config

:white_check_mark: Checked validity of arXiv identifiers for each paper

<!--
[4782 Kaggle](https://www.kaggle.com/competitions/cs-4782-2024/overview) code: [4782_preprocess_data.ipynb](4782_preprocess_data.ipynb)

[Internal Kaggle](https://www.kaggle.com/competitions/ar-xive-project-baselines/overview) code: [clean_data_for_hugging_face.ipynb](https://huggingface.co/datasets/mlcore/arxiv-classifier/blob/main/clean_data_for_hugging_face.ipynb)
-->

<!-- ## Todo
- Can we incorporate additional metadata (e.g., authors)? Note: author names are contained in the full text-->
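
The two data checks listed under Preprocessing (no train/test leakage, valid arXiv identifiers) could look roughly like the sketch below. This is not the contents of `test_paper_id.py`, which is not shown here. The record key `paper_id`, the helper names, and the two regular expressions are assumptions chosen for illustration; the patterns follow the publicly documented arXiv identifier schemes (new style `YYMM.NNNN` or `YYMM.NNNNN`, old style `archive/YYMMNNN`, each with an optional version suffix).

```python
import re

# Assumed patterns for the two arXiv identifier schemes (illustrative only).
# New style: YYMM.NNNN or YYMM.NNNNN, optional version, e.g. 2301.01234v2
NEW_STYLE = re.compile(r"^\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?$")
# Old style: archive[.SC]/YYMMNNN, optional version, e.g. hep-th/9901001
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def is_valid_arxiv_id(paper_id: str) -> bool:
    """Return True if the string matches either assumed identifier pattern."""
    return bool(NEW_STYLE.match(paper_id) or OLD_STYLE.match(paper_id))


def find_leakage(train, test, key="paper_id"):
    """Return the set of identifiers that appear in both splits.

    `key` is a hypothetical field name; adjust it to the actual JSON schema.
    """
    train_ids = {row[key] for row in train}
    test_ids = {row[key] for row in test}
    return train_ids & test_ids


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the train/test JSON records.
    train = [{"paper_id": "2301.00001"}, {"paper_id": "2302.00002"}]
    test = [{"paper_id": "2307.00003"}, {"paper_id": "2302.00002"}]

    print("leaked ids:", find_leakage(train, test))
    print("invalid ids:", [r["paper_id"] for r in train + test
                           if not is_valid_arxiv_id(r["paper_id"])])
```

An empty result from `find_leakage` for every config corresponds to the first check mark above; an empty list of invalid identifiers corresponds to the second.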
Provided by:
kilian-group