kilian-group/arxiv-classifier

Name: kilian-group/arxiv-classifier
Creator: kilian-group
Published: 2024-09-17 23:09:17
License: No description available

Source: Hugging Face. Updated 2024-09-17; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/kilian-group/arxiv-classifier

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: "minor/train.json"
  - split: test
    path: "minor/test.json"
- config_name: major
  data_files:
  - split: train
    path: "major/train.json"
  - split: test
    path: "major/test.json"
- config_name: all2023
  data_files:
  - split: val
    path: "all2023/val.json"
- config_name: all2023_v2
  data_files:
  - split: train
    path: "all2023_v2/train.json"
  - split: test
    path: "all2023_v2/test.json"
---

# arXiv Classifier Data

Usage:

```
from datasets import load_dataset, DownloadMode

# download from HuggingFace
dataset = load_dataset('mlcore/arxiv-classifier', name=<CONFIG NAME>)

# load from G2
dataset = load_dataset('/share/nikola/arxiv_classifier/data/arxiv-classifier', name=<CONFIG NAME>)
```

To force the dataset to be re-generated:

```
dataset = load_dataset('/share/nikola/arxiv_classifier/data/arxiv-classifier', name=<CONFIG NAME>, download_mode=DownloadMode.FORCE_REDOWNLOAD)
```

See: https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/builder_classes#datasets.DownloadMode

Standardized terminology:

- Field: Bio/cs/physics
- Subfield: Subcategories within each
- Primary subfield (bio., cs.LG): Given primary subfield, you can infer the field
- Secondary subfields: Includes primary subfield, but also includes any subfields that were tagged in the paper (1-5)

Old terminology to standardized terminology translation:

- Prime category = Primary subfield
- Abstract category = secondary subfield
- Major category = field

Original data: https://www.dropbox.com/scl/fo/wwu0ifghw4sco09g67frb/h?rlkey=6ddg3yab9la3zeddvmnsfktxq&e=1&dl=0

**Minor (default)**: Dataset of papers between 2010 and 2020 (with some pre-2010 papers) with balanced primary subfields.

**Major**: Dataset of papers between 2010 and 2020 (with some pre-2010 papers) with unbalanced primary subfields to better represent the true distribution of primary categories, which is dominated by a few subfields. Note that the distribution of major subfields is still truncated.

**All 2023**: All papers published in 2023.

- Train: papers with date between January and June (inclusive)
- Test: papers with date between July and December (inclusive)

![Primary subfield distribution](figures/Primary_subfield_distribution.png)

![Secondary subfield distribution](figures/Secondary_subfield_distribution.png)

To generate subfield distribution plots:

```
python plot_subfield_distributions.py --output_path <path_to_save_plots>
```

## Set up dependencies

Get subfields to ignore and subfield aliases:

```
# in your conda env
git clone https://github.com/ag2435/arxiv-classifier-next
cd arxiv-classifier-next
conda develop .
```

## Preprocessing

Transform raw data into JSON format:

```
# major/minor cats data
python preprocess_major_minor.py -d <DATASET NAME> -s <SPLIT> -op <PATH TO SAVE PREPROCESSED DATA>

# all 2023 corpus (full text)
python preprocess_all2023_v2.py
```

Additional data checks:

```
python test_paper_id.py
```

:white_check_mark: Checked that there is no data leakage between train and test splits for each dataset config

:white_check_mark: Checked validity of arXiv identifiers for each paper
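The two data checks listed above (no train/test leakage, valid arXiv identifiers) could be sketched roughly as follows. This is not the contents of `test_paper_id.py`, which is not shown here. The `paper_id` field name, the helper names, and the toy records are all hypothetical, and the regular expressions only approximate arXiv's public new-style (`YYMM.NNNNN`) and old-style (`archive/YYMMNNN`) identifier schemes without validating the month digits.

```python
import re

# Assumed identifier patterns based on arXiv's public scheme,
# not taken from test_paper_id.py.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")              # e.g. 2301.01234
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")  # e.g. hep-th/9901001


def is_valid_arxiv_id(paper_id):
    """Return True if paper_id matches the new- or old-style pattern."""
    return bool(NEW_STYLE.match(paper_id) or OLD_STYLE.match(paper_id))


def check_splits(train, test, id_key="paper_id"):
    """Return (leaked_ids, invalid_ids) for two lists of record dicts.

    id_key is a hypothetical field name; adjust it to the real schema.
    """
    train_ids = {row[id_key] for row in train}
    test_ids = {row[id_key] for row in test}
    leaked = train_ids & test_ids  # identifiers present in both splits
    invalid = {pid for pid in train_ids | test_ids if not is_valid_arxiv_id(pid)}
    return leaked, invalid


# Toy records for illustration only; they are not taken from the dataset.
toy_train = [{"paper_id": "2301.01234"}, {"paper_id": "hep-th/9901001"}]
toy_test = [
    {"paper_id": "2307.05678"},
    {"paper_id": "2301.01234"},  # deliberately duplicated from toy_train
    {"paper_id": "not-an-id"},   # deliberately malformed
]

leaked, invalid = check_splits(toy_train, toy_test)
print(sorted(leaked))   # ['2301.01234']
print(sorted(invalid))  # ['not-an-id']
```

On real data, both returned sets should be empty for every config. The same function can be applied to splits loaded with `load_dataset` by passing each split's rows, provided the identifier field name is adjusted.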

Provider:

kilian-group
