
LLMsForHepth/hep-th_hep-ph_gr-qc_primary

Hugging Face · Updated 2024-09-20 · Indexed 2025-04-26
Download link:
https://hf-mirror.com/datasets/LLMsForHepth/hep-th_hep-ph_gr-qc_primary
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: submitter
    dtype: string
  - name: authors
    dtype: string
  - name: title
    dtype: string
  - name: comments
    dtype: string
  - name: journal-ref
    dtype: string
  - name: doi
    dtype: string
  - name: report-no
    dtype: string
  - name: categories
    dtype: string
  - name: license
    dtype: string
  - name: orig_abstract
    dtype: string
  - name: versions
    list:
    - name: created
      dtype: string
    - name: version
      dtype: string
  - name: update_date
    dtype: string
  - name: authors_parsed
    sequence:
      sequence: string
  - name: abstract
    dtype: string
  splits:
  - name: train
    num_bytes: 443514798.708732
    num_examples: 210905
  - name: test
    num_bytes: 95039035.64563398
    num_examples: 45194
  - name: validation
    num_bytes: 95039035.64563398
    num_examples: 45194
  download_size: 355804687
  dataset_size: 633592869.9999999
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---

# Dataset Card for hep-th_hep-ph_gr-qc_primary Dataset

## Dataset Description

- **Homepage:** [Kaggle arXiv Dataset Homepage](https://www.kaggle.com/Cornell-University/arxiv)
- **Repository:** [hepthLlama](https://github.com/Paul-Richmond/hepthLlama)
- **Paper:** [tbd](tbd)
- **Point of Contact:** [Paul Richmond](mailto:p.richmond@qmul.ac.uk)

### Dataset Summary

This dataset contains metadata included in arXiv submissions.

## Dataset Structure

An example from the dataset looks as follows:

```
{'id': '0908.2896',
 'submitter': 'Paul Richmond',
 'authors': 'Neil Lambert, Paul Richmond',
 'title': 'M2-Branes and Background Fields',
 'comments': '19 pages',
 'journal-ref': 'JHEP 0910:084,2009',
 'doi': '10.1088/1126-6708/2009/10/084',
 'report-no': None,
 'categories': 'hep-th',
 'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'abstract': ' We discuss the coupling of multiple M2-branes to the background 3-form and\n6-form gauge fields of eleven-dimensional supergravity, including the coupling\nof the Fermions. In particular we show in detail how a natural generalization\nof the Myers flux-terms, along with the resulting curvature of the background\nmetric, leads to mass terms in the effective field theory.\n',
 'versions': [{'created': 'Thu, 20 Aug 2009 14:23:37 GMT', 'version': 'v1'}],
 'update_date': '2009-11-09',
 'authors_parsed': [['Lambert', 'Neil', ''], ['Richmond', 'Paul', '']]}
```

### Languages

The text in the `abstract` field of the dataset is in English; however, there may be examples where the abstract also contains a translation into another language.

## Dataset Creation

### Curation Rationale

The starting point was to load v193 of the Kaggle arXiv Dataset, which includes arXiv submissions up to 23rd August 2024.
The arXiv dataset contains the following data fields:

- `id`: ArXiv ID (can be used to access the paper)
- `submitter`: Who submitted the paper
- `authors`: Authors of the paper
- `title`: Title of the paper
- `comments`: Additional info, such as number of pages and figures
- `journal-ref`: Information about the journal the paper was published in
- `doi`: [Digital Object Identifier](https://www.doi.org)
- `report-no`: Report Number
- `abstract`: The abstract of the paper
- `categories`: Categories / tags in the ArXiv system

To arrive at the hep-th_hep-ph_gr-qc_primary dataset, the full arXiv data was first filtered so that only entries whose `categories` included 'hep-th', 'hep-ph' or 'gr-qc' were retained. This resulted in papers that were either primarily classified as 'hep-th', 'hep-ph' or 'gr-qc' or cross-listed in one of them.

For this dataset, the decision was made to focus only on papers primarily classified as any of 'hep-th', 'hep-ph' or 'gr-qc'. This meant taking only those entries where the first characters in `categories` were any of 'hep-th', 'hep-ph' or 'gr-qc' (see [here](https://info.arxiv.org/help/arxiv_identifier_for_services.html#indications-of-classification) for more details).

We also dropped entries whose `abstract` or `comments` contained the word 'Withdrawn' or 'withdrawn', and we removed the five records which appear in the repo `LLMsForHepth/arxiv_hepth_first_overfit`.

In addition, we cleaned the data appearing in `abstract` by first replacing all occurrences of '\n' with a space and then removing any leading and trailing whitespace.

### Data splits

The dataset is split into training, validation and test sets with split percentages 70%, 15% and 15%. This was done by applying `train_test_split` twice (both with `seed=42`).

The final split sizes are as follows:

| Train | Test | Validation |
|:---:|:---:|:---:|
| 210,905 | 45,194 | 45,194 |
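
The filtering, cleaning and split arithmetic described above can be sketched in plain Python. This is a minimal illustration, not the authors' actual pipeline: the function names (`primary_category`, `is_withdrawn`, `clean_abstract`, `curate`, `split_sizes`) are hypothetical, the sketch assumes `categories` is a space-separated string with the primary category first, it omits the removal of the five `arxiv_hepth_first_overfit` records, and it assumes the test portion of each split is rounded up.

```python
import math

# The three primary categories named in the dataset card.
PRIMARY_CATEGORIES = ("hep-th", "hep-ph", "gr-qc")


def primary_category(categories: str) -> str:
    """Return the first entry of a space-separated category string.

    Assumption: the primary classification is listed first.
    """
    parts = categories.split()
    return parts[0] if parts else ""


def is_withdrawn(record: dict) -> bool:
    """True if `abstract` or `comments` contains 'Withdrawn' or 'withdrawn'."""
    for field in ("abstract", "comments"):
        text = record.get(field) or ""  # fields such as `comments` may be None
        if "Withdrawn" in text or "withdrawn" in text:
            return True
    return False


def clean_abstract(abstract: str) -> str:
    """Replace newlines with spaces, then strip leading/trailing whitespace."""
    return abstract.replace("\n", " ").strip()


def curate(records: list) -> list:
    """Keep primary hep-th/hep-ph/gr-qc records that are not withdrawn."""
    kept = []
    for record in records:
        if primary_category(record["categories"]) not in PRIMARY_CATEGORIES:
            continue  # cross-listed only, or unrelated category
        if is_withdrawn(record):
            continue
        cleaned = dict(record)  # copy so the input record is left untouched
        cleaned["abstract"] = clean_abstract(record["abstract"])
        kept.append(cleaned)
    return kept


def split_sizes(n_total: int, test_fraction: float) -> tuple:
    """Return (n_train, n_test), assuming the test size is rounded up."""
    n_test = math.ceil(n_total * test_fraction)
    return n_total - n_test, n_test


if __name__ == "__main__":
    total = 210_905 + 45_194 + 45_194  # sum of the three published split sizes
    n_train, n_held_out = split_sizes(total, 0.3)
    print(n_train, split_sizes(n_held_out, 0.5))
```

Under these assumptions, a 70/30 split of the 301,293 records followed by a 50/50 split of the held-out 30% reproduces the published sizes of 210,905, 45,194 and 45,194. The card does not say which half of the second split became the test set and which the validation set.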
Provided by:
LLMsForHepth