
almanach/HALvest

Hugging Face · updated 2024-07-31 · indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/almanach/HALvest
Resource description:
---
pretty_name: HALvest
configs:
- config_name: ar
  data_files: "ar/*.gz"
- config_name: az
  data_files: "az/*.gz"
- config_name: bg
  data_files: "bg/*.gz"
- config_name: bo
  data_files: "bo/*.gz"
- config_name: br
  data_files: "br/*.gz"
- config_name: bs
  data_files: "bs/*.gz"
- config_name: ca
  data_files: "ca/*.gz"
- config_name: co
  data_files: "co/*.gz"
- config_name: cs
  data_files: "cs/*.gz"
- config_name: da
  data_files: "da/*.gz"
- config_name: de
  data_files: "de/*.gz"
- config_name: el
  data_files: "el/*.gz"
- config_name: en
  data_files: "en/*.gz"
- config_name: eo
  data_files: "eo/*.gz"
- config_name: es
  data_files: "es/*.gz"
- config_name: et
  data_files: "et/*.gz"
- config_name: eu
  data_files: "eu/*.gz"
- config_name: fa
  data_files: "fa/*.gz"
- config_name: fi
  data_files: "fi/*.gz"
- config_name: fr
  data_files: "fr/*.gz"
- config_name: gl
  data_files: "gl/*.gz"
- config_name: gn
  data_files: "gn/*.gz"
- config_name: he
  data_files: "he/*.gz"
- config_name: hi
  data_files: "hi/*.gz"
- config_name: hr
  data_files: "hr/*.gz"
- config_name: hu
  data_files: "hu/*.gz"
- config_name: hy
  data_files: "hy/*.gz"
- config_name: id
  data_files: "id/*.gz"
- config_name: ie
  data_files: "ie/*.gz"
- config_name: it
  data_files: "it/*.gz"
- config_name: ja
  data_files: "ja/*.gz"
- config_name: kk
  data_files: "kk/*.gz"
- config_name: ko
  data_files: "ko/*.gz"
- config_name: lt
  data_files: "lt/*.gz"
- config_name: mk
  data_files: "mk/*.gz"
- config_name: mr
  data_files: "mr/*.gz"
- config_name: "no"
  data_files: "no/*.gz"
- config_name: oc
  data_files: "oc/*.gz"
- config_name: pl
  data_files: "pl/*.gz"
- config_name: pt
  data_files: "pt/*.gz"
- config_name: ro
  data_files: "ro/*.gz"
- config_name: ru
  data_files: "ru/*.gz"
- config_name: sk
  data_files: "sk/*.gz"
- config_name: sl
  data_files: "sl/*.gz"
- config_name: sq
  data_files: "sq/*.gz"
- config_name: sr
  data_files: "sr/*.gz"
- config_name: sv
  data_files: "sv/*.gz"
- config_name: sw
  data_files: "sw/*.gz"
- config_name: ta
  data_files: "ta/*.gz"
- config_name: tet
  data_files: "tet/*.gz"
- config_name: th
  data_files: "th/*.gz"
- config_name: tk
  data_files: "tk/*.gz"
- config_name: tl
  data_files: "tl/*.gz"
- config_name: tr
  data_files: "tr/*.gz"
- config_name: uk
  data_files: "uk/*.gz"
- config_name: vi
  data_files: "vi/*.gz"
- config_name: zh
  data_files: "zh/*.gz"
language:
- ar
- az
- bg
- bo
- br
- bs
- ca
- co
- cs
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- gl
- gn
- he
- hi
- hr
- hu
- hy
- id
- ie
- it
- ja
- kk
- ko
- lt
- mk
- mr
- "no"
- oc
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- sw
- ta
- tet
- th
- tk
- tl
- tr
- uk
- vi
- zh
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
tags:
- academia
- research
annotations_creators:
- no-annotation
multilinguality:
- multilingual
source_datasets:
- original
---

<div align="center">
    <h1> HALvest </h1>
    <h3> Open Scientific Papers Harvested from HAL (Unfiltered) </h3>
</div>

---

## Dataset Description

- **Repository:** [GitHub](https://github.com/Madjakul/HALvesting/tree/main)

## Dataset Summary

### Overview

This is the unfiltered version of [HALvest](https://huggingface.co/datasets/Madjakul/HALvest), comprising the full text of open papers found on [Hyper Articles en Ligne (HAL)](https://hal.science/), with extra fields for potential filtering. Our dump is mostly English and French but gathers papers written in 56 languages across 13 domains.

You can download the dataset using Hugging Face datasets:

```py
from datasets import load_dataset

ds = load_dataset("almanach/HALvest", "en")
```

### Details

Building the dataset is a three-step process: data fetching from HAL, data merging and data enriching.

1. We first query [HAL's API](https://api.archives-ouvertes.fr/docs) to gather open research papers and parse the results, effectively sorting papers by language. Then, we download the PDFs of the fetched data.
2. Using [GROBID](https://github.com/kermitt2/grobid), we convert each PDF to an `xml-tei` format in order to have structured data. We convert each `xml-tei` file to a `txt` format before concatenating it with the paper's.
3. Finally, we compute some statistics about each document.

### Languages

Please note that the number of tokens is highly inflated in the raw version of the dataset because of badly encoded PDFs, which translate to gibberish documents/texts.

ISO-639|Language|# Documents|# mT5 Tokens
-------|--------|-----------|------------
en|English|464,679|8,158,933,235
fr|French|199,216|9,018,529,985
es|Spanish|2,975|69,221,667
it|Italian|1,172|48,747,986
pt|Portuguese|934|32,918,832
de|German|652|12,225,960
ru|Russian|245|5,763,532
zh|Chinese|160|2,861,585
eu|Basque|113|2,297,485
ar|Arabic|92|2,167,431
ja|Japanese|92|547,861
el|Greek|54|1,738,878
pl|Polish|43|987,878
ro|Romanian|39|1,298,901
uk|Ukrainian|34|837,793
vi|Vietnamese|29|436,660
ca|Catalan|28|975,078
da|Danish|27|961,955
oc|Occitan|26|285,334
br|Breton|24|998,088
sr|Serbian|24|336,878
ko|Korean|17|226,268
fa|Persian|17|213,903
tr|Turkish|17|149,718
hu|Hungarian|14|577,568
eo|Esperanto|14|105,286
hy|Armenian|10|127,988
cs|Czech|9|712,263
bg|Bulgarian|9|208,763
sq|Albanian|9|98,009
id|Indonesian|9|53,075
he|Hebrew|8|61,283
hr|Croatian|8|40,621
et|Estonian|7|20,405
sv|Swedish|6|270,642
no|Norwegian|6|62,767
az|Azerbaijani|5|52,762
fi|Finnish|4|60,507
tet|Tetum|4|18,485
lt|Lithuanian|3|16,572
mr|Marathi|3|16,386
hi|Hindi|3|3,490
ie|Interlingue|2|140,383
ta|Tamil|2|77,087
sw|Swahili|2|73,921
tl|Tagalog|2|35,962
gl|Galician|2|29,688
mk|Macedonian|2|14,654
th|Thai|1|70,909
tk|Turkmen|1|66,104
bs|Bosnian|1|63,018
kk|Kazakh|1|41,839
sl|Slovenian|1|22,844
sk|Slovak|1|12,997
co|Corsican|1|9,083
gn|Guarani|1|1,566
bo|Tibetan|1|579

### Domains

Please note that the number of tokens is highly inflated in the raw version of the dataset because of badly encoded PDFs, which translate to gibberish documents/texts.

Domain|Code|# Documents|# mT5 Tokens
------|----|-----------|------------
Humanities and Social Sciences|shs|156,566|5,614,423,171
Computer Science|info|148,316|2,573,673,455
Life Sciences|sdv|115,744|3,145,323,780
Engineering Sciences|spi|102,751|2,254,653,825
Physics|phys|65,991|1,503,190,749
Mathematics|math|62,921|1,638,500,361
Chemical Science|chim|40,012|899,507,319
Environmental Science|sde|31,575|579,076,669
Sciences of the Universe|sdu|23,557|682,356,264
Cognitive science|scco|11,772|227,487,096
Statistics|stat|10,579|184,678,350
Quantitative Finance|qfin|3,451|68,518,636
Nonlinear Sciences|nlin|1,972|30,694,088

You can browse every domain and sub-domain here: https://hal.science/browse/domain.

## Considerations for Using the Data

The corpus is extracted from [HAL's open archive](https://hal.science/), which distributes scientific publications following open access principles. The corpus is made up of both Creative Commons licensed and copyrighted documents (distribution authorized on HAL by the publisher). This must be considered prior to using this dataset for any purpose other than training deep learning models, data mining, etc. We do not own any of the text from which this data has been extracted.

## Citation

```bib
@misc{kulumba2024harvestingtextualstructureddata,
      title={Harvesting Textual and Structured Data from the HAL Publication Repository},
      author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary},
      year={2024},
      eprint={2407.20595},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2407.20595},
}
```

## Dataset Copyright

The licence terms for HALvest strictly follow those of HAL. Please refer to the license below when using this dataset.

- [HAL license](https://doc.archives-ouvertes.fr/en/legal-aspects/)
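
As a rough illustration of the token-inflation note in the Languages section, the sketch below computes the mean number of mT5 tokens per document for a few languages, using figures copied from the table. The names `LANGUAGE_STATS` and `tokens_per_document` are illustrative and not part of the dataset or its tooling. A high mean is only a hint: it may reflect longer documents as well as badly encoded PDFs, and the card does not say how much each contributes.

```python
# Figures copied from the Languages table: (documents, mT5 tokens).
# Only a handful of rows are included to keep the sketch short.
LANGUAGE_STATS = {
    "en": (464_679, 8_158_933_235),
    "fr": (199_216, 9_018_529_985),
    "es": (2_975, 69_221_667),
    "de": (652, 12_225_960),
}


def tokens_per_document(stats):
    """Return {language code: mean mT5 tokens per document}."""
    return {lang: tokens / docs for lang, (docs, tokens) in stats.items()}


averages = tokens_per_document(LANGUAGE_STATS)

# Print languages from the highest mean to the lowest.
for lang, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {avg:,.0f} tokens per document")
```

With these figures, French has fewer than half as many documents as English yet more tokens in total, so its mean per document comes out roughly two and a half times higher. That kind of gap is one signal worth checking before training on the raw dump.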
Provider:
almanach