
almanach/HALvest-Geometric

Hugging Face · updated 2024-07-31 · indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/almanach/HALvest-Geometric
---
pretty_name: HALvest-Geometric
license: cc-by-4.0
configs:
  - config_name: en
    data_files: "en/*.gz"
  - config_name: fr
    data_files: "fr/*.gz"
language:
  - en
  - fr
size_categories:
  - 100K<n<1M
task_categories:
  - text-generation
  - fill-mask
task_ids:
  - language-modeling
  - masked-language-modeling
tags:
  - academia
  - research
  - graph
annotations_creators:
  - no-annotation
multilinguality:
  - multilingual
source_datasets:
  - HALvest
---

<div align="center">
    <h1> HALvest-Geometric </h1>
    <h3> Citation Network of Open Scientific Papers Harvested from HAL </h3>
</div>

---

## Dataset Description

- **Repository:** [GitHub](https://github.com/Madjakul/HALvesting-Geometric)

## Dataset Summary

### Overview

French and English full texts of open papers found on [Hyper Articles en Ligne (HAL)](https://hal.science/), together with their citation network.

You can download the dataset using Hugging Face datasets:

```py
from datasets import load_dataset

ds = load_dataset("Madjakul/HALvest-Geometric", "en")
```

### Details

#### Nodes

* Papers: 18,662,037
* Authors: 238,397
* Affiliations: 96,105
* Domains: 16

#### Edges

- Paper <-> Domain: 136,700
- Paper <-> Paper: 22,363,817
- Author <-> Paper: 238,397
- Author <-> Affiliation: 426,030

### Languages

ISO-639|Language|# Documents|# mT5 Tokens
-------|--------|-----------|--------
en|English|442,892|7,606,895,258
fr|French|193,437|8,728,722,255

## Considerations for Using the Data

The corpus is extracted from [HAL's open archive](https://hal.science/), which distributes scientific publications following open access principles. The corpus is made up of both Creative Commons licensed and copyrighted documents (distribution authorized on HAL by the publisher). This must be taken into account before using the dataset for any purpose other than training deep learning models, data mining, and similar uses. We do not own any of the text from which these data have been extracted.

## Dataset Copyright

The licence terms for HALvest strictly follow those of HAL. Please refer to the license below when using this dataset.

- [HAL license](https://doc.archives-ouvertes.fr/en/legal-aspects/)

## Citation

```
@misc{kulumba2024harvestingtextualstructureddata,
      title={Harvesting Textual and Structured Data from the HAL Publication Repository},
      author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary},
      year={2024},
      eprint={2407.20595},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2407.20595},
}
```
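The figures reported in the Languages table and the Details section can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from this card, the variable and function names are hypothetical, and reading `Paper <-> Paper` edges as citation links is an assumption based on the card's description of a citation network.

```python
# Figures copied from the "Languages" table and the "Details" section of this card.
LANGUAGES = {
    "en": {"documents": 442_892, "mt5_tokens": 7_606_895_258},
    "fr": {"documents": 193_437, "mt5_tokens": 8_728_722_255},
}
PAPER_NODES = 18_662_037
PAPER_PAPER_EDGES = 22_363_817  # assumed to be citation links


def tokens_per_document(stats):
    """Mean number of mT5 tokens per document for one language row."""
    return stats["mt5_tokens"] / stats["documents"]


# Corpus-wide totals across both language configurations.
total_documents = sum(row["documents"] for row in LANGUAGES.values())
total_tokens = sum(row["mt5_tokens"] for row in LANGUAGES.values())

# Average number of paper-to-paper edges per paper node.
citations_per_paper = PAPER_PAPER_EDGES / PAPER_NODES

for code, row in LANGUAGES.items():
    print(f"{code}: {tokens_per_document(row):,.0f} mT5 tokens per document")
print(f"total documents: {total_documents:,}")
print(f"total mT5 tokens: {total_tokens:,}")
print(f"paper-paper edges per paper node: {citations_per_paper:.2f}")
```

The combined document count of the two configurations falls inside the `100K<n<1M` bracket declared under `size_categories` in the card metadata.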

Provider:
almanach