
almanach/halvest-contrastive

Source: Hugging Face. Updated: 2025-11-26. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/almanach/halvest-contrastive
Description:
---
dataset_info:
- config_name: base-10
  features:
  - name: query_halid
    dtype: string
  - name: query
    dtype: string
  - name: query_year
    dtype: string
  - name: query_domain
    sequence: string
  - name: query_affiliations
    sequence: string
  - name: query_authorids
    sequence: string
  - name: pos_halid
    dtype: string
  - name: positive
    dtype: string
  - name: pos_year
    dtype: string
  - name: pos_domain
    sequence: string
  - name: pos_affiliations
    sequence: string
  - name: pos_authorids
    sequence: string
  - name: neg_halids
    dtype: string
  - name: negative
    dtype: string
  - name: neg_year
    dtype: string
  - name: neg_domain
    sequence: string
  - name: neg_affiliations
    sequence: string
  - name: neg_authorids
    sequence: string
  splits:
  - name: train
    num_bytes: 3834152941.0486026
    num_examples: 730394
  - name: test
    num_bytes: 39129259.033585
    num_examples: 7454
  - name: valid
    num_bytes: 39124009.60253676
    num_examples: 7453
  download_size: 2227392557
  dataset_size: 3912406209.6847243
- config_name: base-2
  features:
  - name: query_halid
    dtype: string
  - name: query
    dtype: string
  - name: query_year
    dtype: string
  - name: query_domain
    sequence: string
  - name: query_affiliations
    sequence: string
  - name: query_authorids
    sequence: string
  - name: pos_halid
    dtype: string
  - name: positive
    dtype: string
  - name: pos_year
    dtype: string
  - name: pos_domain
    sequence: string
  - name: pos_affiliations
    sequence: string
  - name: pos_authorids
    sequence: string
  - name: neg_halids
    dtype: string
  - name: negative
    dtype: string
  - name: neg_year
    dtype: string
  - name: neg_domain
    sequence: string
  - name: neg_affiliations
    sequence: string
  - name: neg_authorids
    sequence: string
  splits:
  - name: train
    num_bytes: 2439002041.429833
    num_examples: 1875668
  - name: test
    num_bytes: 24888465.908128202
    num_examples: 19140
  - name: valid
    num_bytes: 24887165.57030646
    num_examples: 19139
  download_size: 1472798499
  dataset_size: 2488777672.9082675
- config_name: base-4
  features:
  - name: query_halid
    dtype: string
  - name: query
    dtype: string
  - name: query_year
    dtype: string
  - name: query_domain
    sequence: string
  - name: query_affiliations
    sequence: string
  - name: query_authorids
    sequence: string
  - name: pos_halid
    dtype: string
  - name: positive
    dtype: string
  - name: pos_year
    dtype: string
  - name: pos_domain
    sequence: string
  - name: pos_affiliations
    sequence: string
  - name: pos_authorids
    sequence: string
  - name: neg_halids
    dtype: string
  - name: negative
    dtype: string
  - name: neg_year
    dtype: string
  - name: neg_domain
    sequence: string
  - name: neg_affiliations
    sequence: string
  - name: neg_authorids
    sequence: string
  splits:
  - name: train
    num_bytes: 3207787504.4813294
    num_examples: 1401495
  - name: test
    num_bytes: 32732595.62223732
    num_examples: 14301
  - name: valid
    num_bytes: 32732595.62223732
    num_examples: 14301
  download_size: 1922379318
  dataset_size: 3273252695.725804
- config_name: base-6
  features:
  - name: query_halid
    dtype: string
  - name: query
    dtype: string
  - name: query_year
    dtype: string
  - name: query_domain
    sequence: string
  - name: query_affiliations
    sequence: string
  - name: query_authorids
    sequence: string
  - name: pos_halid
    dtype: string
  - name: positive
    dtype: string
  - name: pos_year
    dtype: string
  - name: pos_domain
    sequence: string
  - name: pos_affiliations
    sequence: string
  - name: pos_authorids
    sequence: string
  - name: neg_halids
    dtype: string
  - name: negative
    dtype: string
  - name: neg_year
    dtype: string
  - name: neg_domain
    sequence: string
  - name: neg_affiliations
    sequence: string
  - name: neg_authorids
    sequence: string
  splits:
  - name: train
    num_bytes: 3642651240.5619364
    num_examples: 1111598
  - name: test
    num_bytes: 37170445.63024946
    num_examples: 11343
  - name: valid
    num_bytes: 37170445.63024946
    num_examples: 11343
  download_size: 2152793615
  dataset_size: 3716992131.8224354
- config_name: base-8
  features:
  - name: query_halid
    dtype: string
  - name: query
    dtype: string
  - name: query_year
    dtype: string
  - name: query_domain
    sequence: string
  - name: query_affiliations
    sequence: string
  - name: query_authorids
    sequence: string
  - name: pos_halid
    dtype: string
  - name: positive
    dtype: string
  - name: pos_year
    dtype: string
  - name: pos_domain
    sequence: string
  - name: pos_affiliations
    sequence: string
  - name: pos_authorids
    sequence: string
  - name: neg_halids
    dtype: string
  - name: negative
    dtype: string
  - name: neg_year
    dtype: string
  - name: neg_domain
    sequence: string
  - name: neg_affiliations
    sequence: string
  - name: neg_authorids
    sequence: string
  splits:
  - name: train
    num_bytes: 3802480108.820647
    num_examples: 891803
  - name: test
    num_bytes: 38804950.72384451
    num_examples: 9101
  - name: valid
    num_bytes: 38800686.91209593
    num_examples: 9100
  download_size: 2230176353
  dataset_size: 3880085746.4565873
- config_name: ict-1
  features:
  - name: halid
    dtype: string
  - name: year
    dtype: string
  - name: affiliations
    sequence: string
  - name: domains
    sequence: string
  - name: authors
    sequence: string
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 270852390.17447275
    num_examples: 295645
  - name: test
    num_bytes: 2763996.215584178
    num_examples: 3017
  - name: valid
    num_bytes: 2763996.215584178
    num_examples: 3017
  download_size: 181600181
  dataset_size: 276380382.60564107
- config_name: ict-2
  features:
  - name: halid
    dtype: string
  - name: year
    dtype: string
  - name: affiliations
    sequence: string
  - name: domains
    sequence: string
  - name: authors
    sequence: string
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 347528995.34114176
    num_examples: 203771
  - name: test
    num_bytes: 3547415.0409507477
    num_examples: 2080
  - name: valid
    num_bytes: 3545709.552950291
    num_examples: 2079
  download_size: 221935538
  dataset_size: 354622119.9350428
- config_name: ict-3
  features:
  - name: halid
    dtype: string
  - name: year
    dtype: string
  - name: affiliations
    sequence: string
  - name: domains
    sequence: string
  - name: authors
    sequence: string
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 433973609.79633397
    num_examples: 174150
  - name: test
    num_bytes: 4430692.381383185
    num_examples: 1778
  - name: valid
    num_bytes: 4428200.428412779
    num_examples: 1777
  download_size: 271221960
  dataset_size: 442832502.60612994
- config_name: ict-4
  features:
  - name: halid
    dtype: string
  - name: year
    dtype: string
  - name: affiliations
    sequence: string
  - name: domains
    sequence: string
  - name: authors
    sequence: string
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 491254281.5417554
    num_examples: 149850
  - name: test
    num_bytes: 5015809.481207112
    num_examples: 1530
  - name: valid
    num_bytes: 5012531.17435665
    num_examples: 1529
  download_size: 302317093
  dataset_size: 501282622.19731915
configs:
- config_name: base-10
  data_files:
  - split: train
    path: base-10/train-*
  - split: test
    path: base-10/test-*
  - split: valid
    path: base-10/valid-*
- config_name: base-2
  data_files:
  - split: train
    path: base-2/train-*
  - split: test
    path: base-2/test-*
  - split: valid
    path: base-2/valid-*
- config_name: base-4
  data_files:
  - split: train
    path: base-4/train-*
  - split: test
    path: base-4/test-*
  - split: valid
    path: base-4/valid-*
- config_name: base-6
  data_files:
  - split: train
    path: base-6/train-*
  - split: test
    path: base-6/test-*
  - split: valid
    path: base-6/valid-*
- config_name: base-8
  data_files:
  - split: train
    path: base-8/train-*
  - split: test
    path: base-8/test-*
  - split: valid
    path: base-8/valid-*
- config_name: ict-1
  data_files:
  - split: train
    path: ict-1/train-*
  - split: test
    path: ict-1/test-*
  - split: valid
    path: ict-1/valid-*
- config_name: ict-2
  data_files:
  - split: train
    path: ict-2/train-*
  - split: test
    path: ict-2/test-*
  - split: valid
    path: ict-2/valid-*
- config_name: ict-3
  data_files:
  - split: train
    path: ict-3/train-*
  - split: test
    path: ict-3/test-*
  - split: valid
    path: ict-3/valid-*
- config_name: ict-4
  data_files:
  - split: train
    path: ict-4/train-*
  - split: test
    path: ict-4/test-*
  - split: valid
    path: ict-4/valid-*
task_categories:
- text-classification
- feature-extraction
language:
- en
pretty_name: HALvest-Contrastive
size_categories:
- 1M<n<10M
---

<div align="center">
    <h1> HALvest-Contrastive </h1>
    <h3> Contrastive triplets Harvested from HAL </h3>
</div>

---

## Citation

```bib
@misc{kulumba2024harvestingtextualstructureddata,
      title={Harvesting Textual and Structured Data from the HAL Publication Repository},
      author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary},
      year={2024},
      eprint={2407.20595},
      archivePrefix={arXiv},
      primaryClass={cs.DL},
      url={https://arxiv.org/abs/2407.20595},
}
```

## Dataset Copyright

The licence terms for HALvest strictly follow those of HAL. Please refer to the license below when using this dataset.

- [HAL license](https://doc.archives-ouvertes.fr/en/legal-aspects/)
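
The configurations and column names listed in the metadata can be used roughly as follows. This is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed and that the Hub is reachable, and the helper names `to_triplet` and `load_split` are hypothetical. Every configuration exposes `query`, `positive`, and `negative` string columns, so the same helper covers both the `base-*` and `ict-*` families.

```python
from typing import Mapping, Tuple

# Repository id and configuration names as listed in the dataset metadata.
REPO_ID = "almanach/halvest-contrastive"
BASE_CONFIGS = ["base-2", "base-4", "base-6", "base-8", "base-10"]
ICT_CONFIGS = ["ict-1", "ict-2", "ict-3", "ict-4"]


def to_triplet(row: Mapping[str, object]) -> Tuple[str, str, str]:
    """Extract the (query, positive, negative) texts from one row.

    Hypothetical helper: it relies only on the three columns that the
    metadata declares for every configuration.
    """
    return str(row["query"]), str(row["positive"]), str(row["negative"])


def load_split(config: str = "ict-1", split: str = "valid"):
    """Load one split of one configuration from the Hugging Face Hub.

    Hypothetical helper: requires `pip install datasets` and network access.
    """
    if config not in BASE_CONFIGS + ICT_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in ("train", "test", "valid"):
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so that the pure helpers above work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


# Offline illustration on a made-up row (the values are placeholders,
# not real dataset content).
example_row = {
    "halid": "placeholder-id",
    "query": "query text",
    "positive": "matching text",
    "negative": "non-matching text",
}
print(to_triplet(example_row))
```

Calling `load_split("base-2", "train")` would download the largest configuration by example count (1,875,668 training rows), so starting with a `valid` or `test` split of an `ict-*` configuration is a cheaper way to inspect the data.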
Provider:
almanach