effectiveML/ArXiv-10

Source: Hugging Face. Updated 2024-10-23; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/effectiveML/ArXiv-10
Description:
The ArXiv-10 dataset consists of titles and abstracts extracted from 100,000 scientific papers on ArXiv, covering ten distinct research categories that span subfields of computer science, physics, and mathematics. Each category is downsampled to exactly 10,000 samples, so the classes are balanced and the dataset stays manageable. The dataset is a practical resource for researchers and practitioners working on text classification of scientific literature. Its intricate language and domain-specific terminology make it challenging for text classification models, which need a deep understanding of context and semantic relationships to tell the categories apart.
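
The per-category downsampling mentioned in the description can be illustrated with a small sketch. This is not the dataset authors' actual procedure, which the listing does not document. It assumes records arrive as (text, label) pairs, and the function name `downsample_per_category`, the toy labels, and the fixed seed are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def downsample_per_category(records, per_category, seed=0):
    """Return exactly `per_category` randomly chosen records for each label.

    `records` is an iterable of (text, label) pairs. The pair layout and the
    function name are illustrative assumptions, not the dataset's schema.
    """
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):  # sorted order makes the output deterministic
        texts = by_label[label]
        if len(texts) < per_category:
            raise ValueError(f"category {label!r} has only {len(texts)} records")
        # random.sample draws without replacement, so no record is repeated
        for text in rng.sample(texts, per_category):
            balanced.append((text, label))
    return balanced


# Toy corpus with uneven category sizes (made-up labels and texts).
toy = [(f"cs paper {i}", "cs") for i in range(5)]
toy += [(f"math paper {i}", "math") for i in range(3)]
toy += [(f"physics paper {i}", "physics") for i in range(4)]

balanced = downsample_per_category(toy, per_category=3)
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'cs': 3, 'math': 3, 'physics': 3}
```

With `per_category=10_000` and ten categories, the same approach would yield the 100,000 records the description reports, provided every category has at least 10,000 source papers.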
Provider:
effectiveML