celtik3/NLP_Final_Project_ArXiv_Parsed2
Source: Hugging Face | Updated: 2025-04-11 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/celtik3/NLP_Final_Project_ArXiv_Parsed2
Description:
The dataset contains text chunks together with related metadata. Its features are: the text content as a Markdown-formatted string; metadata of the source PDF document (ID, title, and categories); header metadata (up to six levels of headers); and the type of each text chunk. The dataset provides a training split of 27,822,894 bytes containing 13,122 examples.
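
To make the described record layout concrete, here is a minimal Python sketch of what one example might look like. The class names, field names, and the placeholder record are assumptions made for illustration; the description only states which kinds of information are present, not the actual column names. The sketch also derives the average size per example from the two figures quoted above, which works out to roughly 2.1 KB.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Figures quoted in the description above.
TRAIN_SPLIT_BYTES = 27_822_894
TRAIN_SPLIT_EXAMPLES = 13_122


@dataclass
class PdfMetadata:
    # Hypothetical field names; the description only lists ID, title, categories.
    id: str
    title: str
    categories: str


@dataclass
class Chunk:
    # Hypothetical record layout, not the dataset's confirmed schema.
    text: str                      # Markdown-formatted chunk text
    pdf: PdfMetadata               # metadata of the source PDF
    headers: Dict[int, str] = field(default_factory=dict)  # level (1-6) -> heading
    chunk_type: str = "text"       # type of the text chunk


def header_path(chunk: Chunk) -> List[str]:
    """Return the chunk's headings ordered from level 1 down to level 6."""
    return [chunk.headers[level] for level in sorted(chunk.headers) if 1 <= level <= 6]


def average_bytes_per_example(total_bytes: int, n_examples: int) -> float:
    """Average size of one example, in bytes."""
    return total_bytes / n_examples


# A made-up placeholder record, not taken from the dataset.
example = Chunk(
    text="## Method\nWe parse each PDF into Markdown chunks.",
    pdf=PdfMetadata(id="0000.00000", title="Placeholder title", categories="cs.CL"),
    headers={2: "Method", 1: "Placeholder title"},
)

print(header_path(example))
print(round(average_bytes_per_example(TRAIN_SPLIT_BYTES, TRAIN_SPLIT_EXAMPLES), 1))
```

Keeping the headers in a level-keyed mapping is one possible design; it makes it easy to rebuild the section path of a chunk regardless of which of the six levels are filled in.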
Provider:
celtik3



