celtik3/NLP_Final_Project_ArXiv_Parsed
Hugging Face | Updated 2025-04-11 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/celtik3/NLP_Final_Project_ArXiv_Parsed
Description:
This dataset contains Markdown text extracted from documents, together with PDF metadata, header metadata, and chunk metadata. The PDF metadata records each document's unique identifier, title, and category information. The header metadata holds the document's headers at five different levels. The chunk metadata describes the type of each text chunk. The dataset has a single training split with 12,985 examples and a total size of 27,722,461 bytes.
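The layout described above can be sketched as a typed record. This is only an illustration under stated assumptions: the listing names the four groups of fields but not their exact keys, so every field name here (`markdown_text`, `pdf_metadata`, `header_1` through `header_5`, `chunk_type`) and the helper `header_path` are hypothetical, and the sample record is made up. The only figures taken from the listing are the example count and the byte size.

```python
from typing import List, Optional, TypedDict

# All field names below are ASSUMPTIONS for illustration; the listing does not
# give the real column names.


class PdfMetadata(TypedDict):
    id: str          # unique document identifier
    title: str       # document title
    categories: str  # category information


class HeaderMetadata(TypedDict):
    # One slot per header level, 1 (top) to 5 (deepest); None when absent.
    header_1: Optional[str]
    header_2: Optional[str]
    header_3: Optional[str]
    header_4: Optional[str]
    header_5: Optional[str]


class ChunkMetadata(TypedDict):
    chunk_type: str  # type of the text chunk


class Record(TypedDict):
    markdown_text: str
    pdf_metadata: PdfMetadata
    header_metadata: HeaderMetadata
    chunk_metadata: ChunkMetadata


def header_path(record: Record) -> List[str]:
    """Return the non-empty headers of a record, ordered from level 1 to 5."""
    headers = record["header_metadata"]
    return [
        headers[f"header_{level}"]
        for level in range(1, 6)
        if headers.get(f"header_{level}")
    ]


# Figures stated in the listing.
NUM_EXAMPLES = 12_985
TOTAL_BYTES = 27_722_461


def average_bytes_per_example() -> float:
    """Average size of one example, derived from the two stated totals."""
    return TOTAL_BYTES / NUM_EXAMPLES


# A made-up record, used only to exercise the helper.
example: Record = {
    "markdown_text": "## Background\nSome extracted text.",
    "pdf_metadata": {"id": "doc-0", "title": "Sample title", "categories": "cs.CL"},
    "header_metadata": {
        "header_1": "Introduction",
        "header_2": "Background",
        "header_3": None,
        "header_4": None,
        "header_5": None,
    },
    "chunk_metadata": {"chunk_type": "text"},
}

print(header_path(example))
print(round(average_bytes_per_example()))
```

Dividing the stated totals gives roughly 2,135 bytes per example, which suggests that each example is a fairly short chunk rather than a whole document.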
Provider:
celtik3



