yasalma/tt-structured-content
Source: Hugging Face | Updated: 2026-02-28 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/yasalma/tt-structured-content
Description:
This dataset contains structured text in Markdown format extracted from 1,278 Tatar-language documents that were originally in EPUB and PDF formats. The documents are books and other long-form content with rich formatting. The dataset is intended to provide clean, structured, and semantically meaningful content for natural language processing tasks, content modeling, and research in Tatar language technologies. The extracted Markdown preserves key structural elements: headings, paragraphs, images, tables, tables of contents, document titles, footnotes, and mathematical and chemical formulas (in LaTeX format). Extraneous elements such as page numbers, headers, and footers have been removed to yield clean, usable text.
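As a rough illustration of how the preserved structure might be used, the sketch below counts a few structural elements in a Markdown string using only the Python standard library. The sample text, the function name `summarize_markdown`, and the exact syntax conventions (ATX `#` headings, pipe tables, `[^1]` footnotes, `$...$` inline math) are assumptions made for the example; the listing does not specify which Markdown conventions the dataset actually uses.

```python
import re

# Hypothetical sample document; NOT taken from the dataset.
SAMPLE = """# Китап

## Беренче бүлек

Текст.[^1]

| A | B |
|---|---|
| 1 | 2 |

Formula: $E = mc^2$

[^1]: Искәрмә.
"""


def summarize_markdown(text: str) -> dict:
    """Count a few structural elements, assuming common Markdown syntax."""
    # ATX headings: one to six '#' characters followed by the heading text.
    headings = re.findall(r"^(#{1,6})\s+(.+)$", text, flags=re.MULTILINE)
    # Pipe-table lines (header, separator, and body rows alike).
    table_lines = re.findall(r"^\|.*\|\s*$", text, flags=re.MULTILINE)
    # Footnote definitions in the assumed '[^label]: text' form.
    footnotes = re.findall(r"^\[\^([^\]]+)\]:", text, flags=re.MULTILINE)
    # Inline LaTeX delimited by single dollar signs.
    inline_math = re.findall(r"(?<!\$)\$[^$\n]+\$(?!\$)", text)
    return {
        "headings": [(len(marks), title) for marks, title in headings],
        "table_lines": len(table_lines),
        "footnotes": footnotes,
        "inline_math": inline_math,
    }


summary = summarize_markdown(SAMPLE)
```

A regex pass like this is only a quick inspection aid; a full Markdown parser would be a safer choice for anything beyond simple counts.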
Provider:
yasalma



