
baobabtech/test-eval-docs-docling-plain

Source: Hugging Face. Updated 2025-12-02, indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/baobabtech/test-eval-docs-docling-plain
Resource description:
---
tags:
- document-processing
- docling
- hierarchical-parsing
- pdf-processing
- generated
---

# PDF Document Processing with Docling

This dataset contains structured markdown extracted from the PDFs in [baobabtech/test-eval-documents](https://huggingface.co/datasets/baobabtech/test-eval-documents) using Docling with hierarchical parsing.

## Processing Details

- **Source Dataset**: [baobabtech/test-eval-documents](https://huggingface.co/datasets/baobabtech/test-eval-documents)
- **Number of PDFs**: 20
- **Processing Time**: 8.4 minutes
- **Processing Date**: 2025-12-02 15:40 UTC

### Configuration

- **PDF Column**: `pdf_bytes`
- **Dataset Split**: `train`

## Dataset Structure

The dataset contains all original columns plus:

- `original_md`: Markdown extracted by Docling (before hierarchical restructuring)
- `hierarchical_md`: Markdown with proper heading hierarchy (after hierarchical processing)
- `sections_toc`: Table of contents (one section per line, indented by level)
- `inference_info`: JSON with processing metadata

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("YOUR_DATASET_ID", split="train")

for example in dataset:
    print(f"Document: {example.get('file_name', 'unknown')}")

    # Original markdown from Docling
    print("=== Original Markdown ===")
    print(example['original_md'][:500])

    # Hierarchical markdown with proper heading levels
    print("\n=== Hierarchical Markdown ===")
    print(example['hierarchical_md'][:500])

    # Table of contents
    print("\n=== Table of Contents ===")
    print(example['sections_toc'])

    break
```
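
Because `sections_toc` stores one section per line with indentation marking the level, and `inference_info` is JSON, both can be turned into structured values with a few lines of Python. The sketch below is illustrative only: the indent width of two spaces, the helper name `parse_toc`, and the sample strings (including the `"note"` key) are assumptions, since the card does not specify the exact indentation or the JSON keys.

```python
import json


def parse_toc(toc_text, indent_width=2):
    """Turn an indented table of contents into (level, title) pairs.

    indent_width is an assumption: the dataset card does not state how many
    spaces (or tabs) represent one heading level.
    """
    entries = []
    for line in toc_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expanded = line.expandtabs(indent_width)
        indent = len(expanded) - len(expanded.lstrip(" "))
        entries.append((indent // indent_width + 1, expanded.strip()))
    return entries


# Hypothetical sample values standing in for example['sections_toc']
# and example['inference_info'].
sample_toc = "Introduction\n  Background\n  Scope\nMethods\n  Data\n    Sources"
sample_info = '{"note": "hypothetical metadata"}'

toc_entries = parse_toc(sample_toc)
for level, title in toc_entries:
    print("#" * level, title)

# inference_info is described as JSON; decode it only if it is still a string.
info = json.loads(sample_info) if isinstance(sample_info, str) else sample_info
print(sorted(info.keys()))
```

If the real data uses a different indent width, pass it as `indent_width`; inspecting a single row's `sections_toc` first is the safest way to check.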
Provided by:
baobabtech