tedbelford/OTA-dataset

Name: tedbelford/OTA-dataset
Creator: tedbelford
Published: 2025-12-09 07:38:02
License: No description provided

Hugging Face · Updated 2025-12-09 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/tedbelford/OTA-dataset

Resource description:

---
license: mit
language:
- vi
size_categories:
- 1K<n<10K
---

## Dataset Details

### Dataset Description

The Vietnamese Administrative Instruction Dataset is a large-scale benchmark dataset designed for the task of Organization-Task relationship extraction from Vietnamese administrative documents. It addresses the specific challenges of extracting direct responsibilities (“who does what”) from high-density, complex hierarchical documents.

The dataset focuses on "directives" (*Chi thị*), "decisions," and "official dispatches" that exhibit high structural variability and textual density. It was created to facilitate the development of hybrid information extraction pipelines that combine OCR with Large Language Model (LLM) reasoning.

### Dataset Sources

* **Paper:** Contextual Grounding and Iterative Refinement: A Hybrid Framework for Reliable Organization-Task Extraction in Vietnamese Administrative Documents.
* **Source Data Origin:** National Database of Legal Normative Documents (VBPL - *Van ban phap luat*).

## Uses

### Direct Use

* **Relation Extraction:** Extracting pairs of (Organization, Task) where the organization is explicitly assigned direct execution responsibility.
* **OCR Benchmarking:** Testing Optical Character Recognition (OCR) systems on high-density administrative layouts with complex hierarchies and tonal marks.
* **LLM Grounding:** Evaluating the ability of Large Language Models to avoid hallucination by anchoring outputs to specific text spans.

### Out-of-Scope Use

* **Indirect Stakeholder Identification:** The dataset excludes indirect stakeholders or entities that only receive task outcomes.
* **General Domain Text:** Models trained on this dataset are specialized for legal-administrative text and may not generalize to other domains.

## Dataset Structure

The dataset consists of PDF documents and their corresponding annotated Organization-Task relationships.

### Statistics

* **Total Documents:** 1,409 PDF files.
* **Total Pages:** 5,812 pages.
* **Total Data Size:** 217.82 MB.
* **Total Text Volume:** ~1.95 million words.
* **Total Annotations:**
  * 18,368 total task entries.
  * 18,359 unique (Organization, Task) pairs.
* **Entity Statistics:**
  * **Unique Organizations:** 10,458.
  * **Unique Action Verbs:** 4,554.

### Document Morphology

* **Length:** Documents range from single-page notices to reports up to 23 pages, with an average length of 4.12 pages.
* **Density:** The average word count is 448.1 words per page, with peak cases reaching 2,291 words per page.
* **Vertical Coverage:** Mean page coverage is 77.42%, indicating dense layouts with minimal whitespace.

## Dataset Creation

### Curation Rationale

Administrative documents are the backbone of organizational operations, yet extracting “who does what” remains a manual bottleneck. Existing resources for the Vietnamese legal-administrative domain are scarce. This dataset was created to fill that gap and address common failure modes of LLMs such as hallucination and grounding loss when processing long, dense administrative texts.

### Source Data

#### Data Collection and Processing

The corpus was curated from the official National Database of Legal Normative Documents (vbpl.vn), covering a temporal range from 1955 to 2015.

* **Format:** PDF files of directive documents (*Chi thị*).
* **Segmentation:** The data was split into 194 disjoint sets, normalized by page count (approximately 30–40 pages per set) to manage annotator workload.
* **Physical Standardization:** Pages were standardized to dimensions of 612 × 792 pixels.
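
The segmentation step above can be illustrated with a short sketch. The card does not say how the 194 sets were actually built, so the greedy strategy, the function name `split_into_sets`, the parameters `min_pages` and `max_pages`, and the toy file names and page counts below are all assumptions made for illustration only.

```python
from typing import Dict, List


def split_into_sets(
    page_counts: Dict[str, int], min_pages: int = 30, max_pages: int = 40
) -> List[List[str]]:
    """Greedily group whole documents into disjoint sets of roughly 30-40 pages.

    Hypothetical helper: the dataset card does not describe the real procedure.
    """
    sets: List[List[str]] = []
    current: List[str] = []
    current_pages = 0
    for doc_id, pages in page_counts.items():
        # Close the open set first if this document would push it past the cap.
        if current and current_pages + pages > max_pages:
            sets.append(current)
            current, current_pages = [], 0
        current.append(doc_id)
        current_pages += pages
        # Close the set as soon as it reaches the lower page target.
        if current_pages >= min_pages:
            sets.append(current)
            current, current_pages = [], 0
    if current:
        # Leftover documents form a final, possibly smaller, set.
        sets.append(current)
    return sets


# Toy input with invented file names and page counts (not taken from the dataset).
toy = {"a.pdf": 12, "b.pdf": 23, "c.pdf": 4, "d.pdf": 20, "e.pdf": 9, "f.pdf": 5}
batches = split_into_sets(toy)
```

Each document is assigned to exactly one set, so the sets are disjoint by construction. As a rough consistency check on the published figures, 5,812 pages spread over 194 sets averages about 30 pages per set, which sits at the lower end of the stated 30–40 range.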


Provider:

tedbelford
