tmquan/anle-toaan-gov-vn
Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/tmquan/anle-toaan-gov-vn
Description:
Án lệ (Vietnamese Legal Precedents) is a dataset of 1,963 case decisions scraped from anle.toaan.gov.vn, the official Án lệ (legal precedent) portal of the Supreme People's Court of Vietnam. Each precedent is provided as a raw PDF, parsed markdown, a structured JSON record (entities, statute references, applied article, adoption date), a 2,048-dimensional dense embedding, and a 2-D projection (PCA / t-SNE / UMAP, plus an HDBSCAN cluster id). The corpus was produced end-to-end by a NeMo Curator pipeline in five stages: download → parse → extract → embed → reduce. The dataset exposes four configurations: parse (markdown body), extract (text plus structured legal extraction), embed (2,048-dimensional dense vectors), and reduce (pre-computed PCA / t-SNE / UMAP coordinates plus HDBSCAN cluster id). It is suitable for tasks such as text classification, text retrieval, sentence similarity, and feature extraction. The decisions span 1952 to 2025, with most of the volume after 2017, and cover civil, criminal, administrative, commercial, and family law, among other areas; most were issued at the appellate or cassation level.
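As a rough illustration of how the four configurations might be accessed, the sketch below wraps `load_dataset` from the Hugging Face `datasets` library. The repository id and configuration names come from the description above; the helper `load_config`, the split name `"train"`, and the assumption that each configuration is loadable by name are not confirmed by the source and should be checked against the dataset card.

```python
# Repository id and config names are taken from the listing above.
REPO_ID = "tmquan/anle-toaan-gov-vn"
CONFIGS = ("parse", "extract", "embed", "reduce")


def load_config(name: str, split: str = "train"):
    """Load one configuration of the dataset from the Hugging Face Hub.

    `load_config` is a hypothetical helper, and the default split "train"
    is an assumption; the dataset card lists the real splits and columns.
    """
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)
```

Calling `load_config("parse")` and printing the result would show the actual column names and row count, which is the safest way to discover the schema before relying on any particular field.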
Provider:
tmquan



