
NaverHustQA/TVPL

Source: Hugging Face. Updated: 2025-02-14. Indexed: 2025-11-01.
Download link:
https://hf-mirror.com/datasets/NaverHustQA/TVPL
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- vi
tags:
- legal
size_categories:
- 1B<n<10B
---

# TVPL (thuvienphapluat.vn)

2. `structured_data_doc.parquet`: preprocessed version containing only the needed documents from TVPL. Please read it with the Datasets library.
3. `parent_nodes.parquet`: parent nodes from [1], produced by chunking with `SentenceSplitter, chunk_overlap=0, chunk_size=800, tokenizer="Viet-Mistral/Vistral-7B-Chat"`.
4. `child_nodes.parquet`: child nodes from [2], produced by chunking with `SentenceSplitter, chunk_overlap=30, chunk_size=190` and using word segmentation, vietnamese-bi-encoder.

# Dedup

WARNING: IF YOU TRAIN, PLEASE ONLY CONSIDER `dedup/filtered_corpus.parquet` and `dedup/newtraintestdivide/filtered_corpus.parquet`.

1. `filtered_corpus.parquet`: merged dataset [SFT-Law]+[TVPL-structured]+[Zalo-corpus]. Clustered and then filtered. Items with the longest text are kept.
   - `text`: str
   - `oid`: the item's unique ID among the 3 merged datasets, int
   - `__cluster__`: int
   - `dataset`: original dataset from which the item was extracted, str
   - `__cluster_member__`: `oid`s of the cluster members from the original 3 datasets

   Example:

   ```
   {
     "text": "Thông tư này hướng dẫn tuần...",
     "dataset": "zalo_legal_corpus",
     "oid": 0,
     "__cluster__": 0
   }
   ```

2. The `newtraintestdivide` folder contains the newly divided SFT_law with 10k test items. `oid` and `__cluster__` have the same meaning as in `filtered_corpus.parquet`.
3. `data_remapped/{file_name}.parquet`: data files taken from other repositories, with 2 fields added: `oid` (the item's unique ID among the 3 merged datasets, int) and `__cluster__` (cluster ID, int).
   - sft_test+sft_train, taken from [SFT-Law]

     ```
     {
       "reference": [
         "https://thuvienphapluat.vn/..."
       ],
       "answer": "Sinh viên học nghệ thuật ca trù ....",
       "question": "Sinh viên học nghệ thuật ca trù tại ...?",
       "domain": ["Tài chính nhà nước"],
       "text": "Căn cứ tại khoản 1 Điều 3 T....",
       "oid": 61425,
       "__cluster__": 219110
     }
     ```

   - tvpl_dataset, taken from [TVPL-structured] (only the `structured_data_doc.parquet` file)

     ```
     {
       "meta_data": {
         "base": "Căn cứ Luật Tổ chức Chính phủ ngày 19 tháng 6 năm ...",
         "content": "QUY ĐỊNH XỬ PHẠT VI PHẠM HÀNH CHÍNH...",
         "date": "31/12/2021",
         "department": "Chính phủ",
         "doc_type": "Nghị định",
         "file_name": "Decree_No._139_2021_ND-CP_dated_December_31,_2021_.json",
         "id_doc": "139/2021/ND-CP",
         "location": "Hà Nội",
         "title": "Decree No. 139/2021/ND-CP dated December 31, 2021 on Administrative penalties for inland waterway navigation violations",
         "updated": 1710417820
       },
       "child_data": [
         {
           "__cluster__": 253403,
           "header": [
             "Decree No. 139/2021/ND-CP dated...",
             "Chương I. NHỮNG QUY ĐỊNH CHUNG"
           ],
           "len_tokenizer": 157,
           "lower_segmented_text": "điều 1 . phạm_vi điều_chỉnh...",
           "oid": 232323,
           "pointer_link": ["Chương I", "Điều 1"],
           "text": "Điều 1. Phạm vi điều chỉnh\n1. ..."
         }
       ]
     }
     ```

   - zalo_legal_corpus, taken from [Zalo-corpus] (only the `legal_corpus.json` file). `law_id` is appended to each article.

     ```
     {
       "article_id": "1",
       "text": "Thông tư này hướng dẫn tuần tra,...",
       "title": "Điều 1. Phạm vi áp dụng",
       "law_id": "01/2009/tt-bnn",
       "oid": 0,
       "__cluster__": 0
     }
     ```

Please refer to the corresponding repositories for documentation on the data.

[SFT-Law]: https://huggingface.co/datasets/bkai-foundation-models/SFT-Law
[TVPL-structured]: https://huggingface.co/datasets/bkai-foundation-models/TVPL/blob/main/structured_data_doc.parquet
[Zalo-corpus]: https://huggingface.co/datasets/bkai-foundation-models/zalo_legal_2021/blob/main/original/legal_corpus.json
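The Dedup section says that items are clustered and that items with the longest text are kept. The following is a minimal sketch of that selection step only, not the authors' actual pipeline: it assumes cluster IDs are already assigned, that "longest" means character length, and that `__cluster_member__` is a list of `oid` values. The function name `keep_longest_per_cluster` and the toy rows are hypothetical.

```python
from collections import defaultdict


def keep_longest_per_cluster(items):
    """Return one item per `__cluster__`: the one with the longest `text`.

    Assumption: length is measured in characters, and ties keep the
    first item seen. The source does not specify either detail.
    """
    best = {}                    # cluster id -> longest item seen so far
    members = defaultdict(list)  # cluster id -> oids of all items in it
    for item in items:
        cid = item["__cluster__"]
        members[cid].append(item["oid"])
        current = best.get(cid)
        if current is None or len(item["text"]) > len(current["text"]):
            best[cid] = item

    kept = []
    for cid, item in best.items():
        row = dict(item)  # copy so the input rows are not mutated
        # Assumed layout: a list with the oids of every cluster member.
        row["__cluster_member__"] = members[cid]
        kept.append(row)
    return kept


# Made-up rows that only mimic the documented fields.
rows = [
    {"oid": 0, "__cluster__": 0, "dataset": "zalo_legal_corpus",
     "text": "Thông tư này hướng dẫn"},
    {"oid": 1, "__cluster__": 0, "dataset": "tvpl_dataset",
     "text": "Thông tư này hướng dẫn tuần tra"},
    {"oid": 2, "__cluster__": 1, "dataset": "sft_train",
     "text": "Căn cứ tại khoản 1"},
]
kept = keep_longest_per_cluster(rows)
```

In this toy input, cluster 0 holds two near-duplicate items, so only the longer one survives and its `__cluster_member__` records both `oid` values. On the real files the same grouping would more likely be done with a dataframe library after reading the parquet file.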
Provider:
NaverHustQA