vic35get/licCorpus
Source: Hugging Face | Updated: 2025-08-02 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/vic35get/licCorpus
Description:
Lic Corpus is a text dataset for the legal and public bidding domains. It contains bidding notices, public contracts, and related legislation, and is intended for pre-training language models, particularly for NLP tasks involving bidding and public contracts. The documents were collected from Comprasnet, TCE-PI, PNCP, and GOV BR and cover 2012 to 2023. Preprocessing included text extraction, format cleaning, and sample splitting to improve consistency and readability. The corpus totals 460,036 documents and approximately 1,492,319,518 tokens. The training set contains 9,727,103 samples and the test set contains 100,000 samples.
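The sample counts are much larger than the document count, which is consistent with the sample-splitting step mentioned above. A minimal sketch of the arithmetic follows. It assumes, without confirmation from the listing, that the token total covers both splits and that every sample was cut from one of the 460,036 documents; the function name `corpus_ratios` is hypothetical.

```python
# Figures quoted in the dataset description.
NUM_DOCUMENTS = 460_036
NUM_TOKENS = 1_492_319_518
TRAIN_SAMPLES = 9_727_103
TEST_SAMPLES = 100_000


def corpus_ratios(num_documents, num_tokens, train_samples, test_samples):
    """Derive rough per-document and per-sample averages.

    Assumption (not stated in the listing): the token total spans both
    splits, and all samples come from the counted documents.
    """
    total_samples = train_samples + test_samples
    return {
        "tokens_per_document": num_tokens / num_documents,
        "samples_per_document": total_samples / num_documents,
        "tokens_per_sample": num_tokens / total_samples,
        "test_fraction": test_samples / total_samples,
    }


if __name__ == "__main__":
    ratios = corpus_ratios(NUM_DOCUMENTS, NUM_TOKENS, TRAIN_SAMPLES, TEST_SAMPLES)
    for name, value in ratios.items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, each document averages a few thousand tokens and is split into roughly twenty samples, with the test set making up about one percent of all samples.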
Provider:
vic35get



