vic35get/licCorpus

Name: vic35get/licCorpus
Creator: vic35get
Published: 2025-08-02 14:36:49
License: no description available

Source: Hugging Face
Updated: 2025-08-02
Indexed: 2025-10-25

Download link:

https://hf-mirror.com/datasets/vic35get/licCorpus

Description:

Lic Corpus is a text dataset for the legal and bidding domains. It contains bidding notices, public contracts, and related legislation, and is intended for pre-training language models, particularly for NLP tasks involving bidding and public contracts. The documents come from Comprasnet, TCE-PI, PNCP, and GOV BR and span 2012 to 2023. Preprocessing included text extraction, format cleaning, and sample splitting to improve consistency and readability. The corpus comprises 460,036 documents totaling approximately 1,492,319,518 tokens. The training set contains 9,727,103 samples and the test set contains 100,000 samples.
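The reported counts imply some rough averages, such as tokens per document and samples per document. The following is a minimal sketch of that arithmetic, not part of the dataset's own tooling. It assumes, which the description does not state, that the training and test samples together cover every document and token in the corpus. The function name `corpus_ratios` is hypothetical.

```python
# Figures as reported in the dataset description.
DOCUMENTS = 460_036
TOKENS = 1_492_319_518
TRAIN_SAMPLES = 9_727_103
TEST_SAMPLES = 100_000


def corpus_ratios() -> dict[str, float]:
    """Derive rough averages from the reported counts.

    Assumption (not stated in the source): the train and test samples
    together cover every document and every token in the corpus.
    """
    total_samples = TRAIN_SAMPLES + TEST_SAMPLES
    return {
        "tokens_per_document": TOKENS / DOCUMENTS,
        "samples_per_document": total_samples / DOCUMENTS,
        "tokens_per_sample": TOKENS / total_samples,
        "test_fraction": TEST_SAMPLES / total_samples,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.4f}")
```

Under that assumption, each document is split into a few dozen samples on average, which is consistent with the sample-splitting step mentioned in the description.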

Provider:

vic35get
