NbAiLab/NCC

Name: NbAiLab/NCC
Creator: NbAiLab
Published: 2025-03-10 12:11:02
License: No description available

Hugging Face. Updated 2025-03-10. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/NbAiLab/NCC

Description:

The Norwegian Colossal Corpus (NCC) is a collection of several smaller Norwegian corpora suitable for training large language models. After extensive cleaning, the datasets have been made available in a common format. The NCC currently totals 30 GB. The cleaning is optimized for encoder models; if you are building a decoder model, stricter cleaning is usually recommended. NCC includes metadata such as source, publication year, and language confidence to support that further cleaning.
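The stricter, metadata-based cleaning suggested for decoder models could look roughly like the sketch below. It is only an illustration: the field names `lang_fasttext_conf`, `publish_year`, and `doc_type`, the function name `stricter_filter`, and the threshold values are assumptions and should be checked against the actual dataset schema before use.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Set

Record = Dict[str, Any]


def stricter_filter(
    records: Iterable[Record],
    min_lang_conf: float = 0.9,
    min_year: Optional[int] = None,
    allowed_doc_types: Optional[Set[str]] = None,
) -> Iterator[Record]:
    """Yield only the records that pass the metadata checks.

    Assumed (hypothetical) fields per record:
      - "lang_fasttext_conf": language confidence, number or numeric string
      - "publish_year": publication year as an integer
      - "doc_type": label describing the source of the document
    """
    for rec in records:
        # Treat a missing or malformed confidence value as 0.0 so that
        # such records are dropped rather than silently kept.
        try:
            conf = float(rec.get("lang_fasttext_conf", 0.0))
        except (TypeError, ValueError):
            conf = 0.0
        if conf < min_lang_conf:
            continue

        # Optional cutoff on publication year; records without a year are dropped.
        if min_year is not None:
            year = rec.get("publish_year")
            if not isinstance(year, int) or year < min_year:
                continue

        # Optional allow-list of source types.
        if allowed_doc_types is not None and rec.get("doc_type") not in allowed_doc_types:
            continue

        yield rec


# Usage with a few hand-made records (not real NCC data):
sample = [
    {"text": "a", "doc_type": "newspaper", "publish_year": 2015, "lang_fasttext_conf": "0.98"},
    {"text": "b", "doc_type": "newspaper", "publish_year": 1890, "lang_fasttext_conf": "0.99"},
    {"text": "c", "doc_type": "web", "publish_year": 2020, "lang_fasttext_conf": "0.40"},
]
kept = list(stricter_filter(sample, min_lang_conf=0.9, min_year=1900))
```

Because the function accepts any iterable of dictionaries, it could in principle be applied to records streamed with the Hugging Face `datasets` library, for example `load_dataset("NbAiLab/NCC", split="train", streaming=True)`, assuming the split name and fields match.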

Provided by:

NbAiLab
