NbAiLab/NCC

Source: Hugging Face. Updated 2025-03-10. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NbAiLab/NCC
Description:
The Norwegian Colossal Corpus (NCC) is a collection of several smaller Norwegian corpora suitable for training large language models. After extensive cleaning, the datasets are published in a common format. The total size of the NCC is currently 30 GB. The cleaning is optimized for encoder models; if you are building a decoder model, stricter cleaning is usually recommended. NCC includes metadata such as source, publication year, and language confidence to support that additional cleaning.
Provider:
NbAiLab
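
The description notes that the per-document metadata can drive stricter cleaning for decoder models. A minimal sketch of such a filter follows, working on plain Python dictionaries. The field names `lang_fasttext_conf` and `publish_year`, the thresholds, and the helper `strict_filter` are illustrative assumptions, not taken from this page; check the dataset card for the actual schema before using them.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# Assumed metadata field names (hypothetical; verify against the dataset card).
CONF_FIELD = "lang_fasttext_conf"
YEAR_FIELD = "publish_year"


def strict_filter(
    records: Iterable[Dict[str, Any]],
    min_conf: float = 0.9,
    min_year: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield only records that pass stricter metadata thresholds."""
    for rec in records:
        # The confidence may be stored as a string or a float; skip records
        # where it is missing or cannot be parsed.
        try:
            conf = float(rec.get(CONF_FIELD))
        except (TypeError, ValueError):
            continue
        if conf < min_conf:
            continue
        # Optionally drop documents older than min_year or without a year.
        if min_year is not None:
            year = rec.get(YEAR_FIELD)
            if not isinstance(year, int) or year < min_year:
                continue
        yield rec


# Tiny in-memory sample standing in for real corpus records.
sample = [
    {"id": "a", "publish_year": 2019, "lang_fasttext_conf": "0.98"},
    {"id": "b", "publish_year": 1890, "lang_fasttext_conf": "0.99"},
    {"id": "c", "publish_year": 2021, "lang_fasttext_conf": "0.41"},
    {"id": "d", "publish_year": 2020},
]
kept_ids = [r["id"] for r in strict_filter(sample, min_conf=0.9, min_year=1900)]
```

Because the function accepts any iterable of dictionaries, the same filter could be applied lazily to a streamed copy of the corpus rather than loading all 30 GB into memory.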