NLPnorth/snakmodel-pretraining-data-v0.1
Source: Hugging Face. Updated 2025-04-05. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/NLPnorth/snakmodel-pretraining-data-v0.1
Description:
SnakModel is a 7B-parameter autoregressive language model designed specifically for Danish and Danish-language tasks. It is based on Llama 2 and was developed jointly by two research units, NLPnorth and AAU-NLP. This dataset is its pretraining data: each record carries the text and its source, for a total of about 430 million training samples. During preprocessing, the DaNewsroom and FTSpeech portions, whose sources are unclear, were removed.
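The source-based removal described above can be illustrated with a minimal sketch. It assumes each record is a dictionary with `text` and `source` keys and that the excluded subsets are labelled exactly "DaNewsroom" and "FTSpeech"; the field names, label strings, sample records, and the helper `drop_excluded` are all hypothetical and not taken from the dataset itself.

```python
from collections import Counter

# Hypothetical sample records; the "text" and "source" field names are assumptions.
records = [
    {"text": "Eksempel et.", "source": "DaNewsroom"},
    {"text": "Eksempel to.", "source": "FTSpeech"},
    {"text": "Eksempel tre.", "source": "other-corpus"},
]

# Source labels to exclude; the exact label strings are assumptions.
EXCLUDED_SOURCES = {"DaNewsroom", "FTSpeech"}


def drop_excluded(rows, excluded=EXCLUDED_SOURCES):
    """Return only the rows whose source is not in the excluded set."""
    return [row for row in rows if row["source"] not in excluded]


kept = drop_excluded(records)

# Count the remaining records per source as a quick sanity check.
source_counts = Counter(row["source"] for row in kept)
```

On the real data, the same predicate could be applied while streaming records rather than building a list in memory, since the full set is reported to hold about 430 million samples.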
Provider:
NLPnorth



