The Nordic Pile

Source: arXiv. Updated 2023-03-30; indexed 2024-08-06.
Download link:
http://arxiv.org/abs/2303.17183v1
Overview:
The Nordic Pile is a high-quality multilingual dataset containing 1.2 TB of text. It was created by AI Sweden to support the development of large language models (LLMs) for the Nordic languages (Danish, Icelandic, Norwegian, and Swedish) as well as English. The dataset covers a range of linguistic styles and usage scenarios, including academic articles, books, code, discussion forums, and mathematical problems. The data was collected from multiple sources, including datasets derived from Common Crawl and other publicly available preprocessed datasets, to give the corpus diversity and breadth. The collected text was then cleaned and filtered, with the aim of building high-performing Nordic LLMs for Nordic language processing tasks.
Provided by:
AI Sweden
Created:
2023-03-30