Itau-Unibanco/aroeira
Hugging Face | Updated 2025-02-11 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/Itau-Unibanco/aroeira
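Resource overview:
Aroeira is a curated corpus designed specifically for training large language models in Portuguese. Most existing research focuses on high-resource languages such as English and Chinese, while high-quality corpora for low-resource languages remain relatively scarce. The dataset aims to narrow this gap and advance state-of-the-art research in Portuguese. The corpus was developed by researchers at the Instituto de Ciência e Tecnologia Itaú-Unibanco (ICTi) and contains approximately 100 GB of data, 35 million documents, and 15 billion tokens. It is in JSONL format with four fields (text, URL, access date, and ID), has only a train split, and has a data cutoff of December 2023. It is intended for research use and is released under the cc-by-nc-4.0 license.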
Aroeira is a curated corpus designed for training large language models in Portuguese. Most existing research focuses on high-resource languages such as English and Chinese, and considerable effort has gone into developing multilingual corpora. However, there is a pressing need for large datasets in lower-resource languages. This work aims to narrow that gap and contribute to state-of-the-art research in Portuguese. The final corpus is the result of the combined work of researchers at the Instituto de Ciência e Tecnologia Itaú-Unibanco (ICTi), and the details of how it was built are described in the paper (paper link or name).

Corpus creation is divided into two main objectives: (i) collecting the data (Data Pipeline) and (ii) ensuring content safety (Content Safety Pipeline). The whole pipeline contains nine steps: data collection and sampling, text extraction, language identification, deduplication, and quality filters in the Data Pipeline; and a sexual content filter, toxic data filter, bias filter, and categorization in the Content Safety Pipeline.

Supported languages: Brazilian Portuguese (PT-BR) and European Portuguese (PT-PT). Dataset Release Date: October 2024. The current corpus version, released in October 2024, contains approximately 100 GB of data, 35 million documents, and 15 billion tokens. The corpus is published under the cc-by-nc-4.0 license. Aroeira is intended for research use in Portuguese-only or multilingual experimental setups. Use of the corpus to develop commercial products, or any other commercial use, is not allowed.

The corpus is stored as a JSONL (JSON Lines) file, in which each line contains all the information for the respective entry, divided into 4 fields. The data has a single split, train. Data Freshness: the available data has a cutoff of December 2023. To our knowledge, Aroeira is the largest dataset available for the Portuguese language.
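Since the corpus is described as a JSONL file with four fields per entry, a minimal reading sketch may help. The key names below (`text`, `url`, `date`, `id`) are assumptions based on the field descriptions (text, URL, access date, ID); the actual spelling in the released file may differ. The sample records are made up for illustration and are not taken from the corpus.

```python
import io
import json
from typing import Iterator, List, TextIO

# Assumed key names: the description only says there are four fields
# (text, URL, access date, ID), not how the keys are spelled.
ASSUMED_FIELDS = ("text", "url", "date", "id")


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON") from err


def missing_fields(record: dict) -> List[str]:
    """Return the assumed field names that are absent from a record."""
    return [name for name in ASSUMED_FIELDS if name not in record]


# Made-up sample standing in for a few lines of the real file.
SAMPLE = (
    '{"text": "Olá, mundo.", "url": "https://example.com/a", '
    '"date": "2023-11-02", "id": "0"}\n'
    "\n"
    '{"text": "Bom dia.", "url": "https://example.com/b", "id": "1"}\n'
)

records = list(iter_records(io.StringIO(SAMPLE)))
for record in records:
    print(record["id"], missing_fields(record))
```

Reading line by line keeps memory use flat, which matters for a file of roughly 100 GB; to process the real corpus, pass an open file handle to `iter_records` in place of the in-memory sample.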
Provider:
Itau-Unibanco

The above content was collected and summarized by 遇见数据集.



