RoBERTuito Training Dataset

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/finiteautomata/spritzer-tweets

Description:

This dataset contains approximately 500 million tweets from approximately 432,000 users, filtered from a larger corpus for pre-training the RoBERTuito model. The collection targets Spanish but is multilingual in practice: roughly 92% of the tweets are Spanish, 4% English, and 3% Portuguese, which exposes the model to code-mixed text. Its intended use is pre-training language models for social media text.
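To give a feel for the scale, the rounded figures in the description can be turned into approximate per-language tweet counts and an average number of tweets per user. This is only a back-of-the-envelope sketch: the constants are the approximate values quoted above, and the variable and function names are illustrative, not part of the dataset or its repository.

```python
# Approximate figures quoted in the description (all rounded in the source).
TOTAL_TWEETS = 500_000_000
TOTAL_USERS = 432_000
LANGUAGE_PERCENT = {"es": 92, "en": 4, "pt": 3}  # ISO 639-1 codes, our labels


def approximate_counts(total, percents):
    """Return the approximate number of tweets per language."""
    return {lang: total * pct // 100 for lang, pct in percents.items()}


counts = approximate_counts(TOTAL_TWEETS, LANGUAGE_PERCENT)

# Share not attributed to the three listed languages.
other_percent = 100 - sum(LANGUAGE_PERCENT.values())

# Average tweets per user implied by the two totals.
tweets_per_user = TOTAL_TWEETS / TOTAL_USERS

for lang, n in counts.items():
    print(f"{lang}: ~{n:,} tweets")
print(f"other: ~{other_percent}% of tweets")
print(f"~{tweets_per_user:.0f} tweets per user on average")
```

Because the source percentages are rounded, the derived counts are order-of-magnitude estimates rather than exact corpus statistics.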
