spanish-tweets

Source: ModelScope community (updated 2025-12-05, indexed 2025-08-30)
Download link: https://modelscope.cn/datasets/pysentimiento/spanish-tweets
# spanish-tweets

## A big corpus of tweets for pretraining embeddings and language models

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage**: https://github.com/pysentimiento/robertuito
- **Paper**: [RoBERTuito: a pre-trained language model for social media text in Spanish](https://aclanthology.org/2022.lrec-1.785/)
- **Point of Contact:** jmperez (at) dc.uba.ar

### Dataset Summary

A big dataset of (mostly) Spanish tweets for pre-training language models (or other representations).

### Supported Tasks and Leaderboards

Language Modeling

### Languages

Mostly Spanish, but some Portuguese, English, and other languages.

## Dataset Structure

### Data Fields

- *tweet_id*: id of the tweet
- *user_id*: id of the user
- *text*: text from the tweet

## Dataset Creation

The full data collection process is described in the paper. Here we roughly outline the main points:

- A Spritzer collection dating from May 2019, uploaded to Archive.org, was downloaded.
- From this, we kept only tweets whose language metadata was Spanish, and marked the users who posted them.
- Then, the timeline of each marked user was downloaded.

This corpus consists of 622M tweets from around 432K users.

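The collection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual pipeline: the `lang` key on the sampled tweets and the `fetch_timeline` callable are hypothetical stand-ins for the Spritzer language metadata and for whatever mechanism was used to download user timelines. Only the `tweet_id`, `user_id`, and `text` fields come from the dataset card.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Tweet:
    """One corpus record, using the three fields listed under Data Fields."""
    tweet_id: str
    user_id: str
    text: str


def mark_spanish_users(sample: Iterable[dict]) -> set[str]:
    """Return ids of users with at least one sampled tweet tagged as Spanish.

    Assumption: each sampled tweet is a dict with a `lang` key holding
    the language metadata (hypothetical key name).
    """
    return {t["user_id"] for t in sample if t.get("lang") == "es"}


def build_corpus(
    sample: Iterable[dict],
    fetch_timeline: Callable[[str], Iterable[dict]],
) -> Iterator[Tweet]:
    """Yield every tweet from the timeline of each marked user.

    `fetch_timeline` is a hypothetical callable mapping a user id to that
    user's tweets. No language filter is applied to the timelines, which
    is why non-Spanish tweets can end up in the corpus.
    """
    for user_id in sorted(mark_spanish_users(sample)):
        for raw in fetch_timeline(user_id):
            yield Tweet(raw["tweet_id"], raw["user_id"], raw["text"])
```

Note that the language check happens only when marking users, not when collecting their timelines, so a user with a single Spanish tweet in the sample contributes their whole timeline regardless of language.
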
Please note that we did not filter out tweets in other languages, so you might find English, Portuguese, Catalan, and other languages in the dataset (around 7-8% of the tweets are not in Spanish).

### Citation Information

```
@inproceedings{perez-etal-2022-robertuito,
    title = "{R}o{BERT}uito: a pre-trained language model for social media text in {S}panish",
    author = "P{\'e}rez, Juan Manuel and
      Furman, Dami{\'a}n Ariel and
      Alonso Alemany, Laura and
      Luque, Franco M.",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.785",
    pages = "7235--7243",
    abstract = "Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.",
}
```
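Because the corpus is not language-filtered, roughly 44 to 50 million of the 622M tweets (7-8%) are expected to be in a language other than Spanish. If a Spanish-only subset is needed, a filter along the following lines could be applied while iterating over records. This is a minimal sketch: `detect` is a hypothetical callable supplied by the user (any language identifier that maps a string to a language code), and `keep_language` is an illustrative helper, not part of the dataset or of any library.

```python
from typing import Callable, Dict, Iterable, Iterator


def keep_language(
    records: Iterable[Dict[str, str]],
    detect: Callable[[str], str],
    target: str = "es",
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose text is detected as `target`.

    Each record is assumed to carry the `text` field described under
    Data Fields. `detect` is a user-supplied (hypothetical) function
    returning a language code such as "es" or "en".
    """
    for record in records:
        if detect(record["text"]) == target:
            yield record
```

Passing the detector in as an argument keeps the sketch independent of any particular language-identification tool; the quality of the resulting subset depends entirely on the detector chosen, and short or code-switched tweets are likely to be the hardest cases.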

Provider: maas
Created: 2025-08-29