BABEL BRIEFINGS

Source: arXiv. Updated: 2024-03-28. Indexed: 2024-07-23.
Download link:
https://www.kaggle.com/datasets/felixludos/babel-briefings
Description:
BABEL BRIEFINGS is a large multilingual dataset of 4.7 million news headlines covering 30 languages and 54 locations worldwide, collected between August 2020 and November 2021. It was created at the Max Planck Institute for Intelligent Systems in Germany to support research in natural language processing and media studies. The dataset contains daily news headlines, the original-language articles, and their English translations. Data was collected daily through the News API, and non-English articles were translated with Google Translate. The dataset is suited to training language models, evaluating model performance, and analyzing global news coverage and cultural narratives, and it helps address the data-integration problems caused by language barriers.
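
The description above implies that each record carries a language, a location, an original headline, and (for non-English items) an English translation. As a minimal sketch of how such records might be summarized, the Python below counts headlines per language and picks an English text for each record. The field names (`language`, `location`, `title`, `en_title`) and the sample records are hypothetical, chosen for illustration only; they are not the dataset's documented schema and should be checked against the downloaded files.

```python
from collections import Counter

# Hypothetical records: the field names and values are assumptions made for
# illustration, not the dataset's documented schema.
SAMPLE = [
    {"language": "en", "location": "us", "title": "Markets rally", "en_title": None},
    {"language": "de", "location": "de", "title": "Märkte erholen sich", "en_title": "Markets recover"},
    {"language": "fr", "location": "fr", "title": "Les marchés rebondissent", "en_title": "Markets rebound"},
    {"language": "de", "location": "at", "title": "Wahl in Wien", "en_title": "Election in Vienna"},
]


def english_text(record):
    """Return the English translation if present, otherwise the original title."""
    return record.get("en_title") or record["title"]


def headlines_per_language(records):
    """Count how many headlines each language contributes."""
    return Counter(r["language"] for r in records)


counts = headlines_per_language(SAMPLE)
english = [english_text(r) for r in SAMPLE]
```

Falling back to the original title when no translation is present keeps English-language records usable in the same pipeline as translated ones.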
Provider:
Max Planck Institute for Intelligent Systems
Created:
2024-03-28