D104_million_Southeast_Asian_language_news_text_dataset

Source: OpenCSG. Updated: 2026-03-04. Indexed: 2026-03-14.
Download link:
https://www.opencsg.com/datasets/DatatangBeijing/D104_million_Southeast_Asian_language_news_text_dataset
Description:
This multilingual news dataset for Southeast Asia covers four languages: Indonesian, Malay, Thai, and Vietnamese. It contains more than 31 million records stored in JSONL format, one record per line, which makes the data straightforward to read and process efficiently. The records are drawn from a wide range of sources and news topics, reflecting social developments, cultural trends, and economic trends in Southeast Asia. The dataset is intended to help large language models (LLMs) improve their multilingual capabilities and regional cultural knowledge, and to support industry applications in Southeast Asia and cross-lingual research.
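
Because each record is described as an independent line, the file can be read one line at a time without loading it all into memory. The following is a minimal sketch of that reading pattern in Python, not the publisher's loader. The field names `language` and `text` in the sample are hypothetical, since the listing does not document the record schema.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


# Hypothetical sample: the real field names are not given in the listing.
sample = io.StringIO(
    '{"language": "id", "text": "Contoh berita."}\n'
    '{"language": "vi", "text": "Tin mau."}\n'
)

for record in iter_jsonl(sample):
    print(sorted(record.keys()))
```

For a file on disk, the same function can be passed a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the downloaded file was saved.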
Created:
2026-03-04