Indo4B

arXiv | Updated 2020-10-08 | Indexed 2024-06-21
Download link:
https://indobenchmark.com/
Resource overview:
Indo4B is a large-scale Indonesian-language dataset created by Institut Teknologi Indonesia to support research on Indonesian natural language understanding (IndoNLU). It contains approximately 4 billion words drawn from publicly available sources, including social media text, blogs, news articles, and websites. The dataset was built by collecting data from multiple existing works and standardizing it to ensure reproducibility. It is intended for tasks of varying complexity, ranging from single-sentence classification to sentence-pair sequence labeling, and is designed to address the scarcity of Indonesian-language resources in natural language processing.
Provider:
Institut Teknologi Indonesia
Created:
2020-09-11