Twitter LID Datasets
Source: arXiv | Updated: 2019-10-15 | Indexed: 2024-06-21
Download link:
https://github.com/duytinvo/LID_NN
Description:
This study builds a large-scale language identification (LID) dataset from Twitter data, containing more than 18 million annotated tweets in 54 languages. The dataset is constructed through automatic annotation and is intended to serve as a standardized benchmark for language identification systems. It comes in three scales (small, medium, and large), each balanced over a different number of languages and tweets to suit the training needs of different deep learning models. Its main purpose is language identification for short social media texts, particularly noisy and mixed-language text.
Provider:
Not specified
Created:
2019-10-15



