
L3Cube-HingCorpus

Source: arXiv; updated 2022-04-19; indexed 2024-06-21
Download link:
https://github.com/l3cube-pune/code-mixed-nlp
Overview:
L3Cube-HingCorpus is the first large-scale real Hindi-English code-mixed dataset, created by L3Cube in Pune, India. It contains 52.93 million sentences and 1.04 billion tokens collected from Twitter. The data was scraped with the Twint framework and then preprocessed to remove non-English characters and user mentions, which protects user privacy and improves data accuracy. The dataset was created to address data scarcity in code-mixed natural language processing (NLP), particularly for pre-training large language models. Its applications include code-mixed sentiment analysis, part-of-speech (POS) tagging, named entity recognition (NER), and language identification, with the goal of improving performance on these tasks.
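
The preprocessing step described above (removing user mentions and non-English characters) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `clean_tweet` is hypothetical, and treating "non-English characters" as "any non-ASCII character" is an assumption made here for illustration only.

```python
import re

# Assumption: a user mention is "@" followed by word characters.
MENTION_RE = re.compile(r"@\w+")
# Assumption: "non-English characters" is approximated as any non-ASCII run
# (this drops Devanagari text and emoji alike).
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Hypothetical cleaner: strip mentions and non-ASCII characters."""
    text = MENTION_RE.sub(" ", text)      # remove user mentions
    text = NON_ASCII_RE.sub(" ", text)    # remove non-ASCII characters
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse leftover spaces


if __name__ == "__main__":
    print(clean_tweet("@rahul yaar movie bahut अच्छी thi 😀 loved it"))
```

Replacing removed spans with a space and then collapsing whitespace keeps neighbouring words from being glued together when a mention or emoji sits between them.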
Provider:
L3Cube, Pune
Created:
2022-04-19