L3Cube-HingCorpus

Name: L3Cube-HingCorpus
Creator: L3Cube, Pune
Published: 2022-04-19 00:49:59
License: No description available

Source: arXiv. Updated 2022-04-19; indexed 2024-06-21.

Download link:

https://github.com/l3cube-pune/code-mixed-nlp

Resource description:

L3Cube-HingCorpus is described as the first large-scale real Hindi-English code-mixed dataset. It was created by L3Cube in Pune, India, and contains 52.93 million sentences and 1.04 billion tokens collected from Twitter. The tweets were scraped with the Twint framework and then preprocessed to remove non-English characters and user mentions, which protects user privacy and keeps the data clean. The dataset is intended to address data scarcity in code-mixed natural language processing (NLP), particularly for pre-training large language models. Target applications include code-mixed sentiment analysis, part-of-speech (POS) tagging, named entity recognition (NER), and language identification.
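The preprocessing step described above (dropping user mentions and non-English characters) could look roughly like the following. This is a minimal sketch, not the authors' actual pipeline: the function name `clean_tweet`, the mention pattern, and the choice to treat "non-English characters" as anything outside printable ASCII are all assumptions made for illustration.

```python
import re

# Assumption: a user mention is "@" followed by word characters.
MENTION_RE = re.compile(r"@\w+")
# Assumption: "non-English characters" is approximated as anything
# outside the printable ASCII range (this also drops emoji and Devanagari).
NON_ASCII_RE = re.compile(r"[^\x20-\x7E]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove user mentions and non-ASCII characters, then collapse whitespace."""
    text = MENTION_RE.sub(" ", text)      # drop @handles first
    text = NON_ASCII_RE.sub(" ", text)    # drop characters outside printable ASCII
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_tweet("@rahul yaar this movie is बहुत accha!"))
```

Mentions are removed before the ASCII filter so that a handle containing non-ASCII characters is dropped as a whole rather than leaving a fragment behind.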

Provider:

L3Cube, Pune

Created:

2022-04-19
