AlirezaF138/LSCP-Dataset

Name: AlirezaF138/LSCP-Dataset
Creator: AlirezaF138
Published: 2024-10-31 10:14:56
License: CC BY-NC-ND 4.0

Source: Hugging Face. Updated 2024-10-31; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/AlirezaF138/LSCP-Dataset

Resource summary:

The Enhanced Large Scale Colloquial Persian Language Understanding (LSCP) dataset is a large-scale corpus for processing informal Persian, aimed at the challenges NLP faces in low-resource languages. It contains 120 million sentences derived from 27 million Persian tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations into English, German, Czech, Italian, and Hindi. The current version of the dataset includes only the Persian and English portions; the original dataset includes the additional languages.

The dataset focuses on features of colloquial Persian such as informal abbreviations, limited vocabulary, and phonetic variation. The data was collected through the Twitter API and annotated in two stages: automatic annotation followed by manual verification. It is released under the CC BY-NC-ND 4.0 license, which permits non-commercial use and sharing with attribution but prohibits distributing modified versions.
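A corpus of this size is usually easier to inspect by streaming a few records than by downloading everything. The following is a minimal sketch, assuming the third-party `datasets` library is installed and that the repository exposes a split named "train". The split name and the helper names `dataset_page_url` and `preview_lscp` are illustrative assumptions and are not taken from the dataset card.

```python
import itertools

REPO_ID = "AlirezaF138/LSCP-Dataset"  # repository ID from this listing


def dataset_page_url(repo_id, endpoint="https://huggingface.co"):
    """Build the dataset page URL for a Hub endpoint or a mirror."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def preview_lscp(n=3, split="train"):
    """Return the first n records without downloading the full corpus.

    Assumption: a split named "train" exists; adjust it if it does not.
    """
    # Imported lazily so the URL helper works without the library installed.
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True yields records on demand instead of fetching all files.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(itertools.islice(stream, n))
```

Calling `preview_lscp(3)` needs network access. The field names in the returned records depend on how the repository is laid out, so inspect the first record before relying on any particular column.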

Provider:

AlirezaF138
