turkish-nlp-suite/Corona-mini

Name: turkish-nlp-suite/Corona-mini
Creator: turkish-nlp-suite
Published: 2023-09-20 15:04:26
License: CC-BY-SA-4.0

Source: Hugging Face. Updated 2023-09-20; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/turkish-nlp-suite/Corona-mini

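Overview:

Corona-mini is a small Turkish corpus of comments about COVID-19 symptoms. The comments were collected from two Ekşisözlük headlines; the corpus contains 178 raw comments and 175 processed comments. The dataset is provided in a raw version and a lightly processed version; the processed version removes HTML tags, parenthesized expressions, and some other tags. The dataset is intended for summarization tasks and belongs to the social media domain.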

Provided by:

turkish-nlp-suite

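Summary of original information

Dataset overview

Basic information

Name: Corona-mini
Language: Turkish
License: CC-BY-SA-4.0
Multilinguality: monolingual
Size: fewer than 1K examples
Task type: summarization
Pretty name: Corona-mini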

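Dataset description

Domain: social media
Data source: two Ekşisözlük headlines, “covid-19 belirtileri” and “gün gün koronavirüs belirtileri”
Data volume: 178 raw comments and 175 processed comments
Language: all comments are in Turkish
Versions: a raw version and a lightly processed version are provided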
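The light processing described above (removing HTML tags and parenthesized expressions) can be approximated with a few regular expressions. The following is a minimal sketch only: the patterns, the function name `lightly_clean`, and the sample comment are assumptions made for illustration, since this listing does not give the dataset authors' actual cleaning rules or the "other tags" they removed.

```python
import re

# Assumption: these patterns only approximate the "light processing"
# described in the listing; the authors' real rules are not documented here.
HTML_TAG = re.compile(r"<[^>]+>")          # e.g. <br/>, <a href="...">
PARENTHESIZED = re.compile(r"\([^()]*\)")  # e.g. (bkz: ...), non-nested only
WHITESPACE = re.compile(r"\s+")


def lightly_clean(comment: str) -> str:
    """Strip HTML tags and parenthesized expressions, then collapse whitespace."""
    without_tags = HTML_TAG.sub(" ", comment)
    without_parens = PARENTHESIZED.sub(" ", without_tags)
    return WHITESPACE.sub(" ", without_parens).strip()


# Hypothetical raw comment, invented for this example (not taken from the dataset).
raw = "ilk gün <br/>boğaz ağrısı (bkz: covid-19 belirtileri) başladı"
cleaned = lightly_clean(raw)
print(cleaned)
```

Replacing each match with a space, rather than deleting it, avoids gluing neighboring words together; the final whitespace pass then removes the extra spaces.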

Dataset instance

{ "text": "beni sarsmayan belirtilerdir, 2 doz biontech aşılıyım, 2. doz üzerinden 5 aydan çok geçmişti cuma : ayın 12 si akşamı açık havada az üşümeye maruz kaldım." }
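Each instance is a JSON object with a single `text` field, as shown above. As a small sketch of reading such a record with Python's standard `json` module (the variable names are illustrative, and the record string is copied from the instance above):

```python
import json

# Record copied verbatim from the dataset instance shown in this listing.
record_line = (
    '{ "text": "beni sarsmayan belirtilerdir, 2 doz biontech aşılıyım, '
    '2. doz üzerinden 5 aydan çok geçmişti cuma : ayın 12 si akşamı '
    'açık havada az üşümeye maruz kaldım." }'
)

record = json.loads(record_line)  # parse the JSON object into a dict
comment = record["text"]          # the only field in the instance
tokens = comment.split()          # naive whitespace tokenization

print(f"{len(tokens)} whitespace-separated tokens")
```

Whitespace splitting is only a rough proxy for length; a Turkish-aware tokenizer would be a better choice for real preprocessing.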

Data splits

Name	Training examples
Corona-mini	175

Citation information

Supported by: Google Developer Experts Program
Reference: A Diverse Set of Freely Available Linguistic Resources for Turkish
Citation format (BibTeX):

@inproceedings{altinok-2023-diverse,
    title = "A Diverse Set of Freely Available Linguistic Resources for {T}urkish",
    author = "Altinok, Duygu",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.768",
    pages = "13739--13750",
    abstract = "This study presents a diverse set of freely available linguistic resources for Turkish natural language processing, including corpora, pretrained models and education material. Although Turkish is spoken by a sizeable population of over 80 million people, Turkish linguistic resources for natural language processing remain scarce. In this study, we provide corpora to allow practitioners to build their own applications and pretrained models that would assist industry researchers in creating quick prototypes. The provided corpora include named entity recognition datasets of diverse genres, including Wikipedia articles and supplement products customer reviews. In addition, crawling e-commerce and movie reviews websites, we compiled several sentiment analysis datasets of different genres. Our linguistic resources for Turkish also include pretrained spaCy language models. To the best of our knowledge, our models are the first spaCy models trained for the Turkish language. Finally, we provide various types of education material, such as video tutorials and code examples, that can support the interested audience on practicing Turkish NLP. The advantages of our linguistic resources are three-fold: they are freely available, they are first of their kind, and they are easy to use in a broad range of implementations. Along with a thorough description of the resource creation process, we also explain the position of our resources in the Turkish NLP world.",
}
