
turkish-nlp-suite/Corona-mini

Source: Hugging Face. Updated: 2023-09-20. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/Corona-mini
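Resource summary:
Corona-mini is a small Turkish corpus of comments about COVID-19 symptoms. The comments were collected from two Ekşisözlük titles; the corpus contains 178 raw comments and 175 processed comments. It is distributed in a raw version and a lightly processed version, in which HTML tags, expressions in brackets, and some other tags have been removed. The dataset is intended for summarization tasks and belongs to the social media domain.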

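Provider:
turkish-nlp-suite
Summary of original information

Dataset overview

Basic information

  • Name: Corona-mini
  • Language: Turkish
  • License: CC-BY-SA-4.0
  • Multilinguality: monolingual
  • Size: under 1K
  • Task type: summarization
  • Pretty name: Corona-mini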

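Dataset description

  • Domain: social media
  • Data source: two Ekşisözlük titles, "covid-19 belirtileri" and "gün gün koronavirüs belirtileri"
  • Data volume: 178 raw comments and 175 processed comments
  • Language: all comments are in Turkish
  • Versions: a raw version and a lightly processed version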
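
The light processing described above (removing HTML tags and bracketed expressions) can be sketched roughly as follows. This is only an illustration, not the authors' actual pipeline: the regular expressions, the reading of "brackets" as round parentheses, and the helper names `light_clean` and `clean_corpus` are all assumptions. The listing also does not explain why the processed version has 175 comments rather than 178, so dropping comments that end up empty is a design choice of this sketch.

```python
import re

# Assumed patterns; the listing does not give the authors' actual rules.
HTML_TAG = re.compile(r"<[^>]+>")          # e.g. <br/>, <a href="...">
PAREN_EXPR = re.compile(r"\([^()]*\)")     # e.g. (bkz: covid-19)
WHITESPACE = re.compile(r"\s+")


def light_clean(comment: str) -> str:
    """Strip HTML tags and parenthesized expressions, then tidy spacing."""
    text = HTML_TAG.sub(" ", comment)
    text = PAREN_EXPR.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def clean_corpus(comments):
    """Clean every comment and drop those that become empty (hypothetical rule)."""
    cleaned = (light_clean(c) for c in comments)
    return [c for c in cleaned if c]


raw = ["<br/>ateş (bkz: covid) ve öksürük", "<p></p>", "halsizlik"]
processed = clean_corpus(raw)
```

Because removed spans are replaced with a space, a comma or period that followed a bracketed expression may end up preceded by a space; a real pipeline would probably handle that case separately.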

Dataset instance

{
  "text": "beni sarsmayan belirtilerdir, 2 doz biontech aşılıyım, 2. doz üzerinden 5 aydan çok geçmişti cuma : ayın 12 si akşamı açık havada az üşümeye maruz kaldım."
}
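
A record of this shape can be read with Python's standard `json` module. The sketch below assumes the comments are stored as one JSON object per line; the listing does not state the storage format, and `read_records` is a hypothetical helper name.

```python
import json


def read_records(lines):
    """Parse JSON-lines input and return the 'text' field of each record."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Each instance is expected to be an object with a string "text" field.
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError("record is missing a string 'text' field")
        texts.append(record["text"])
    return texts


# Shortened sample line; the real comments are longer.
sample = ['{"text": "beni sarsmayan belirtilerdir"}', ""]
texts = read_records(sample)
```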

Data splits

Name           Training examples
Corona-mini    175

Citation information

  • Supported by: Google Developer Experts Program
  • Reference: A Diverse Set of Freely Available Linguistic Resources for Turkish
  • Citation format (BibTeX):

@inproceedings{altinok-2023-diverse,
    title = "A Diverse Set of Freely Available Linguistic Resources for {T}urkish",
    author = "Altinok, Duygu",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.768",
    pages = "13739--13750",
    abstract = "This study presents a diverse set of freely available linguistic resources for Turkish natural language processing, including corpora, pretrained models and education material. Although Turkish is spoken by a sizeable population of over 80 million people, Turkish linguistic resources for natural language processing remain scarce. In this study, we provide corpora to allow practitioners to build their own applications and pretrained models that would assist industry researchers in creating quick prototypes. The provided corpora include named entity recognition datasets of diverse genres, including Wikipedia articles and supplement products customer reviews. In addition, crawling e-commerce and movie reviews websites, we compiled several sentiment analysis datasets of different genres. Our linguistic resources for Turkish also include pretrained spaCy language models. To the best of our knowledge, our models are the first spaCy models trained for the Turkish language. Finally, we provide various types of education material, such as video tutorials and code examples, that can support the interested audience on practicing Turkish NLP. The advantages of our linguistic resources are three-fold: they are freely available, they are first of their kind, and they are easy to use in a broad range of implementations. Along with a thorough description of the resource creation process, we also explain the position of our resources in the Turkish NLP world.",
}