
leey4n/KR3

Source: Hugging Face. Updated 2023-07-19. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/leey4n/KR3
Resource description:
---
annotations_creators: []
language_creators: []
language:
- ko
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
pretty_name: KR3
size_categories:
- 100K<n<1m
source_datasets: []
task_categories:
- text-classification
task_ids:
- sentiment-classification
---

### KR3: Korean Restaurant Reviews with Ratings

Korean sentiment classification dataset

- Size: 460K(+180K)
- Language: Korean-centric

### ⚠️ Caution with `Rating` Column

0 stands for a negative review, 1 stands for a positive review, and 2 stands for an ambiguous review. **Note that rating 2 is not intended to be used directly for supervised learning (classification).** These reviews are included for additional pre-training or other uses. In other words, this dataset is essentially a **binary** sentiment classification task whose labels are 0 and 1.

### 🔍 See More

See the code for crawling and preprocessing the dataset, and for the experiments with KR3, in the [GitHub Repo](https://github.com/Wittgensteinian/kr3).
The dataset is also available as a [Kaggle Dataset](https://www.kaggle.com/ninetyninenewton/kr3-korean-restaurant-reviews-with-ratings).

### Usage

```python
from datasets import load_dataset

kr3 = load_dataset("leey4n/KR3", name='kr3', split='train')
# The original file didn't include this column; it is suspected to be a Hugging Face issue.
kr3 = kr3.remove_columns(['__index_level_0__'])
```

```python
# drop reviews with ambiguous label
kr3_binary = kr3.filter(lambda example: example['Rating'] != 2)
```

### License

**CC BY-NC-SA 4.0**

### Legal Issues

We concluded that the **non-commercial usage and release of KR3 fall into the range of fair use (공정 이용)** stated in the Korean copyright act (저작권법). We further clarify that we **did not agree to the terms of service** of any website that might prohibit web crawling. In other words, the web crawling we did was carried out without logging in to the websites. Nevertheless, feel free to contact any of the contributors if you notice any legal issues.

### Contributors & Acknowledgement

(Alphabetical order)

- [Dongin Jung](https://github.com/dongin1009)
- [Hyunwoo Kwak](https://github.com/Kwak-Hyun-woo)
- [Kaeun Lee](https://github.com/Kaeun-Lee)
- [Yejoon Lee](https://github.com/wittgensteinian)

This work was done as DIYA 4기 (the 4th cohort). Compute resources needed for the work were supported by [DIYA](https://blog.diyaml.com) and surromind.ai.
Provider:
leey4n
Summary of original information

Dataset overview

Basic information

  • Name: KR3
  • Language: Korean
  • License: CC BY-NC-SA 4.0
  • Multilinguality: monolingual
  • Size: 460K(+180K), in the 100K<n<1m size category

Task type

  • Task category: text classification
  • Task ID: sentiment classification

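Dataset structure

  • Sentiment labels: 0 is a negative review, 1 is a positive review, and 2 is an ambiguous review. Note that label 2 is not meant to be used directly for supervised learning (classification); it is mainly intended for additional pre-training or other uses.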
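
As a rough illustration of the label scheme above, the following sketch maps each `Rating` value to a readable name and counts how many reviews fall under each. It uses plain Python on a few made-up rows rather than the real dataset; the `Review` column name, the sample texts, and the helper name `rating_distribution` are assumptions for illustration only.

```python
from collections import Counter

# Label meanings as stated in the dataset card.
RATING_NAMES = {0: "negative", 1: "positive", 2: "ambiguous"}


def rating_distribution(rows):
    """Count how many rows fall under each named rating."""
    counts = Counter(RATING_NAMES[row["Rating"]] for row in rows)
    # Report every label, including ones with zero rows.
    return {name: counts.get(name, 0) for name in RATING_NAMES.values()}


# Hypothetical toy rows; the "Review" column name and texts are assumptions.
sample_rows = [
    {"Rating": 1, "Review": "맛있어요"},
    {"Rating": 0, "Review": "별로였어요"},
    {"Rating": 2, "Review": "그냥 그래요"},
    {"Rating": 1, "Review": "또 올게요"},
]

print(rating_distribution(sample_rows))
```

Checking this distribution before training is a cheap way to see how many ambiguous rows would be dropped for the binary task.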

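Usage notes

  • Load the dataset with the datasets library and remove the unneeded column.
  • Filter out reviews with the ambiguous label to get a binary sentiment classification set.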
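
The filtering step can also keep the ambiguous reviews aside instead of discarding them, since the dataset card says they are meant for additional pre-training. The sketch below shows one way to do that split in plain Python; the helper name `split_by_ambiguity`, the `Review` column name, and the toy rows are assumptions, and with a real `datasets.Dataset` the same idea would be expressed with two `filter` calls.

```python
def split_by_ambiguity(rows, ambiguous_label=2):
    """Split rows into (binary, ambiguous) lists based on the Rating column."""
    binary, ambiguous = [], []
    for row in rows:
        if row["Rating"] == ambiguous_label:
            ambiguous.append(row)  # unlabeled pool, e.g. for extra pre-training
        else:
            binary.append(row)     # rows with label 0 or 1 for classification
    return binary, ambiguous


# Hypothetical toy rows; the "Review" column name and texts are assumptions.
toy_rows = [
    {"Rating": 0, "Review": "별로였어요"},
    {"Rating": 2, "Review": "그냥 그래요"},
    {"Rating": 1, "Review": "맛있어요"},
]

binary_rows, ambiguous_rows = split_by_ambiguity(toy_rows)
print(len(binary_rows), len(ambiguous_rows))
```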

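Legal notice

  • Non-commercial use and release of KR3 fall within the scope of fair use under the Korean copyright act.
  • The authors did not agree to the terms of service of any website that might prohibit web crawling.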