
turkish-nlp-suite/BuyukSinema

Source: Hugging Face · Updated: 2024-11-01 · Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/BuyukSinema
Resource description:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
pretty_name: BuyukSinema
tags:
- sentiment
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': nono
          '1': nowatch
          '2': horrible
          '3': poor
          '4': bad
          '5': middle
          '6': good
          '7': great
          '8': super
          '9': amazing
  splits:
  - name: train
    num_bytes: 46979645
    num_examples: 67328
  - name: validation
    num_bytes: 733500
    num_examples: 10000
  - name: test
    num_bytes: 742661
    num_examples: 10000
  download_size: 58918801
data_files:
- split: train
  path: movies/train-*
- split: validation
  path: movies/validation-*
- split: test
  path: movies/test-*
---

# BüyükSinema - A Large Scale Turkish Movie Reviews Sentiment Dataset

<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/buyuksinema.png" width="30%" height="30%">

## Dataset Summary

BüyükSinema is a Turkish movie review dataset of about 87K reviews, scraped from Sinefil.com and Beyazperde.com. It is a superset of the [BeyazPerde All Movie Reviews](https://huggingface.co/datasets/turkish-nlp-suite/beyazperde-all-movie-reviews), [BeyazPerde Top 300 Movie Reviews](https://huggingface.co/datasets/turkish-nlp-suite/beyazperde-top-300-movie-reviews) and [Sinefil Movie Reviews](https://huggingface.co/datasets/turkish-nlp-suite/sinefil-movie-reviews) datasets. Because it merges three datasets from two sources, we rescaled the star ratings to a common range of 1-10. The star distribution is as follows:

| star rating | count |
|---|---|
| 1 | 5,657 |
| 2 | 3,092 |
| 3 | 2,172 |
| 4 | 3,491 |
| 5 | 7,349 |
| 6 | 9,078 |
| 7 | 15,647 |
| 8 | 21,154 |
| 9 | 10,868 |
| 10 | 8,820 |
| total | 87,328 |

The star distribution is heavily skewed towards ratings of 7 stars and above, which account for roughly 65% of all reviews.
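The skew in the star distribution can be checked directly from the table counts. The following is a minimal sketch in plain Python: the counts are copied from the table, while the helper name `share_at_or_above` is a hypothetical name chosen for illustration and is not part of the dataset or any library.

```python
# Star-rating counts copied from the table in the dataset card.
STAR_COUNTS = {
    1: 5657, 2: 3092, 3: 2172, 4: 3491, 5: 7349,
    6: 9078, 7: 15647, 8: 21154, 9: 10868, 10: 8820,
}


def share_at_or_above(counts, min_star):
    """Return the fraction of reviews whose star rating is >= min_star.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    total = sum(counts.values())
    selected = sum(n for star, n in counts.items() if star >= min_star)
    return selected / total


total_reviews = sum(STAR_COUNTS.values())
high_share = share_at_or_above(STAR_COUNTS, 7)

print(f"total reviews: {total_reviews}")
print(f"share with 7+ stars: {high_share:.1%}")
```

The total reproduces the 87,328 given in the table, and the share of reviews with 7 or more stars comes out at roughly 65%, which is worth keeping in mind when choosing evaluation metrics for the 10-class task.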
For more information about dataset statistics, please refer to the [research paper]().

## Dataset Instances

An instance looks like:

```
{
  "text":"Mükemmelin ötesinde bir şey. Helal olsun. Devamını da isteriz artık... Emeğinize Yüreğinize Sağlık...",
  "label":9
}
```

## Data Split

| name |train|validation|test|
|---------|----:|---:|---:|
|BüyükSinema Movie Reviews|67328|10000|10000|

## Benchmarking

This dataset is part of the [TrGLUE](https://huggingface.co/datasets/turkish-nlp-suite/TrGLUE) and [SentiTurca](https://huggingface.co/datasets/turkish-nlp-suite/SentiTurca) benchmarks, where the subset is named **TrSST-2**, following the GLUE task naming. To follow the original GLUE conventions, the TrGLUE and SentiTurca tasks are binary classification tasks. In this repo, you can access the original star ratings if you want a challenge.

We benchmarked the transformer-based model BERTurk on the binary classification task; it achieved a Matthews correlation coefficient of **0.67**. More information can be found in the [research paper](), and the benchmarking code is available in the [TrGLUE GitHub repo](https://github.com/turkish-nlp-suite/TrGLUE).

## Citation

Coming soon!!
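Returning to the benchmarking setup described above, the sketch below shows how a Matthews correlation coefficient could be computed for a binary view of the 10-class labels. It rests on two assumptions that the card does not confirm: that class index `i` corresponds to `i + 1` stars, and that reviews with 6 or more stars count as positive. The threshold, the function names and the toy data are all hypothetical and are not the benchmark's actual procedure.

```python
import math

# Label names in index order, copied from the class_label list in the card.
LABEL_NAMES = ["nono", "nowatch", "horrible", "poor", "bad",
               "middle", "good", "great", "super", "amazing"]


def label_to_stars(label):
    """ASSUMPTION: class index i corresponds to a rating of i + 1 stars."""
    return label + 1


def binarize(label, min_positive_stars=6):
    """Map a 10-class label to 0/1.

    HYPOTHETICAL threshold: the card does not state how binary labels
    were derived from star ratings.
    """
    return 1 if label_to_stars(label) >= min_positive_stars else 0


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy gold labels and predictions, invented purely for illustration.
gold_labels = [9, 7, 0, 2, 5, 8, 1, 6]
predicted = [1, 1, 0, 0, 0, 1, 1, 1]

gold_binary = [binarize(label) for label in gold_labels]
score = matthews_corrcoef(gold_binary, predicted)
print(f"MCC on toy data: {score:.2f}")
```

The Matthews correlation coefficient uses all four cells of the confusion matrix, so it is less flattering than plain accuracy when one class dominates, as the positive reviews do here.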
Provider:
turkish-nlp-suite