dumitrescustefan/ro_sent

Name: dumitrescustefan/ro_sent
Creator: dumitrescustefan
Published: 2024-01-18 11:14:48
License: No description available

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/dumitrescustefan/ro_sent

Resource overview:

---
annotations_creators:
- found
language_creators:
- found
language:
- ro
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
pretty_name: RoSent
dataset_info:
  features:
  - name: original_id
    dtype: string
  - name: id
    dtype: string
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': positive
  splits:
  - name: train
    num_bytes: 8367687
    num_examples: 17941
  - name: test
    num_bytes: 6837430
    num_examples: 11005
  download_size: 14700057
  dataset_size: 15205117
---

# Dataset Card for RoSent

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [GitHub](https://github.com/dumitrescustefan/Romanian-Transformers/tree/examples/examples/sentiment_analysis)
- **Repository:** [GitHub](https://github.com/dumitrescustefan/Romanian-Transformers/tree/examples/examples/sentiment_analysis)
- **Paper:** [arXiv preprint](https://arxiv.org/pdf/2009.08712.pdf)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This is a Romanian sentiment analysis dataset. It is provided in processed form, as used by the authors of [`Romanian Transformers`](https://github.com/dumitrescustefan/Romanian-Transformers) in their examples, and is based on the original data available at [this GitHub repository](https://github.com/katakonst/sentiment-analysis-tensorflow). The original data contains product and movie reviews in Romanian.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is in Romanian.

## Dataset Structure

### Data Instances

An instance from the `train` split:

```
{'id': '0', 'label': 1, 'original_id': '0', 'sentence': 'acest document mi-a deschis cu adevarat ochii la ceea ce oamenii din afara statelor unite s-au gandit la atacurile din 11 septembrie. acest film a fost construit in mod expert si prezinta acest dezastru ca fiind mai mult decat un atac asupra pamantului american. urmarile acestui dezastru sunt previzionate din multe tari si perspective diferite. cred ca acest film ar trebui sa fie mai bine distribuit pentru acest punct. de asemenea, el ajuta in procesul de vindecare sa vada in cele din urma altceva decat stirile despre atacurile teroriste. si unele dintre piese sunt de fapt amuzante, dar nu abuziv asa. acest film a fost extrem de recomandat pentru mine, si am trecut pe acelasi sentiment.'}
```

### Data Fields

- `original_id`: a `string` feature containing the original id from the file.
- `id`: a `string` feature.
- `sentence`: a `string` feature.
- `label`: a classification label, with possible values `negative` (0) and `positive` (1).

### Data Splits

This dataset has two splits: `train` with 17941 examples and `test` with 11005 examples.

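The field and label layout described above can be sketched in plain Python. This is a minimal illustration, not part of the original card: `RoSentRecord`, `label_name`, and `load_rosent` are hypothetical names introduced here, and the loading call assumes the Hugging Face `datasets` library is installed and that the dataset resolves under the id `dumitrescustefan/ro_sent`; it may need adjusting depending on the library version.

```python
from typing import TypedDict

# Label names in index order, as listed in the card's `class_label` metadata.
LABEL_NAMES = ["negative", "positive"]


class RoSentRecord(TypedDict):
    """Shape of one example, following the Data Fields section (hypothetical type name)."""
    original_id: str
    id: str
    sentence: str
    label: int


def label_name(record: RoSentRecord) -> str:
    """Map the integer label of a record to its class name."""
    return LABEL_NAMES[record["label"]]


def load_rosent():
    """Assumed loading path via the `datasets` library; requires network access."""
    from datasets import load_dataset  # imported lazily so the rest runs without it
    return load_dataset("dumitrescustefan/ro_sent")


# Abbreviated copy of the `train` instance shown above (sentence shortened here).
example: RoSentRecord = {
    "id": "0",
    "label": 1,
    "original_id": "0",
    "sentence": "acest document mi-a deschis cu adevarat ochii ...",
}
print(label_name(example))  # positive
```

Keeping the label names in a list indexed by the integer label mirrors the `'0': negative`, `'1': positive` mapping in the metadata, so decoding is a single lookup.
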
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The source dataset is available at [this GitHub repository](https://github.com/katakonst/sentiment-analysis-tensorflow) and consists of product and movie reviews. The original source is unknown.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Stefan Daniel Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo, [@katakonst](https://github.com/katakonst)

### Licensing Information

[More Information Needed]

### Citation Information

```
@article{dumitrescu2020birth,
  title={The birth of Romanian BERT},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius and Pyysalo, Sampo},
  journal={arXiv preprint arXiv:2009.08712},
  year={2020}
}
```

### Contributions

Thanks to [@gchhablani](https://github.com/gchhablani) and [@iliemihai](https://github.com/iliemihai) for adding this dataset.

Provider:

dumitrescustefan

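Summary of Original Information

Dataset Overview

Name: RoSent

Language: Romanian

License: Unknown

Multilinguality: Monolingual

Size category: 10K<n<100K

Source datasets: Original

Task category: Text classification

Task ID: Sentiment classification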

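Dataset Structure

Data Instances

Fields:
- original_id: string; the ID from the original file.
- id: string.
- sentence: string; the sentence text.
- label: classification label; values are negative (0) and positive (1).

Data Splits

Training set: 17941 examples
Test set: 11005 examples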
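
As a quick sanity check on the split figures, the counts and byte sizes can be added up and compared with the declared totals. This is a minimal sketch: the numbers are copied from the card's metadata header and split listing, while `SPLITS` and `DATASET_SIZE` are names chosen here for illustration.

```python
# Figures copied from the card's metadata header and Data Splits listing.
SPLITS = {
    "train": {"num_examples": 17941, "num_bytes": 8367687},
    "test": {"num_examples": 11005, "num_bytes": 6837430},
}
DATASET_SIZE = 15205117  # `dataset_size` declared in the metadata

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Report each split's share of the total number of examples.
for name, split in SPLITS.items():
    share = split["num_examples"] / total_examples
    print(f"{name}: {split['num_examples']} examples ({share:.1%})")

print(f"total: {total_examples} examples")
print("bytes consistent with dataset_size:", total_bytes == DATASET_SIZE)
```

The two splits sum to 28946 examples, which falls inside the declared 10K<n<100K size category, and the per-split byte counts add up to the declared `dataset_size`. The test split is unusually large relative to the training split (roughly 38% of all examples).
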

Dataset Creation

Source Data

Type: Product and movie reviews
Source: GitHub repository

Annotations

Creators: Not specified

Considerations for Using the Data

Licensing Information

Status: Unknown

Citation Information

@article{dumitrescu2020birth,
  title={The birth of Romanian BERT},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius and Pyysalo, Sampo},
  journal={arXiv preprint arXiv:2009.08712},
  year={2020}
}
