omilab/hebrew_sentiment
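Dataset Overview
Dataset Description
Dataset Summary
HebrewSentiment is a dataset of 12,804 user comments on posts from the official Facebook page of Israel's president, Reuven Rivlin. The comments cover June through August 2014, the first three months of Rivlin's presidency. Sentiment is labeled in three classes: 370 neutral comments, 8,512 positive comments, and 3,922 negative comments.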
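The class counts in the summary imply a strongly imbalanced label distribution. The following is a minimal sketch that turns those counts into shares; the names `CLASS_COUNTS` and `class_shares` are illustrative and not part of the dataset.

```python
# Class counts as stated in the dataset summary.
CLASS_COUNTS = {"neutral": 370, "positive": 8512, "negative": 3922}


def class_shares(counts):
    """Return each class's share of the total as a fraction in [0, 1]."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(CLASS_COUNTS.values())
shares = class_shares(CLASS_COUNTS)

print(f"total comments: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Positive comments make up roughly two thirds of the data, so a classifier that always predicts the positive class would already reach about that accuracy. Any reported accuracy is best read against that baseline.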
Supported Tasks and Leaderboards
Sentiment analysis
Languages
Hebrew
Dataset Structure
Data Instances
- Examples:
- רובי הייתי רוצה לראות ערביה נישאת ליהודי 1
- תמונה יפיפיה-שפו 0
- חייבים לעשות סוג של חרם כשכתבים שונאי ישראל עולים לשידור צריכים להעביר לערוץ אחר ואז תראו מה יעשה כוחו של הרייטינג ( בהקשר לדבריה של רינה מצליח ) 2
Data Fields
- text: Modern Hebrew input text.
- label: the sentiment label: 0 = positive, 1 = negative, 2 = off-topic.
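As a rough sketch of how the two fields fit together, the snippet below parses a raw line in the shape of the example instances (comment text followed by an integer label) and maps the label id to its name. It assumes the label is the last whitespace-separated token; the actual file layout may differ, and `ID2LABEL` and `parse_line` are hypothetical names introduced for illustration.

```python
# Label ids as described in the data fields.
ID2LABEL = {0: "positive", 1: "negative", 2: "off-topic"}


def parse_line(line):
    """Split a raw 'text<whitespace>label' line into (text, label_id).

    Assumption: the label is the last whitespace-separated token,
    as in the example instances. The real files may be laid out differently.
    """
    text, raw_label = line.rstrip().rsplit(None, 1)
    label_id = int(raw_label)
    if label_id not in ID2LABEL:
        raise ValueError(f"unexpected label id: {label_id}")
    return text, label_id


# One of the example instances, rebuilt as "text + space + label".
sample_text = "תמונה יפיפיה-שפו"
raw_line = f"{sample_text} 0"

text, label_id = parse_line(raw_line)
print(text, "->", ID2LABEL[label_id])
```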
Data Splits
| | train | test |
|---|---|---|
| HebrewSentiment (token) | 10243 | 2559 |
| HebrewSentiment (morph) | 10243 | 2559 |
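The sketch below shows one way the two configurations might be loaded and how the split sizes relate to each other. It assumes the Hugging Face `datasets` library, the repository id `omilab/hebrew_sentiment` taken from the title of this card, and configuration names `token` and `morph` inferred from the table rows. None of these are confirmed here, so check them against the hosting page before relying on them.

```python
# Split sizes as listed in the table (identical for both configurations).
SPLIT_SIZES = {"train": 10243, "test": 2559}


def split_fractions(sizes):
    """Return each split's share of all examples."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


def load_hebrew_sentiment(config="token"):
    """Load one configuration of the dataset (requires network access).

    Assumptions: the `datasets` package is installed, and the repository id
    and configuration names below match the hosted dataset.
    """
    if config not in ("token", "morph"):
        raise ValueError(f"unknown configuration: {config!r}")
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset("omilab/hebrew_sentiment", config)


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({share:.1%})")
```

Calling `load_hebrew_sentiment("morph")` would, under the same assumptions, return the morpheme-based variant with the same train and test sizes.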
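Dataset Creation
Data Collection and Normalization
User comments on posts from the official Facebook page of Israel's president, Reuven Rivlin. In October 2015, we used the open-source software Netvizz (Rieder, 2013) to scrape all comments posted from June through August 2014.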
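Annotation Process
A trained researcher examined each comment and determined its sentiment value: comments with an overall positive sentiment were assigned the value 0, comments with an overall negative sentiment were assigned the value 1, and comments unrelated to the content of the post were assigned the value 2. We validated the coding scheme by having a second trained researcher code the same data. Agreement between the raters was substantial (N of agreements: 10623, N of disagreements: 2105, Cohen's Kappa = 0.697, p = 0).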
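For readers unfamiliar with the agreement statistic, here is a minimal sketch of Cohen's kappa for two raters, together with the raw agreement rate implied by the reported counts. The per-comment labels of the two raters are not given here, so the kappa function is only run on a toy example. The function name and the toy data are illustrative, not the authors' code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Raw agreement rate from the counts reported in the annotation process.
agreements, disagreements = 10623, 2105
observed_agreement = agreements / (agreements + disagreements)
print(f"observed agreement: {observed_agreement:.3f}")

# Toy example only; these are not the dataset's annotations.
toy_a = [0, 0, 1, 1]
toy_b = [0, 0, 1, 0]
toy_kappa = cohens_kappa(toy_a, toy_b)
print(f"toy kappa: {toy_kappa:.3f}")
```

The raw agreement rate (about 83%) is higher than the reported kappa of 0.697 because kappa discounts the agreement two raters would reach by chance given how often each uses each label.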
Annotators
Researchers
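Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
OMIlab, The Open University of Israel
Licensing Information
MIT License
Citation Information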
@inproceedings{amram-etal-2018-representations,
    title = "Representations and Architectures in Neural Sentiment Analysis for Morphologically Rich Languages: A Case Study from {M}odern {H}ebrew",
    author = "Amram, Adam and Ben David, Anat and Tsarfaty, Reut",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/C18-1190",
    pages = "2242--2252",
    abstract = "This paper empirically studies the effects of representation choices on neural sentiment analysis for Modern Hebrew, a morphologically rich language (MRL) for which no sentiment analyzer currently exists. We study two dimensions of representational choices: (i) the granularity of the input signal (token-based vs. morpheme-based), and (ii) the level of encoding of vocabulary items (string-based vs. character-based). We hypothesise that for MRLs, languages where multiple meaning-bearing elements may be carried by a single space-delimited token, these choices will have measurable effects on task performance, and that these effects may vary for different architectural designs {---} fully-connected, convolutional or recurrent. Specifically, we hypothesize that morpheme-based representations will have advantages in terms of their generalization capacity and task accuracy, due to their better OOV coverage. To empirically study these effects, we develop a new sentiment analysis benchmark for Hebrew, based on 12K social media comments, and provide two instances of these data: in token-based and morpheme-based settings. Our experiments show that representation choices empirical effects vary with architecture type. While fully-connected and convolutional networks slightly prefer token-based settings, RNNs benefit from a morpheme-based representation, in accord with the hypothesis that explicit morphological information may help generalize. Our endeavour also delivers the first state-of-the-art broad-coverage sentiment analyzer for Hebrew, with over 89{\%} accuracy, alongside an established benchmark to further study the effects of linguistic representation choices on neural networks{'} task performance.",
}
Contributions
Thanks to @elronbandel for adding this dataset.



