w11wo/imdb-javanese
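Dataset Card for "imdb-javanese"
Dataset Description
Dataset Summary
The Large Movie Review Dataset translated to Javanese. This is a dataset for binary sentiment classification containing more data than previous benchmark datasets. We provide 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available. We translated the original IMDB dataset to Javanese using the multilingual MarianMT Transformer model from Helsinki-NLP/opus-mt-en-mul.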
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
An example from javanese_imdb_train.csv looks as follows:
| label | text |
|---|---|
| 1 | "Drama romantik sing digawé karo direktur Martin Ritt kuwi ora dingertèni, nanging ana momen-momen sing marahi karisma lintang Jane Fonda lan Robert De Niro (kelompok sing luar biasa). Dhèwèké dadi randha sing ora isa mlaku, iso anu anyar lan anyar-inventor-- kowé isa nganggep isiné. Adapsi novel Pat Barker ""Union Street"" (yak titel sing apik!) arep dinggo-back-back it on bland, lan pendidikan film kuwi gampang, nanging isih nyenengké; a rosy-hued-inventor-fantasi. Ora ana sing ngganggu gambar sing sejati ding kok iso dinggo nggawe gambar sing paling nyeneng." |
| 0 | "Pengalaman wong lanang sing nduwé perasaan sing ora lumrah kanggo babi. Mulai nganggo tuladha sing luar biasa yaiku komedia. Wong orkestra termel digawé dadi wong gila, sing kasar merga nyanyian nyanyi. Sayangé, kuwi tetep absurd wektu WHOLE tanpa ceramah umum sing mung digawé. Malah, sing ana ing jaman kuwi kudu ditinggalké. Diyalog kryptik sing nggawé Shakespeare marah gampang kanggo kelas telu. Pak teknis kuwi luwih apik timbang kowe mikir nganggo cinematografi sing apik sing jenengé Vilmos Zsmond. Masa depan bintang Saly Kirkland lan Frederic Forrest isa ndelok." |
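The doubled quotes around ""Union Street"" in the first row are standard CSV escaping for a literal quote inside a quoted field. A minimal sketch of reading such rows with Python's standard library follows. It assumes the file has a `label,text` header line in that column order, which the card does not state explicitly. The sample string is a shortened, hypothetical excerpt rather than the real file, and `read_reviews` and `LABEL_NAMES` are illustrative names.

```python
import csv
import io

# Hypothetical excerpt laid out like javanese_imdb_train.csv.
# Assumption: a header line followed by label,text rows; the texts are shortened.
SAMPLE_CSV = '''label,text
1,"Adapsi novel Pat Barker ""Union Street"" (yak titel sing apik!)"
0,"Pengalaman wong lanang sing nduwé perasaan sing ora lumrah kanggo babi."
'''

# Label convention from the dataset card: 1 is positive, 0 is negative.
LABEL_NAMES = {0: "negative", 1: "positive"}


def read_reviews(handle):
    """Yield (label, text) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    for row in reader:
        # The csv module turns the doubled quotes back into single quotes.
        yield int(row["label"]), row["text"]


reviews = list(read_reviews(io.StringIO(SAMPLE_CSV)))
for label, text in reviews:
    print(LABEL_NAMES[label], "|", text)
```

For the real file, the same function could be passed a handle opened with `open(path, newline="", encoding="utf-8")`.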
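Data Fields
text: The movie review, translated to Javanese.
label: The sentiment expressed in the review, where 1 is positive and 0 is negative.
Data Split Sample Sizes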
| train | unsupervised | test |
|---|---|---|
| 25000 | 50000 | 25000 |
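The split sizes in the table can serve as a quick sanity check after downloading the data. The sketch below compares observed row counts against the documented ones. The helper names `check_split_sizes` and `load_split_sizes` are hypothetical, and the loader assumes the repository can be read with `datasets.load_dataset` and exposes splits under the names used in the table, which this card does not confirm.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 25000, "unsupervised": 50000, "test": 25000}


def check_split_sizes(actual_sizes, expected=EXPECTED_SIZES):
    """Return {split: (found, expected)} for every split whose size differs."""
    mismatches = {}
    for name, size in expected.items():
        found = actual_sizes.get(name)  # None when the split is missing
        if found != size:
            mismatches[name] = (found, size)
    return mismatches


def load_split_sizes(repo_id="w11wo/imdb-javanese"):
    """Count rows per split on the Hugging Face Hub (needs network access).

    Assumption: the repository loads with `datasets.load_dataset` and uses
    the split names shown in the table.
    """
    from datasets import load_dataset  # optional dependency, imported lazily

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


# Labeled reviews are the train and test splits combined.
labeled_total = EXPECTED_SIZES["train"] + EXPECTED_SIZES["test"]
print(labeled_total)
print(sum(EXPECTED_SIZES.values()))
```

Calling `check_split_sizes(load_split_sizes())` would then return an empty dict if the downloaded data matches the table.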
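Dataset Creation
Curation Rationale
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Annotations
Annotation process
Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information
Citation Information
If you use this dataset in your research, please cite: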
@inproceedings{wongso2021causal, title={Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures}, author={Wongso, Wilson and Setiawan, David Samuel and Suhartono, Derwin}, booktitle={2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS)}, pages={1--7}, year={2021}, organization={IEEE} }
@InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} }



