SEACrowd/visobert
Visobert Dataset Overview
Basic Information
- Name: Visobert
- Language: Vietnamese (vie)
- Task category: Self-supervised pretraining (self-supervised-pretraining)
- Tags: self-supervised-pretraining
- License: Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0)
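Dataset Description
- Source: Vietnamese text crawled from Facebook, TikTok, and YouTube.
- Content: Facebook posts, TikTok comments, and YouTube comments written by verified Vietnamese users, covering January 2016 (January 2020 for TikTok) through December 2022.
- Preprocessing: Hashtags, emojis, misspellings, hyperlinks, and other non-canonical text were processed.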
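The kinds of cleanup listed above can be illustrated with a small sketch. This is not the authors' pipeline: the function name `normalize_post`, the `<url>` placeholder, and the regular expressions are assumptions chosen only for illustration, and emojis and misspellings are deliberately left untouched here.

```python
import re

# Hypothetical patterns; the dataset card does not describe the actual rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_post(text: str) -> str:
    """Replace links with a placeholder, unwrap hashtags, collapse whitespace."""
    # Handle URLs first so that '#' fragments inside links are not rewritten.
    text = URL_RE.sub("<url>", text)
    # Keep the hashtag word but drop the leading '#'.
    text = HASHTAG_RE.sub(r"\1", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(normalize_post("Xem ngay   https://example.com #dulich  nhe"))
```

Whether hashtags should be unwrapped, removed, or kept verbatim depends on the downstream task, so treat the choices above as one possible convention.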
Usage
Using the datasets library
    from datasets import load_dataset

    # trust_remote_code=True allows the dataset's loading script to run locally.
    dset = load_dataset("SEACrowd/visobert", trust_remote_code=True)
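Once loaded, the corpus can be inspected record by record. The sketch below computes simple size statistics over any iterable of records. It assumes each record is a dict with a `text` field, which is an assumption about the schema rather than something stated in this card, and the `sample` rows are made-up stand-in data so the snippet runs without downloading anything.

```python
from typing import Dict, Iterable


def corpus_stats(
    records: Iterable[Dict[str, str]], text_field: str = "text"
) -> Dict[str, float]:
    """Count records and characters; `text_field` is an assumed column name."""
    n_records = 0
    n_chars = 0
    for record in records:
        n_records += 1
        n_chars += len(record[text_field])
    mean_chars = n_chars / n_records if n_records else 0.0
    return {"records": n_records, "chars": n_chars, "mean_chars": mean_chars}


# Made-up stand-in rows; with the real data, pass one split of the loaded dataset.
sample = [{"id": "0", "text": "abc"}, {"id": "1", "text": "de"}]
print(corpus_stats(sample))
```

Because the function only iterates once and keeps running totals, it also works on streamed data that does not fit in memory.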
Using the seacrowd library
    import seacrowd as sc

    # Load the dataset using the default config
    dset = sc.load_dataset("visobert", schema="seacrowd")

    # Check all available subsets (config names) of the dataset
    print(sc.available_config_names("visobert"))

    # Load the dataset using a specific config
    dset = sc.load_dataset_by_config_name(config_name="<config_name>")
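The placeholder `<config_name>` has to be filled with one of the names returned by `sc.available_config_names`. A minimal sketch of picking one by schema follows. The naming pattern used here (a `_source` suffix for the original schema and `_seacrowd_` inside unified-schema names), the helper `pick_config`, and the example entries in `names` are all assumptions for illustration; check the printed list for the real values.

```python
from typing import List, Optional


def pick_config(names: List[str], schema: str = "seacrowd") -> Optional[str]:
    """Return the first config name matching the requested schema, or None."""
    for name in names:
        if schema == "source" and name.endswith("_source"):
            return name
        if schema == "seacrowd" and "_seacrowd_" in name:
            return name
    return None


# Hypothetical names; in practice use sc.available_config_names("visobert").
names = ["visobert_source", "visobert_seacrowd_ssp"]
print(pick_config(names))
print(pick_config(names, "source"))
```

The chosen name can then be passed as `config_name` to `sc.load_dataset_by_config_name`.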
Version Information
- Source version: 1.0.0
- SEACrowd version: 2024.06.20
Citation
@inproceedings{nguyen-etal-2023-visobert,
  title = "{V}i{S}o{BERT}: A Pre-Trained Language Model for {V}ietnamese Social Media Text Processing",
  author = "Nguyen, Nam and Phan, Thang and Nguyen, Duc-Vu and Nguyen, Kiet",
  editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
  booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2023",
  address = "Singapore",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.emnlp-main.315",
  pages = "5191--5207",
  abstract = "English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes. Disclaimer: This paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene."
}
@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv:2406.10118}
}



