SEACrowd/visobert

Name: SEACrowd/visobert
Creator: SEACrowd
Published: 2024-06-24 13:28:59
License: 暂无描述

Hugging Face2024-06-24 更新2024-06-29 收录

下载链接：

https://hf-mirror.com/datasets/SEACrowd/visobert

下载链接

链接失效反馈

官方服务：

资源简介：

ViSoBERT数据集包含从2016年1月至2022年12月期间，越南认证用户在Facebook、TikTok和YouTube上的帖子、评论。数据经过处理，包括标签、表情符号、拼写错误、超链接等非规范文本。

提供机构：

SEACrowd

原始信息汇总

Visobert 数据集概述

基本信息

名称: Visobert
语言: 越南语 (vie)
任务类别: 自监督预训练 (self-supervised-pretraining)
标签: 自监督预训练 (self-supervised-pretraining)
许可证: Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0)

数据集描述

来源: 从Facebook、TikTok和YouTube爬取的越南语文本数据。
内容: 包含Facebook帖子、TikTok评论和YouTube评论，数据来自越南认证用户，时间范围从2016年1月（TikTok数据从2020年1月）到2022年12月。
预处理: 处理了标签、表情符号、拼写错误、超链接和其他非规范文本。

使用方法

使用 `datasets` 库

python from datasets import load_dataset dset = datasets.load_dataset("SEACrowd/visobert", trust_remote_code=True)

使用 `seacrowd` 库

python import seacrowd as sc

使用默认配置加载数据集

dset = sc.load_dataset("visobert", schema="seacrowd")

查看数据集的所有可用子集（配置名称）

print(sc.available_config_names("visobert"))

使用特定配置加载数据集

dset = sc.load_dataset_by_config_name(config_name="<config_name>")

版本信息

源版本: 1.0.0
SEACrowd版本: 2024.06.20

引用

bibtex @inproceedings{nguyen-etal-2023-visobert, title = "{V}i{S}o{BERT}: A Pre-Trained Language Model for {V}ietnamese Social Media Text Processing", author = "Nguyen, Nam and Phan, Thang and Nguyen, Duc-Vu and Nguyen, Kiet", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.emnlp-main.315", pages = "5191--5207", abstract = "English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes. Disclaimer: This paper contains actual comments on social networks that might be construed as abusive, offensive, or obscene.", }

@article{lovenia2024seacrowd, title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages}, author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya}, year={2024}, eprint={2406.10118}, journal={arXiv preprint arXiv: 2406.10118} }

5,000+

优质数据集

54 个

任务类型

进入经典数据集