dat201204/vietnamese-caucu-comments

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link: https://hf-mirror.com/datasets/dat201204/vietnamese-caucu-comments

Resource description:
---
language:
- vi
license: other
pretty_name: Vietnamese Cau Cuu Facebook Comments
tags:
- vietnamese
- disaster-response
- emergency-detection
- facebook-comments
- text-classification
task_categories:
- text-classification
task_ids:
- binary-classification
size_categories:
- 1K<n<10K
annotations_creators:
- machine-generated
source_datasets:
- original
---

# Vietnamese Cau Cuu Facebook Comments

## Dataset Summary

This dataset contains Vietnamese Facebook comments collected from a natural-disaster discussion thread and auto-labeled for binary emergency detection.

The target task is to detect whether a comment is a real-time rescue request (`cau_cuu`) versus a non-emergency comment (`khong_phai_cau_cuu`).

This release is intended as a bootstrap dataset for triage modeling and should be treated as a weakly supervised resource. Human review is strongly recommended before production use.

## Task Definition

- `0`: `khong_phai_cau_cuu`
  Non-emergency content such as sympathy, reposts, hotline aggregation, updates that the family is already safe, or unrelated discussion.
- `1`: `cau_cuu`
  Active rescue requests where people are trapped, in immediate danger, isolated, or explicitly requesting emergency evacuation/support.

Priority metric for downstream models: recall on label `1`.

## Data Source

- Source type: Vietnamese Facebook comments from a disaster-related post/thread.
- Data was flattened from both top-level comments and nested replies.
- Original extraction and weak labeling were produced locally for research and experimentation.

## Data Fields

- `id`: Stable synthetic identifier derived from comment tree position.
- `text`: Raw Vietnamese comment text.
- `label`: Binary weak label (`0` or `1`).
- `confidence`: Heuristic confidence score from the auto-labeling pipeline.
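
The documented fields and the priority metric (recall on label `1`) can be illustrated with a minimal Python sketch. The `Comment` dataclass mirrors the four fields listed in the card. Everything else is an assumption made for illustration: the `id` strings, the idea that `confidence` lies between 0 and 1, the helper name `recall_on_positive`, and the three example comments, which are invented and are not rows from the dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Comment:
    id: str            # hypothetical format; the card only says it derives from tree position
    text: str          # raw Vietnamese comment text
    label: int         # 0 = khong_phai_cau_cuu, 1 = cau_cuu
    confidence: float  # assumed here to lie in [0, 1]


def recall_on_positive(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> float:
    """Recall = TP / (TP + FN) for the positive class; 0.0 when there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Invented example rows, for illustration only.
rows = [
    Comment("c1", "Cứu với, nhà em mắc kẹt", 1, 0.9),
    Comment("c1_r1", "Gia đình em đã an toàn", 0, 0.8),
    Comment("c2", "SOS, nước ngập tới mái", 1, 0.7),
]

y_true = [r.label for r in rows]
y_pred = [1, 0, 0]  # a hypothetical model that misses the second rescue request
recall = recall_on_positive(y_true, y_pred)
```

Here the hypothetical model catches one of the two rescue requests, so `recall` is 0.5. A missed rescue request (false negative) lowers this metric directly, which is presumably why the card prioritizes it over overall accuracy.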
## Class Distribution

Current version statistics:

- Total rows: `1492`
- Label `0`: `1050`
- Label `1`: `442`

## Labeling Method

Labels were assigned with a rule-based weak supervision pipeline using:

- urgent rescue keywords such as `cứu`, `mắc kẹt`, `ngập tới mái`, `khẩn cấp`, `SOS`
- structural signals such as phone numbers, map links, GPS-like coordinates, and location mentions
- negative filters for resolved cases (`đã được cứu`, `đã an toàn`) and hotline/broadcast style comments

Because labels are weakly supervised, false positives and false negatives remain possible.

## Recommended Use

- Training or bootstrapping a Vietnamese emergency comment classifier
- Error analysis and heuristic refinement
- Human-in-the-loop triage experiments

## Limitations and Ethics

- Comments may contain sensitive situational information such as phone numbers and addresses.
- Labels are machine-generated and not fully human-verified.
- Do not use this dataset for surveillance or any harmful downstream purpose.
- Review privacy, legal, and platform-policy constraints before redistribution or deployment.

## Citation

If you use this dataset, please cite the project/repository that publishes this dataset card and describe it as a weakly supervised Vietnamese rescue-request classification dataset.
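
As an illustration of the rule-based approach described under Labeling Method, the following Python sketch combines the keywords and negative filters named in the card with a simple phone-number check. It is not the authors' pipeline, which is not published in the card. The function name `weak_label`, the phone pattern (ten digits starting with `0`), the order in which rules are applied, and the confidence formula are all assumptions chosen for the example.

```python
import re
import unicodedata

# Keywords and resolved-case phrases are taken from the dataset card.
URGENT_KEYWORDS = ["cứu", "mắc kẹt", "ngập tới mái", "khẩn cấp", "sos"]
RESOLVED_PHRASES = ["đã được cứu", "đã an toàn"]

# Assumption: a phone number looks like ten digits starting with 0.
PHONE_PATTERN = re.compile(r"\b0\d{9}\b")


def _norm(s: str) -> str:
    """Lowercase and NFC-normalize so composed/decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", s).lower()


def weak_label(text: str) -> tuple[int, float]:
    """Return (label, confidence) for one comment using hypothetical scoring."""
    t = _norm(text)

    # Negative filter first: "đã được cứu" contains "cứu", so checking
    # keywords before this filter would mislabel resolved cases.
    if any(_norm(p) in t for p in RESOLVED_PHRASES):
        return 0, 0.9

    hits = sum(1 for k in URGENT_KEYWORDS if _norm(k) in t)
    has_phone = bool(PHONE_PATTERN.search(t))

    if hits == 0:
        return 0, 0.5  # no urgent keyword: non-emergency, low confidence

    # More matching signals give higher (made-up) confidence, capped at 1.0.
    confidence = min(1.0, 0.5 + 0.1 * (hits + int(has_phone)))
    return 1, confidence


request = "Cứu với, nhà em mắc kẹt, nước ngập tới mái. SĐT 0912345678"
resolved = "Nhà mình đã được cứu rồi, cảm ơn mọi người"
sympathy = "Thương quá, mong mọi người bình an"

labels = [weak_label(s)[0] for s in (request, resolved, sympathy)]
```

With these invented inputs, `labels` is `[1, 0, 0]`. Plain substring matching is deliberately crude: `cứu` also appears in phrases such as `cứu trợ` (relief), which is one way the false positives mentioned in the card can arise.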

Provider: dat201204