dat201204/vietnamese-caucu-comments
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dat201204/vietnamese-caucu-comments
Resource description:
---
language:
- vi
license: other
pretty_name: Vietnamese Cau Cuu Facebook Comments
tags:
- vietnamese
- disaster-response
- emergency-detection
- facebook-comments
- text-classification
task_categories:
- text-classification
task_ids:
- binary-classification
size_categories:
- 1K<n<10K
annotations_creators:
- machine-generated
source_datasets:
- original
---
# Vietnamese Cau Cuu Facebook Comments
## Dataset Summary
This dataset contains Vietnamese Facebook comments collected from a natural-disaster discussion thread and auto-labeled for binary emergency detection.
The target task is to distinguish real-time rescue requests (`cau_cuu`) from non-emergency comments (`khong_phai_cau_cuu`).
This release is intended as a bootstrap dataset for triage modeling and should be treated as a weakly supervised resource. Human review is strongly recommended before production use.
## Task Definition
- `0`: `khong_phai_cau_cuu`
  Non-emergency content such as sympathy, reposts, hotline aggregation, updates that the family is already safe, or unrelated discussion.
- `1`: `cau_cuu`
  Active rescue requests where people are trapped, in immediate danger, isolated, or explicitly requesting emergency evacuation/support.

Priority metric for downstream models: recall on label `1`.
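
As a minimal sketch of the priority metric, recall on label `1` is the share of true rescue requests that a model actually flags: TP / (TP + FN). The helper name `recall_on_positive` and the toy label lists below are illustrative only and are not part of the dataset.

```python
def recall_on_positive(y_true, y_pred, positive=1):
    """Recall for the positive class: TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no positive examples in y_true
    return tp / (tp + fn)


# Toy labels (made up for illustration): 4 true rescue requests, 3 detected.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(recall_on_positive(y_true, y_pred))  # 0.75
```

A missed rescue request (false negative) lowers this number, while a false alarm does not, which is why it suits a triage setting where missing a real request is the costlier error.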
## Data Source
- Source type: Vietnamese Facebook comments from a disaster-related post/thread.
- Data was flattened from both top-level comments and nested replies.
- Extraction and weak labeling were performed locally for research and experimentation.
## Data Fields
- `id`: Stable synthetic identifier derived from comment tree position.
- `text`: Raw Vietnamese comment text.
- `label`: Binary weak label (`0` or `1`).
- `confidence`: Heuristic confidence score from the auto-labeling pipeline.
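
The four fields can be modeled as a small typed record for validation when loading rows. This is a sketch under assumptions: the card does not state the storage types, so `id` as a string and `confidence` as a float are guesses, and the `CommentRecord` class, `parse_record` helper, and sample row are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommentRecord:
    id: str            # assumed to be a string; the card only says "synthetic identifier"
    text: str
    label: int         # 0 = khong_phai_cau_cuu, 1 = cau_cuu
    confidence: float  # assumed numeric; the scale is not documented in the card


def parse_record(row: dict) -> CommentRecord:
    """Coerce a raw row into a CommentRecord and reject labels other than 0/1."""
    label = int(row["label"])
    if label not in (0, 1):
        raise ValueError(f"unexpected label {label!r} for id {row['id']!r}")
    return CommentRecord(
        id=str(row["id"]),
        text=str(row["text"]),
        label=label,
        confidence=float(row["confidence"]),
    )


# Made-up row for illustration; not taken from the dataset.
record = parse_record(
    {"id": "c12_r3", "text": "Nhà tôi ngập tới mái, cần cứu gấp", "label": "1", "confidence": "0.8"}
)
print(record.label, record.confidence)
```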
## Class Distribution
Current version statistics:
- Total rows: `1492`
- Label `0`: `1050`
- Label `1`: `442`
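
The counts above imply a roughly 70/30 class imbalance. The sketch below derives the class shares and one common reweighting heuristic, `total / (n_classes * count)`. The card does not prescribe any weighting scheme, so treat this purely as an illustration.

```python
# Counts copied from the statistics above.
counts = {0: 1050, 1: 442}
total = sum(counts.values())  # 1492

# Fraction of rows per label.
share = {label: n / total for label, n in counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in sorted(counts):
    print(label, round(share[label], 3), round(weights[label], 3))
```

With these counts, label `1` makes up about 29.6% of rows, so its balanced weight is about 1.69 versus about 0.71 for label `0`.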
## Labeling Method
Labels were assigned with a rule-based weak supervision pipeline using:
- urgent rescue keywords such as `cứu`, `mắc kẹt`, `ngập tới mái`, `khẩn cấp`, `SOS`
- structural signals such as phone numbers, map links, GPS-like coordinates, and location mentions
- negative filters for resolved cases (`đã được cứu`, `đã an toàn`) and hotline/broadcast style comments
Because labels are weakly supervised, false positives and false negatives remain possible.
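
A rule-based labeler of the kind described above might look roughly like the following sketch. It is not the author's pipeline: the keyword and resolved-case lists come from the bullets above, but the phone pattern, the order of checks, and every confidence value are invented for illustration, and the hotline/broadcast filter, map links, and coordinates are omitted.

```python
import re
import unicodedata


def _nfc(s: str) -> str:
    """Normalize to NFC so composed and decomposed Vietnamese diacritics compare equal."""
    return unicodedata.normalize("NFC", s)


URGENT_KEYWORDS = tuple(_nfc(k) for k in ("cứu", "mắc kẹt", "ngập tới mái", "khẩn cấp", "sos"))
RESOLVED_PHRASES = tuple(_nfc(p) for p in ("đã được cứu", "đã an toàn"))
# Simplified, assumed phone pattern: "0" or "+84" followed by exactly 9 digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+84|0)\d{9}(?!\d)")


def weak_label(text: str) -> tuple[int, float]:
    """Return (label, confidence). Confidence values are arbitrary placeholders."""
    lowered = _nfc(text).lower()
    # Negative filter first: "đã được cứu" also contains the keyword "cứu".
    if any(p in lowered for p in RESOLVED_PHRASES):
        return 0, 0.9
    hits = sum(1 for kw in URGENT_KEYWORDS if kw in lowered)
    if hits == 0:
        return 0, 0.6
    has_phone = PHONE_RE.search(text) is not None
    confidence = min(0.5 + 0.15 * hits + (0.2 if has_phone else 0.0), 1.0)
    return 1, confidence


print(weak_label("Nhà tôi mắc kẹt, cần cứu gấp 0912345678"))
print(weak_label("Gia đình đã được cứu rồi, cảm ơn mọi người"))
```

Checking the resolved-case phrases before the urgent keywords matters because substring matching would otherwise flag "đã được cứu" as a rescue request.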
## Recommended Use
- Training or bootstrapping a Vietnamese emergency comment classifier
- Error analysis and heuristic refinement
- Human-in-the-loop triage experiments
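
For the human-in-the-loop use case, one plausible workflow is to send low-confidence rows to reviewers first. The sketch below assumes that `confidence` lies in [0, 1] with higher meaning more certain, which the card does not state. The `review_queue` helper, the 0.7 threshold, and the sample rows are hypothetical.

```python
def review_queue(rows, threshold=0.7):
    """Rows below the confidence threshold, ordered for manual review.

    Label 0 rows come first because a mislabeled 0 is a missed rescue
    request, which hurts recall on label 1. Within each label, the least
    confident rows are reviewed first.
    """
    uncertain = [r for r in rows if r["confidence"] < threshold]
    return sorted(uncertain, key=lambda r: (r["label"], r["confidence"]))


# Made-up rows for illustration.
rows = [
    {"id": "a", "label": 1, "confidence": 0.55},
    {"id": "b", "label": 0, "confidence": 0.95},
    {"id": "c", "label": 0, "confidence": 0.60},
    {"id": "d", "label": 0, "confidence": 0.40},
]
print([r["id"] for r in review_queue(rows)])  # ['d', 'c', 'a']
```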
## Limitations and Ethics
- Comments may contain sensitive situational information such as phone numbers and addresses.
- Labels are machine-generated and not fully human-verified.
- Do not use this dataset for surveillance or any harmful downstream purpose.
- Review privacy, legal, and platform-policy constraints before redistribution or deployment.
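
Because comments may contain phone numbers, a redaction pass before sharing derived data can reduce exposure. The sketch below masks strings that look like Vietnamese mobile numbers. The pattern is an assumption (a leading `0`, `84`, or `+84` followed by nine digits with optional spaces, dots, or hyphens), it will miss other formats, and it does nothing about addresses or names.

```python
import re

# Assumed pattern: prefix "0", "84", or "+84", then 9 digits,
# each optionally preceded by a space, dot, or hyphen.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?84|0)(?:[\s.\-]?\d){9}(?!\d)")


def redact_phones(text: str, placeholder: str = "[PHONE]") -> str:
    """Replace phone-like substrings with a placeholder token."""
    return PHONE_RE.sub(placeholder, text)


print(redact_phones("Gọi 0912 345 678 giúp"))    # Gọi [PHONE] giúp
print(redact_phones("Liên hệ +84912345678 gấp"))  # Liên hệ [PHONE] gấp
```

Note that phone numbers are also one of the structural signals used for labeling, so redacted text may behave differently under the original heuristics.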
## Citation
If you use this dataset, please cite the project/repository that publishes this dataset card and describe it as a weakly supervised Vietnamese rescue-request classification dataset.
Provider:
dat201204



