dat201204/vietnamese-caucu-comments

Name: dat201204/vietnamese-caucu-comments
Creator: dat201204
Published: 2026-03-26 12:36:31
License: No description available

Hugging Face, updated 2026-03-26, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/dat201204/vietnamese-caucu-comments

Resource description:

---
language:
- vi
license: other
pretty_name: Vietnamese Cau Cuu Facebook Comments
tags:
- vietnamese
- disaster-response
- emergency-detection
- facebook-comments
- text-classification
task_categories:
- text-classification
task_ids:
- binary-classification
size_categories:
- 1K<n<10K
annotations_creators:
- machine-generated
source_datasets:
- original
---

# Vietnamese Cau Cuu Facebook Comments

## Dataset Summary

This dataset contains Vietnamese Facebook comments collected from a natural-disaster discussion thread and auto-labeled for binary emergency detection. The target task is to detect whether a comment is a real-time rescue request (`cau_cuu`) versus a non-emergency comment (`khong_phai_cau_cuu`).

This release is intended as a bootstrap dataset for triage modeling and should be treated as a weakly supervised resource. Human review is strongly recommended before production use.

## Task Definition

- `0`: `khong_phai_cau_cuu`
  Non-emergency content such as sympathy, reposts, hotline aggregation, updates that the family is already safe, or unrelated discussion.
- `1`: `cau_cuu`
  Active rescue requests where people are trapped, in immediate danger, isolated, or explicitly requesting emergency evacuation/support.

Priority metric for downstream models: recall on label `1`.

## Data Source

- Source type: Vietnamese Facebook comments from a disaster-related post/thread.
- Data was flattened from both top-level comments and nested replies.
- Original extraction and weak labeling were produced locally for research and experimentation.

## Data Fields

- `id`: Stable synthetic identifier derived from comment tree position.
- `text`: Raw Vietnamese comment text.
- `label`: Binary weak label (`0` or `1`).
- `confidence`: Heuristic confidence score from the auto-labeling pipeline.

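
The task definition above names recall on label `1` as the priority metric. As a minimal sketch of what that measures, the following computes recall for the positive class from two lists of labels. The function name `recall_on_positive` and the toy labels are illustrative assumptions, not part of the dataset or its tooling.

```python
def recall_on_positive(y_true, y_pred, positive=1):
    """Recall for one class: TP / (TP + FN).

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: rescue requests the model caught
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False negatives: rescue requests the model missed
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        # No positive examples at all; recall is undefined, 0.0 is returned by convention here
        return 0.0
    return tp / (tp + fn)


# Toy labels, not drawn from the dataset
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(recall_on_positive(y_true, y_pred))
```

In the toy example, two of the three true rescue requests are predicted as `1`, so the function reports a recall of two thirds. A missed rescue request (false negative) lowers this number, while a false alarm does not, which is why it suits a triage setting.
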
## Class Distribution

Current version statistics:

- Total rows: `1492`
- Label `0`: `1050`
- Label `1`: `442`

## Labeling Method

Labels were assigned with a rule-based weak supervision pipeline using:

- urgent rescue keywords such as `cứu`, `mắc kẹt`, `ngập tới mái`, `khẩn cấp`, `SOS`
- structural signals such as phone numbers, map links, GPS-like coordinates, and location mentions
- negative filters for resolved cases (`đã được cứu`, `đã an toàn`) and hotline/broadcast style comments

Because labels are weakly supervised, false positives and false negatives remain possible.

## Recommended Use

- Training or bootstrapping a Vietnamese emergency comment classifier
- Error analysis and heuristic refinement
- Human-in-the-loop triage experiments

## Limitations and Ethics

- Comments may contain sensitive situational information such as phone numbers and addresses.
- Labels are machine-generated and not fully human-verified.
- Do not use this dataset for surveillance or any harmful downstream purpose.
- Review privacy, legal, and platform-policy constraints before redistribution or deployment.

## Citation

If you use this dataset, please cite the project/repository that publishes this dataset card and describe it as a weakly supervised Vietnamese rescue-request classification dataset.
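
The labeling method described above combines urgent keywords, structural signals, and negative filters. The sketch below shows one way such rules could fit together. It is not the author's pipeline: the keyword and resolved-phrase lists come from the card, but the regular expressions, the function name `weak_label`, and the rule "keyword plus at least one structural signal" are assumptions made for illustration.

```python
import re
import unicodedata


def _nfc(s):
    # Vietnamese diacritics can be encoded in composed or decomposed form; compare in NFC
    return unicodedata.normalize("NFC", s).lower()


# Phrases listed in the dataset card
URGENT_KEYWORDS = tuple(_nfc(k) for k in ("cứu", "mắc kẹt", "ngập tới mái", "khẩn cấp", "SOS"))
RESOLVED_PHRASES = tuple(_nfc(p) for p in ("đã được cứu", "đã an toàn"))

# Assumed patterns (hypothetical, not taken from the card):
PHONE_RE = re.compile(r"\b0\d{9}\b")  # 10-digit number starting with 0
COORD_RE = re.compile(r"-?\d{1,3}\.\d{3,}\s*,\s*-?\d{1,3}\.\d{3,}")  # "lat, lon" decimals
MAP_LINK_RE = re.compile(r"https?://\S*(?:maps|goo\.gl)\S*", re.IGNORECASE)


def weak_label(text):
    """Return 1 (cau_cuu) or 0 (khong_phai_cau_cuu) under the assumed rules."""
    lowered = _nfc(text)
    # Negative filter first: "đã được cứu" also contains the keyword "cứu"
    if any(p in lowered for p in RESOLVED_PHRASES):
        return 0
    has_keyword = any(k in lowered for k in URGENT_KEYWORDS)
    has_structure = bool(
        PHONE_RE.search(text) or COORD_RE.search(text) or MAP_LINK_RE.search(text)
    )
    # Assumed decision rule: require both an urgent keyword and a structural signal
    return 1 if has_keyword and has_structure else 0


print(weak_label("Cứu với, nhà ngập tới mái, gọi 0912345678"))
print(weak_label("Gia đình đã an toàn, cảm ơn mọi người"))
```

Running the negative filter before the keyword check matters because resolved phrases such as `đã được cứu` contain the urgent keyword `cứu`. Rules of this kind are brittle, which is consistent with the card's warning that false positives and false negatives remain possible.
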

Provider:

dat201204
