
Jianshu001/arabic-daily-smoke-v4-10

Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Jianshu001/arabic-daily-smoke-v4-10
Description:

The Smoke v4 dataset is a small Arabic dataset of 7 records, generated through a protocol-compliant pipeline. Generation uses the v4 system prompt, which covers 5 sensitive domains and applies specific safety rules such as anti-authority, anti-praise, anti-article, and multi-turn coherence requirements. The output is then cleaned by Gemma-as-rewriter, a pass that covers draft scaffolding, role/style labels, and similar artifacts. Each record is judged by gpt-5.4-mini on 6 dimensions (realism, assistant_quality, multi_turn, domain_fit, safety, integrity). Any record with a dirty dimension is dropped rather than rewritten, which preserves the thinking↔text correspondence. Of the 10 generated records, 7 passed all 6 judge dimensions. In the schema, each user message has the fields turn, role, and text; each assistant message has the fields turn, role, thinking, and text.
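The drop-on-any-failure rule and the per-role field sets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a `verdicts` dict mapping each dimension to a boolean) and the helper names `passes_all_dimensions`, `has_expected_fields`, and `filter_records` are hypothetical and do not come from the dataset card. Only the six dimension names and the field names are taken from the description.

```python
from typing import Dict, List

# Dimension and field names as listed in the dataset description.
JUDGE_DIMENSIONS = (
    "realism", "assistant_quality", "multi_turn",
    "domain_fit", "safety", "integrity",
)
USER_FIELDS = {"turn", "role", "text"}
ASSISTANT_FIELDS = {"turn", "role", "thinking", "text"}


def passes_all_dimensions(verdicts: Dict[str, bool]) -> bool:
    """Return True only if every judge dimension is marked clean.

    Assumption: `verdicts` maps dimension name -> True (clean) / False (dirty).
    A missing dimension is treated as a failure.
    """
    return all(verdicts.get(dim, False) for dim in JUDGE_DIMENSIONS)


def has_expected_fields(message: dict) -> bool:
    """Check that a message carries exactly the fields listed for its role."""
    if message.get("role") == "assistant":
        return set(message) == ASSISTANT_FIELDS
    return set(message) == USER_FIELDS


def filter_records(records: List[dict]) -> List[dict]:
    """Drop (never rewrite) any record with at least one dirty dimension."""
    return [r for r in records if passes_all_dimensions(r["verdicts"])]


if __name__ == "__main__":
    clean = {dim: True for dim in JUDGE_DIMENSIONS}
    dirty = dict(clean, safety=False)
    kept = filter_records([{"verdicts": clean}, {"verdicts": dirty}])
    print(len(kept))
```

Dropping a failing record outright, instead of patching it, means the assistant's `thinking` field is never edited separately from its `text` field, which is the correspondence the description says the pipeline is trying to preserve.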
Provider:
Jianshu001