Jianshu001/arabic-daily-smoke-v4-10

Name: Jianshu001/arabic-daily-smoke-v4-10
Creator: Jianshu001
Published: 2026-04-24 09:27:16
License: No description provided

Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Jianshu001/arabic-daily-smoke-v4-10


Description:

Smoke v4 is a small Arabic dataset of 7 records produced by a protocol-compliant pipeline. Generation uses the v4 system prompt, which covers 5 sensitive domains and enforces specific safety rules: anti-authority, anti-praise, anti-article, and multi-turn coherence. A Gemma-as-rewriter pass then cleans the output, removing draft scaffolding, role/style labels, and similar artifacts. Each record is judged by gpt-5.4-mini on 6 dimensions (realism, assistant_quality, multi_turn, domain_fit, safety, integrity); any record with a dirty dimension is dropped rather than rewritten, to preserve the thinking↔text correspondence. Of the 10 generated records, 7 passed all 6 judge dimensions. In the schema, each record is a user/assistant dialogue: user turns carry the fields turn, role, and text, and assistant turns carry turn, role, thinking, and text.
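The schema and the drop-on-any-failure rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the record layout (a `verdicts` mapping per record), the verdict strings `"clean"` and `"dirty"`, the role values `"user"` and `"assistant"`, and all function names are assumptions introduced here. Only the six dimension names and the per-role field lists come from the description.

```python
from typing import Dict, List

# The six judge dimensions named in the description.
JUDGE_DIMENSIONS = (
    "realism", "assistant_quality", "multi_turn",
    "domain_fit", "safety", "integrity",
)

# Field sets per role, as listed in the description.
USER_FIELDS = {"turn", "role", "text"}
ASSISTANT_FIELDS = {"turn", "role", "thinking", "text"}


def is_valid_turn(turn: dict) -> bool:
    """Check that a turn has exactly the fields expected for its role.

    Assumption: the role field holds the string "user" or "assistant".
    """
    role = turn.get("role")
    if role == "assistant":
        return set(turn) == ASSISTANT_FIELDS
    if role == "user":
        return set(turn) == USER_FIELDS
    return False


def passes_all_dimensions(verdicts: Dict[str, str]) -> bool:
    """Return True only if every dimension was judged clean.

    Assumption: verdicts are the strings "clean" or "dirty".
    A missing dimension is treated as a failure.
    """
    return all(verdicts.get(dim) == "clean" for dim in JUDGE_DIMENSIONS)


def filter_records(records: List[dict]) -> List[dict]:
    """Drop any record with a failing dimension; survivors are not rewritten.

    Assumption: each record stores its judge results under a "verdicts" key.
    """
    return [r for r in records if passes_all_dimensions(r["verdicts"])]
```

Dropping a record outright, instead of patching the failing part, means the `thinking` and `text` fields of surviving assistant turns are left exactly as generated and cleaned, which is the correspondence the description says the pipeline aims to preserve.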

Provider:

Jianshu001
