
douyipu-real/mosaic

Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/douyipu-real/mosaic
Resource description:
---
pretty_name: MOSAIC Dataset
language:
- en
license: other
task_categories:
- text-generation
tags:
- alignment
- safety
- instruction-following
- dataset-curation
- multi-objective-optimization
size_categories:
- 10K<n<100K
configs:
- config_name: source_pools
  data_files:
  - split: xguard
    path: data/source_pools/xguard-train.parquet
  - split: orbench
    path: data/source_pools/orbench.parquet
  - split: ifeval
    path: data/source_pools/tulu-3.parquet
- config_name: mosaic_search_subsets
  data_files:
  - split: iter_00
    path: data/search_subsets/iter_00_train_data.parquet
  - split: iter_01
    path: data/search_subsets/iter_01_train_data.parquet
  - split: iter_02
    path: data/search_subsets/iter_02_train_data.parquet
  - split: iter_03
    path: data/search_subsets/iter_03_train_data.parquet
  - split: iter_04
    path: data/search_subsets/iter_04_train_data.parquet
- config_name: mosaic_search_log
  data_files:
  - split: train
    path: data/metadata/search_iterations.parquet
---

# MOSAIC Dataset

This repository packages the public MOSAIC data artifacts from the paper **"MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment."**

`MOSAIC` is short for **Multi-Objective Slice-Aware Iterative Curation for Alignment**.

It contains three annotated source training pools and five training subsets selected by the MOSAIC search loop under a fixed 1M-token budget. The release also includes flattened iteration metadata so the search trajectory can be inspected directly inside the Hugging Face dataset viewer.

## Included configs

### `source_pools`

The three pre-scored training pools used by MOSAIC:

- `xguard`: 30695 rows
- `orbench`: 7927 rows
- `ifeval`: 9076 rows

These parquet files retain the slice-level supervision fields used by the search loop, such as `score`, `slice`, and `need`.

### `mosaic_search_subsets`

Five iteration-specific training subsets selected by the MOSAIC closed loop:

- `iter_00` to `iter_04`

Together they contain 7070 rows across the five released subsets.
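A minimal sketch of how the configs described so far might be loaded with the Hugging Face `datasets` library. The config and split names are copied from the card's `configs` block; the repo id is assumed from the page title, and `CONFIG_SPLITS` and `load_mosaic_split` are hypothetical helper names, not part of the release. The `datasets` import is deferred so the name table can be used without the library installed.

```python
"""Sketch: load one split of the MOSAIC release (helper names are hypothetical)."""

REPO_ID = "douyipu-real/mosaic"  # assumption: repo id as shown in the page title

# Config -> split names, copied from the `configs` block of the dataset card.
CONFIG_SPLITS = {
    "source_pools": ["xguard", "orbench", "ifeval"],
    "mosaic_search_subsets": [f"iter_{i:02d}" for i in range(5)],
    "mosaic_search_log": ["train"],
}


def load_mosaic_split(config: str, split: str):
    """Return one split as a `datasets.Dataset`; requires network access."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config!r}/{split!r}")
    # Imported lazily so that name validation works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_mosaic_split("source_pools", "orbench")` should return the split that the card lists at 7927 rows, provided the repository is reachable and the layout matches the card.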
### `mosaic_search_log`

A flattened table with one row for the baseline and one row per released iteration. It includes:

- per-objective scores
- dataset-level mixture weights
- bucket weights
- focus criteria
- selected row counts
- estimated token counts
- final train loss and runtime

The `mosaic_search_log` table has 6 rows.

## Data fields

### Source pools

The source pools are multi-turn or single-turn conversation records with nested `messages` fields. Depending on the source, they also contain:

- `metadata`
- `prompt`
- `constraints`
- `score`
- `slice`
- `need`
- task-specific diagnostic columns such as `_pressure`, `_concealment`, `_safety`, `_refusal_type`, `_help_level`, `_friction`, `_inst_complexity`, and `_failed_constraints`

### Search subsets

The released subset files contain the exact training slices sampled by the search loop. Common fields include:

- `messages`
- `source_id`
- `window_id`
- `metadata`
- `need`
- `prompt`
- `constraints`
- `n_windows_total`

## Safety note

This release includes safety-alignment data and therefore contains harmful, adversarial, and jailbreak-oriented prompts. It is intended for research on safety evaluation, over-refusal calibration, and alignment data construction. Do not deploy it as a user-facing dataset without an additional review layer.

## Redistribution note

This repository is a curated release of experiment artifacts. Some data pools are derived from upstream datasets and benchmarks. Before making the repository public, verify that redistribution is compatible with the licenses and terms of the original sources used to build these files.
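The field lists in the "Data fields" section can be turned into a quick schema check. The sketch below is illustrative only: `missing_fields` and the two tuple names are hypothetical, and the example record is made up rather than taken from the release.

```python
"""Sketch: check a record against the fields the card documents (names hypothetical)."""

# Common search-subset fields, copied from the "Search subsets" list.
SEARCH_SUBSET_FIELDS = (
    "messages", "source_id", "window_id", "metadata",
    "need", "prompt", "constraints", "n_windows_total",
)

# Slice-level supervision fields that the source pools retain.
SUPERVISION_FIELDS = ("score", "slice", "need")


def missing_fields(record: dict, expected=SEARCH_SUBSET_FIELDS) -> list:
    """Return the documented field names absent from `record`, in card order."""
    return [name for name in expected if name not in record]


# Made-up record for illustration only; real rows come from the parquet files.
example = {"messages": [], "source_id": "demo", "window_id": 0, "need": None}
print(missing_fields(example))
# -> ['metadata', 'prompt', 'constraints', 'n_windows_total']
```

Because the card describes these as common fields and says the source-pool columns depend on the source, a non-empty result is a prompt to inspect the row, not necessarily an error.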
## Summary statistics

- Total source-pool rows: 47698
- Total released search-subset rows: 7070
- Source-pool payload: 304.62 MB
- Search-subset payload: 12.77 MB

## Additional metadata

The following files are included outside the dataset configs:

- `metadata/iteration_summary.json`
- `metadata/config.json`
- `metadata/full_history.json`
- `metadata/pareto_archive.json`
- `metadata/final_report.md`
- `metadata/release_manifest.json`

These files are useful when reproducing the paper or inspecting the full search trajectory outside the dataset viewer.
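The reported totals can be cross-checked against the per-pool row counts, and the metadata files can be fetched individually. The sketch below assumes the listed paths are relative to the repository root and that the repo id matches the page title; `fetch_metadata_json` is a hypothetical helper, and `hf_hub_download` is only imported when it is called.

```python
"""Sketch: cross-check the card's totals and fetch one metadata file."""
import json

REPO_ID = "douyipu-real/mosaic"  # assumption: repo id as shown in the page title

# Row counts as reported in the card.
SOURCE_POOL_ROWS = {"xguard": 30695, "orbench": 7927, "ifeval": 9076}
REPORTED_POOL_TOTAL = 47698
REPORTED_SUBSET_TOTAL = 7070
N_SUBSETS = 5

pool_total = sum(SOURCE_POOL_ROWS.values())
assert pool_total == REPORTED_POOL_TOTAL  # the per-pool counts add up

# Mean rows per released subset; the card gives only the combined total.
mean_subset_rows = REPORTED_SUBSET_TOTAL / N_SUBSETS
print(pool_total, mean_subset_rows)  # -> 47698 1414.0

# File names copied from the "Additional metadata" list.
METADATA_FILES = [
    "metadata/iteration_summary.json",
    "metadata/config.json",
    "metadata/full_history.json",
    "metadata/pareto_archive.json",
    "metadata/final_report.md",
    "metadata/release_manifest.json",
]


def fetch_metadata_json(filename: str):
    """Download one listed JSON metadata file and parse it; needs network access."""
    if filename not in METADATA_FILES or not filename.endswith(".json"):
        raise ValueError(f"not a listed JSON metadata file: {filename!r}")
    # Assumption: the path is relative to the root of the dataset repository.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=REPO_ID, filename=filename, repo_type="dataset"
    )
    with open(local_path, encoding="utf-8") as handle:
        return json.load(handle)
```

The mean of 1414 rows per subset is only an average of the combined total; individual iterations may differ, and the `mosaic_search_log` table is the place to look for per-iteration selected row counts.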
Provider:
douyipu-real