douyipu-real/mosaic
收藏Hugging Face2026-03-19 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/douyipu-real/mosaic
下载链接
链接失效反馈官方服务:
资源简介:
---
pretty_name: MOSAIC Dataset
language:
- en
license: other
task_categories:
- text-generation
tags:
- alignment
- safety
- instruction-following
- dataset-curation
- multi-objective-optimization
size_categories:
- 10K<n<100K
configs:
- config_name: source_pools
data_files:
- split: xguard
path: data/source_pools/xguard-train.parquet
- split: orbench
path: data/source_pools/orbench.parquet
- split: ifeval
path: data/source_pools/tulu-3.parquet
- config_name: mosaic_search_subsets
data_files:
- split: iter_00
path: data/search_subsets/iter_00_train_data.parquet
- split: iter_01
path: data/search_subsets/iter_01_train_data.parquet
- split: iter_02
path: data/search_subsets/iter_02_train_data.parquet
- split: iter_03
path: data/search_subsets/iter_03_train_data.parquet
- split: iter_04
path: data/search_subsets/iter_04_train_data.parquet
- config_name: mosaic_search_log
data_files:
- split: train
path: data/metadata/search_iterations.parquet
---
# MOSAIC Dataset
This repository packages the public MOSAIC data artifacts from the paper **"MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment."**
`MOSAIC` is short for **Multi-Objective Slice-Aware Iterative Curation for Alignment**.
It contains three annotated source training pools and five training subsets selected by the MOSAIC search loop under a fixed 1M-token budget. The release also includes flattened iteration metadata so the search trajectory can be inspected directly inside the Hugging Face dataset viewer.
## Included configs
### `source_pools`
The three pre-scored training pools used by MOSAIC:
- `xguard`: 30695 rows
- `orbench`: 7927 rows
- `ifeval`: 9076 rows
These parquet files retain the slice-level supervision fields used by the search loop, such as `score`, `slice`, and `need`.
### `mosaic_search_subsets`
Five iteration-specific training subsets selected by the MOSAIC closed loop:
- `iter_00` to `iter_04`
Together they contain 7070 rows across the five released subsets.
### `mosaic_search_log`
A flattened table with one row for the baseline and one row per released iteration. It includes:
- per-objective scores
- dataset-level mixture weights
- bucket weights
- focus criteria
- selected row counts
- estimated token counts
- final train loss and runtime
The `mosaic_search_log` table has 6 rows.
## Data fields
### Source pools
The source pools are multi-turn or single-turn conversation records with nested `messages` fields. Depending on the source, they also contain:
- `metadata`
- `prompt`
- `constraints`
- `score`
- `slice`
- `need`
- task-specific diagnostic columns such as `_pressure`, `_concealment`, `_safety`, `_refusal_type`, `_help_level`, `_friction`, `_inst_complexity`, and `_failed_constraints`
### Search subsets
The released subset files contain the exact training slices sampled by the search loop. Common fields include:
- `messages`
- `source_id`
- `window_id`
- `metadata`
- `need`
- `prompt`
- `constraints`
- `n_windows_total`
## Safety note
This release includes safety-alignment data and therefore contains harmful, adversarial, and jailbreak-oriented prompts. It is intended for research on safety evaluation, over-refusal calibration, and alignment data construction. Do not deploy it as a user-facing dataset without an additional review layer.
## Redistribution note
This repository is a curated release of experiment artifacts. Some data pools are derived from upstream datasets and benchmarks. Before making the repository public, verify that redistribution is compatible with the licenses and terms of the original sources used to build these files.
## Summary statistics
- Total source-pool rows: 47698
- Total released search-subset rows: 7070
- Source-pool payload: 304.62 MB
- Search-subset payload: 12.77 MB
## Additional metadata
The following files are included outside the dataset configs:
- `metadata/iteration_summary.json`
- `metadata/config.json`
- `metadata/full_history.json`
- `metadata/pareto_archive.json`
- `metadata/final_report.md`
- `metadata/release_manifest.json`
These files are useful when reproducing the paper or inspecting the full search trajectory outside the dataset viewer.
提供机构:
douyipu-real



