ssf-synthetic-data-for-retriever
收藏数据集概述:ssf-synthetic-data-for-retriever
数据集基本信息
- 数据集名称:ssf-synthetic-data-for-retriever
- 数据集大小:n<1K
- 标签:synthetic, distilabel, rlaif
- 创建工具:distilabel
数据集配置
数据集包含以下三个配置:
1. generate_retrieval_pairs_easy
- 特征:
- Sector: string
- Track: string
- Job Role: string
- anchor: string
- Performance Expectation: string
- positive: string
- negative: string
- distilabel_metadata: struct
- raw_input_generate_retrieval_pairs_easy: list
- content: string
- role: string
- raw_output_generate_retrieval_pairs_easy: string
- statistics_generate_retrieval_pairs_easy: struct
- input_tokens: int64
- output_tokens: int64
- raw_input_generate_retrieval_pairs_easy: list
- model_name: string
- 数据分割:
- train: 1,885 个样本,10,169,548 字节
- 下载大小:2,774,194 字节
- 数据集大小:10,169,548 字节
2. generate_retrieval_pairs_easy_v2
- 特征:
- 同 generate_retrieval_pairs_easy,但特征名称中的 "easy" 替换为 "easy_v2"
- 数据分割:
- train: 1,885 个样本,10,177,804 字节
- 下载大小:2,784,141 字节
- 数据集大小:10,177,804 字节
3. generate_retrieval_pairs_hard
- 特征:
- 同 generate_retrieval_pairs_easy,但特征名称中的 "easy" 替换为 "hard"
- 数据分割:
- train: 1,885 个样本,10,861,267 字节
- 下载大小:2,867,956 字节
- 数据集大小:10,861,267 字节
数据集结构示例
generate_retrieval_pairs_hard 示例
json { "Job Role": "Audit Associate / Audit Assistant Associate", "Performance Expectation": "In accordance with: Singapore Standards on Auditing, Ethics Pronouncements in Singapore, Singapore Companies Act, and Singapore Financial Reporting Standards", "Sector": "Accountancy", "Track": "Assurance", "anchor": "The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision...", "distilabel_metadata": { "raw_input_generate_retrieval_pairs_hard": [...], "raw_output_generate_retrieval_pairs_hard": "## Positive audit assistant associate job description
Negative
risk management analyst", "statistics_generate_retrieval_pairs_hard": { "input_tokens": 606, "output_tokens": 15 } }, "model_name": "Qwen/Qwen2.5-VL-3B-Instruct", "negative": "risk management analyst", "positive": "audit assistant associate job description" }
加载方式
python from datasets import load_dataset
加载 generate_retrieval_pairs_hard
ds = load_dataset("dnth/ssf-synthetic-data-for-retriever", "generate_retrieval_pairs_hard")
加载 generate_retrieval_pairs_easy
ds = load_dataset("dnth/ssf-synthetic-data-for-retriever", "generate_retrieval_pairs_easy")
加载 generate_retrieval_pairs_easy_v2
ds = load_dataset("dnth/ssf-synthetic-data-for-retriever", "generate_retrieval_pairs_easy_v2")
数据集生成
数据集可通过 distilabel CLI 使用提供的 pipeline.yaml 文件重新生成:
console
distilabel pipeline run --config "https://huggingface.co/datasets/dnth/ssf-synthetic-data-for-retriever/raw/main/pipeline.yaml"




