abhinav0231/Sarvam-105b-Distill-100k
Source: Hugging Face. Updated: 2026-04-10. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/abhinav0231/Sarvam-105b-Distill-100k
Description:
---
pretty_name: Sarvam 105B Distill 100K Single Turn
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- reasoning
- distillation
- chatml
- sharegpt
- thinking
size_categories:
- 100K<n<1M
configs:
- config_name: thinking
  data_files:
  - split: train
    path: thinking/train.jsonl
  - split: validation
    path: thinking/validation.jsonl
  - split: test
    path: thinking/test.jsonl
- config_name: sharegpt
  data_files:
  - split: train
    path: sharegpt/train.jsonl
  - split: validation
    path: sharegpt/validation.jsonl
  - split: test
    path: sharegpt/test.jsonl
- config_name: chatml
  data_files:
  - split: train
    path: chatml/train.jsonl
  - split: validation
    path: chatml/validation.jsonl
  - split: test
    path: chatml/test.jsonl
- config_name: simple_qa
  data_files:
  - split: train
    path: simple_qa/train.jsonl
  - split: validation
    path: simple_qa/validation.jsonl
  - split: test
    path: simple_qa/test.jsonl
---
# Sarvam 105B Distill 100K Single Turn
## Dataset Summary
Single-turn reasoning distillation from the Sarvam 105B model, covering science, math, code, law, health, history, geography, and economics.
## Source
- Input JSONL: distillation_pipeline\dataset_final_p1_100k\full_dataset.jsonl
- Generated at: 2026-04-10T08:07:21.157322+00:00
## Splits
- Train: 96000
- Validation: 2000
- Test: 2000
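
As a quick sanity check on the split sizes above, the following sketch computes each split's share of the total. The counts are copied from this card; the names `SPLITS` and `split_fractions` are illustrative only.

```python
# Split sizes as listed in this dataset card.
SPLITS = {"train": 96_000, "validation": 2_000, "test": 2_000}


def split_fractions(splits: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of records."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


fractions = split_fractions(SPLITS)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
# train: 96.0%
# validation: 2.0%
# test: 2.0%
```

The three splits add up to 100,000 records, which matches the totals reported in the distribution sections.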
## Distribution Counts
### Domain
- coding_computer_science: 16667
- creative_planning_openended: 3809
- economics_finance: 6667
- health_medicine: 4762
- history_geography_civics: 7619
- language_writing_rhetoric: 8571
- law_ethics: 5238
- logic_formal_reasoning: 12381
- mathematics: 19048
- science_stem: 15238
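
To make the domain mix easier to read, this sketch ranks the domains by their share of the dataset. The counts are copied from the list above; `DOMAIN_COUNTS` and `domain_shares` are hypothetical helper names, not part of the dataset.

```python
# Domain counts as listed in this dataset card.
DOMAIN_COUNTS = {
    "coding_computer_science": 16_667,
    "creative_planning_openended": 3_809,
    "economics_finance": 6_667,
    "health_medicine": 4_762,
    "history_geography_civics": 7_619,
    "language_writing_rhetoric": 8_571,
    "law_ethics": 5_238,
    "logic_formal_reasoning": 12_381,
    "mathematics": 19_048,
    "science_stem": 15_238,
}
TOTAL = sum(DOMAIN_COUNTS.values())


def domain_shares(counts: dict[str, int]) -> list[tuple[str, float]]:
    """Return (domain, share) pairs sorted from largest to smallest share."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, count / total) for name, count in ranked]


for name, share in domain_shares(DOMAIN_COUNTS):
    print(f"{name:<30} {share:6.2%}")
```

Mathematics, coding, and science together account for roughly half of the records, while creative planning is the smallest domain.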
### Difficulty
- easy: 19532
- hard: 23706
- medium: 56762
### Phase
- 1: 100000
### Language
- english: 100000
### Turn Type
- single: 100000
## Token Budget
- Prompt tokens: 19536445
- Completion tokens: 172851392
- Total tokens: 192387837
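
The token totals above can be turned into per-record averages. This is a minimal sketch that assumes the 100,000 records implied by the split sizes; the card does not state which tokenizer produced the counts, so treat the averages as indicative only.

```python
# Token totals as listed in this dataset card.
PROMPT_TOKENS = 19_536_445
COMPLETION_TOKENS = 172_851_392
NUM_RECORDS = 100_000  # assumption: train + validation + test from the Splits section

total_tokens = PROMPT_TOKENS + COMPLETION_TOKENS
avg_prompt = PROMPT_TOKENS / NUM_RECORDS
avg_completion = COMPLETION_TOKENS / NUM_RECORDS
completion_to_prompt = COMPLETION_TOKENS / PROMPT_TOKENS

print(f"total tokens:          {total_tokens:,}")
print(f"avg prompt tokens:     {avg_prompt:.1f}")
print(f"avg completion tokens: {avg_completion:.1f}")
print(f"completion / prompt:   {completion_to_prompt:.2f}")
```

Completions are several times longer than prompts on average, which is consistent with responses that carry full reasoning traces.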
## Coverage
- Unique subskills: 77
- Unique question formats: 67
## Multi-turn Conversation Length Distribution
- No multi-turn conversations present in this run
## Quality Score Per Domain
- Not available: quality_score is missing from the source records, so no per-domain scores are reported.
## Schemas
### thinking (primary)
- Native reasoning-preserving schema with separate thinking and response fields.
- Fields: messages, thinking, response.
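
A minimal sketch of reading records in the thinking schema from JSONL lines, using only the standard library. The three field names come from this card; the shape of `messages` (a list of role/content dictionaries) and the sample values are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator

# Field names listed for the thinking config in this card.
EXPECTED_FIELDS = ("messages", "thinking", "response")


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_fields(record: dict) -> list[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Hypothetical sample line; the real records may structure `messages` differently.
sample_line = json.dumps({
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "thinking": "2 + 2 = 4",
    "response": "The answer is 4.",
})
records = list(parse_jsonl([sample_line, ""]))
print(missing_fields(records[0]))
```

For a local copy of the data, the same function can be fed an open file handle, for example `parse_jsonl(open("thinking/train.jsonl", encoding="utf-8"))`.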
### sharegpt (compatibility)
- ShareGPT-compatible conversations schema.
- Final assistant turn includes `<think>...</think>` followed by the final answer.
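
Because the sharegpt config embeds the reasoning in the final assistant turn, a consumer may want to separate it from the answer again. The sketch below does this with a regular expression. The `conversations`, `from`, and `value` keys follow the common ShareGPT convention and are an assumption here, as are the sample values; the card does not list the exact keys.

```python
import re

# Matches a leading <think>...</think> block followed by the final answer.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_reasoning(assistant_text: str) -> tuple[str, str]:
    """Split an assistant turn into (thinking, answer).

    If no <think> block is found, thinking is returned as an empty string.
    """
    text = assistant_text.strip()
    match = THINK_PATTERN.match(text)
    if match is None:
        return "", text
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical record using the conventional ShareGPT keys (assumed, not confirmed).
sample = {
    "conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "<think>2 + 2 = 4</think>\nThe answer is 4."},
    ]
}
thinking, answer = split_reasoning(sample["conversations"][-1]["value"])
print(thinking)  # 2 + 2 = 4
print(answer)    # The answer is 4.
```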
### chatml (tokenizer-ready)
- Preformatted ChatML text for direct tokenizer pipelines.
- Uses `<|im_start|>` and `<|im_end|>` markers.
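
For readers unfamiliar with ChatML, this sketch renders a list of messages with the two markers named above. It follows the commonly used layout (role on the marker line, content on the following lines); the exact whitespace and the placement of the thinking text in this dataset's chatml files are not specified in the card, so this is an assumption.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render role/content messages as ChatML text, one block per message."""
    blocks = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        for message in messages
    ]
    return "\n".join(blocks)


# Hypothetical single-turn conversation for illustration.
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>2 + 2 = 4</think>\nThe answer is 4."},
]
chatml_text = to_chatml(example_messages)
print(chatml_text)
```

Since the chatml config is described as preformatted, this function is mainly useful for checking that text produced from the other configs lines up with it.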
### simple_qa (tabular-friendly)
- Flat schema for supervised finetuning and analytics.
- Fields: system_prompt, question, thinking, answer.
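
The relationship between the thinking and simple_qa schemas can be sketched as a flattening step. The output field names come from this card; the mapping (system message to `system_prompt`, user message to `question`, `response` to `answer`) and the role/content shape of `messages` are assumptions, not a documented conversion.

```python
def to_simple_qa(record: dict) -> dict[str, str]:
    """Flatten a thinking-style record into the four simple_qa fields."""
    # Single-turn data: at most one message per role is assumed.
    by_role = {message["role"]: message["content"] for message in record["messages"]}
    return {
        "system_prompt": by_role.get("system", ""),
        "question": by_role.get("user", ""),
        "thinking": record["thinking"],
        "answer": record["response"],
    }


# Hypothetical thinking-style record for illustration.
thinking_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    "thinking": "2 + 2 = 4",
    "response": "The answer is 4.",
}
row = to_simple_qa(thinking_record)
print(list(row))  # ['system_prompt', 'question', 'thinking', 'answer']
```

A flat row like this loads directly into tabular tools, which is the use case the simple_qa config is described as serving.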
Provider:
abhinav0231



