abhinav0231/Sarvam-105b-Distill-100k

Source: Hugging Face. Updated 2026-04-10. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/abhinav0231/Sarvam-105b-Distill-100k
Resource overview:
---
pretty_name: Sarvam 105B Distill 100K Single Turn
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- reasoning
- distillation
- chatml
- sharegpt
- thinking
size_categories:
- 100K<n<1M
configs:
- config_name: thinking
  data_files:
  - split: train
    path: thinking/train.jsonl
  - split: validation
    path: thinking/validation.jsonl
  - split: test
    path: thinking/test.jsonl
- config_name: sharegpt
  data_files:
  - split: train
    path: sharegpt/train.jsonl
  - split: validation
    path: sharegpt/validation.jsonl
  - split: test
    path: sharegpt/test.jsonl
- config_name: chatml
  data_files:
  - split: train
    path: chatml/train.jsonl
  - split: validation
    path: chatml/validation.jsonl
  - split: test
    path: chatml/test.jsonl
- config_name: simple_qa
  data_files:
  - split: train
    path: simple_qa/train.jsonl
  - split: validation
    path: simple_qa/validation.jsonl
  - split: test
    path: simple_qa/test.jsonl
---

# Sarvam 105B Distill 100K Single Turn

## Dataset Summary

Single-turn reasoning distillation data generated from the Sarvam 105B model, covering science, math, code, law, health, history, geography, and economics.

## Source

- Input JSONL: `distillation_pipeline\dataset_final_p1_100k\full_dataset.jsonl`
- Generated at: 2026-04-10T08:07:21.157322+00:00

## Splits

- Train: 96000
- Validation: 2000
- Test: 2000

## Distribution Counts

### Domain

- coding_computer_science: 16667
- creative_planning_openended: 3809
- economics_finance: 6667
- health_medicine: 4762
- history_geography_civics: 7619
- language_writing_rhetoric: 8571
- law_ethics: 5238
- logic_formal_reasoning: 12381
- mathematics: 19048
- science_stem: 15238

### Difficulty

- easy: 19532
- hard: 23706
- medium: 56762

### Phase

- 1: 100000

### Language

- english: 100000

### Turn Type

- single: 100000

## Token Budget

- Prompt tokens: 19536445
- Completion tokens: 172851392
- Total tokens: 192387837

## Coverage

- Unique subskills: 77
- Unique question formats: 67

## Multi-turn Conversation Length Distribution

- No multi-turn conversations present in this run

## Quality Score Per Domain

- If quality_score is missing in source records, this section remains empty.
- quality_score unavailable in source records

## Schemas

### thinking (primary)

- Native reasoning-preserving schema with separate thinking and response fields.
- Fields: messages, thinking, response.

### sharegpt (compatibility)

- ShareGPT-compatible conversations schema.
- Final assistant turn includes `<think>...</think>` followed by the final answer.

### chatml (tokenizer-ready)

- Preformatted ChatML text for direct tokenizer pipelines.
- Uses `<|im_start|>` and `<|im_end|>` markers.

### simple_qa (tabular-friendly)

- Flat schema for supervised finetuning and analytics.
- Fields: system_prompt, question, thinking, answer.
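
To make the relationship between the four schemas concrete, here is a minimal Python sketch of how one `simple_qa` row might map onto the `sharegpt` and `chatml` layouts. Only the field names `system_prompt`, `question`, `thinking`, and `answer` come from the card. The ShareGPT keys (`conversations`, `from`, `value`), the role names, the whitespace around the `<think>` block, and the example row are all assumptions made for illustration, so the published files may differ in detail.

```python
# Illustrative only: the card names the simple_qa fields, but the exact
# layout of the sharegpt and chatml records is an assumption here.


def assistant_text(thinking: str, answer: str) -> str:
    """Wrap the reasoning in <think> tags and append the final answer."""
    return f"<think>\n{thinking}\n</think>\n\n{answer}"


def to_sharegpt(row: dict) -> dict:
    """Build a ShareGPT-style record (assumed keys: conversations/from/value)."""
    return {
        "conversations": [
            {"from": "system", "value": row["system_prompt"]},
            {"from": "human", "value": row["question"]},
            {"from": "gpt", "value": assistant_text(row["thinking"], row["answer"])},
        ]
    }


def to_chatml(row: dict) -> str:
    """Render the same row as ChatML text with <|im_start|>/<|im_end|> markers."""
    turns = [
        ("system", row["system_prompt"]),
        ("user", row["question"]),
        ("assistant", assistant_text(row["thinking"], row["answer"])),
    ]
    return "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in turns
    )


# Hypothetical row, invented for illustration (not taken from the dataset).
example_row = {
    "system_prompt": "You are a helpful assistant.",
    "question": "What is 2 + 3?",
    "thinking": "Add the two integers: 2 + 3 = 5.",
    "answer": "5",
}

sharegpt_record = to_sharegpt(example_row)
chatml_text = to_chatml(example_row)
print(chatml_text)
```

Real rows could presumably be obtained with the Hugging Face `datasets` library, for example `load_dataset("abhinav0231/Sarvam-105b-Distill-100k", "simple_qa")`, where the second argument is one of the config names listed in the card's front matter. Inspecting a few loaded records before relying on the mapping above is advisable, since the converters encode assumptions rather than the author's pipeline.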
Provider:
abhinav0231