
anttip/Tunesets_Edu_v4

Hugging Face · Updated 2026-02-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/anttip/Tunesets_Edu_v4
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
- fr
- de
- pt
- it
- id
- es
- ru
- sm
- mi
- vi
- tr
- nl
- th
- zh
- el
- ja
size_categories:
- 1M<n<10M
---

# Tunesets_Edu_v4

## Overview

A filtered, high-quality dataset blend for finetuning education-domain LLMs. The task focus is on non-reasoning instruction following, with contexts under 131k tokens. The domain focus is on non-code and non-math tasks, including multilingual data.

This dataset filters and samples data from the following datasets. Only commercially usable subsets of the datasets have been included.

- [allenai/WildChat-4.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M)
- [arcee-ai/The-Tome](https://huggingface.co/datasets/arcee-ai/The-Tome)
- [microsoft/orca-agentinstruct-1M-v1](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1)
- [MaziyarPanahi/Llama-Nemotron-Post-Training-Dataset-v1-ShareGPT](https://huggingface.co/datasets/MaziyarPanahi/Llama-Nemotron-Post-Training-Dataset-v1-ShareGPT)
- [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
- [CohereLabs/aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split)
- [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)
- [TIGER-Lab/WebInstruct-CFT](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-CFT)
- [nvidia/OpenScience](https://huggingface.co/datasets/nvidia/OpenScience)
- [HuggingFaceTB/smoltalk2](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2)
- [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
- [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
- [prometheus-eval/Preference-Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection)
- [argilla/magpie-ultra-v1.0](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0)
- [NousResearch/Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset)
- [ytz20/LMSYS-Chat-GPT-5-Chat-Response](https://huggingface.co/datasets/ytz20/LMSYS-Chat-GPT-5-Chat-Response)
- [sequelbox/Celestia3-DeepSeek-R1-0528](https://huggingface.co/datasets/sequelbox/Celestia3-DeepSeek-R1-0528)
- [umd-zhou-lab/Reflect_WizV2_All](https://huggingface.co/datasets/umd-zhou-lab/Reflect_WizV2_All)
- [arcee-ai/EvolKit-75K](https://huggingface.co/datasets/arcee-ai/EvolKit-75K)
- [sequelbox/Raiden-DeepSeek-R1](https://huggingface.co/datasets/sequelbox/Raiden-DeepSeek-R1)
- [umd-zhou-lab/Reflect_Wiz70_All](https://huggingface.co/datasets/umd-zhou-lab/Reflect_Wiz70_All)
- [prometheus-eval/Feedback-Collection](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection)
- [umd-zhou-lab/Reflect_Alpaca_All](https://huggingface.co/datasets/umd-zhou-lab/Reflect_Alpaca_All)
- [allenai/WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M)
- [Open-Orca/SlimOrca-Dedup](https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup)
- [CohereLabs/aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset)
- [LDJnr/Capybara](https://huggingface.co/datasets/LDJnr/Capybara)
- [grammarly/coedit](https://huggingface.co/datasets/grammarly/coedit)

A subset of languages from aya_collection_language_split was selected to form a new dataset, "aya_collection_merged": french, german, spanish, italian, indonesian, japanese, chinese, standard_arabic, dutch, greek, korean, standard_malay, maori, portuguese, samoan, thai, turkish

The data from the datasets is exactly as in the originals. Only filtering and sampling have been applied to get a higher-quality dataset.

## Data Filtering and Sampling

The datasets were processed in the following order:

1. Rule-based noise and length filtering
2. Deduplication of conversations using Minhash and string similarities
3. Filtering and balanced sampling based on LLM classifications

The dataset rows were classified for quality and categories using AWQ-quantized LLMs:

- [arcee-ai/SuperNova-Medius](https://huggingface.co/AMead10/SuperNova-Medius-AWQ)
- [arcee-ai/Arcee-Blitz](https://huggingface.co/arcee-ai/Arcee-Blitz-AWQ) (for <32k context tokens)
- [Qwen/Qwen3-14B-AWQ](https://huggingface.co/Qwen/Qwen3-14B-AWQ) (for >=32k context tokens)

The following prompt was used, and the classifications between these two models were merged.

```
You are a senior data analyst. The following is a discussion between a human user and AI assistant. Evaluate the discussion and the performance of the AI, and fill the following json template:
{
    "discussion_language": # Main language of the discussion.
    "discussion_category": # Task category of the discussion. 1 or 2 keywords.
    "response_difficulty": # Level of expertise required in the topic. Easy/Medium/Hard
    "response_quality": # Quality of the assistant's responses. Bad/Average/Good
    "response_complete": # The AI gives complete responses to the requests. Yes/No
    "response_errors": # The AI responses contain a clear error. Yes/No
    "response_concise": # The AI responses are concise with no irrelevant parts. Yes/No
    "overall_grade": # Overall grade of the discussion as LLM finetuning data. From 1 to 5, where 1 is useless, 5 is perfect.
}
Don't give any explanations, just fill the above json template. Here's the discussion to evaluate:
```

## Safety Filtering

Safety classification was used to filter out all unsafe rows. Each conversation is classified as either SAFE or TOXIC using LLM-based evaluation.

The safety classification was performed using:

- [arcee-ai/Arcee-Blitz](https://huggingface.co/arcee-ai/Arcee-Blitz-AWQ) (for <32k context tokens)
- [Qwen/Qwen3-14B-AWQ](https://huggingface.co/Qwen/Qwen3-14B-AWQ) (for >=32k context tokens)

The following prompt was used to classify each conversation:

```
Classify the following conversation as SAFE or TOXIC for use as LLM training data.
Respond with ONLY one word: either "SAFE" or "TOXIC"

TOXIC content includes:
- Hate speech, discrimination, or harassment
- Violence, gore, or harmful instructions
- Sexual content involving minors
- Personal attacks or bullying
- Dangerous misinformation (medical, legal, safety)
- Malicious code or hacking instructions
- Privacy violations or doxxing
- Fraud or scams

SAFE content includes:
- Educational discussions
- Helpful assistance and advice
- Creative writing (fiction)
- Technical information
- General conversation

Conversation to classify:
```

Only conversations classified as SAFE are included in the final dataset.

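
The two LLM-based steps described above (quality grading with the JSON template, then SAFE/TOXIC classification) both depend on turning free-text model replies into structured labels. The sketch below shows one way that post-processing could look. It is not the author's pipeline: the function names, the `MIN_GRADE` threshold, and the choice to drop rows with unparseable replies are all assumptions made for illustration. Only the JSON keys and the SAFE/TOXIC labels come from the prompts.

```python
import json
import re
from typing import Optional

# Keys taken from the JSON template in the quality-grading prompt.
EXPECTED_KEYS = {
    "discussion_language", "discussion_category", "response_difficulty",
    "response_quality", "response_complete", "response_errors",
    "response_concise", "overall_grade",
}

# Assumption: the dataset card does not state which grades were kept.
MIN_GRADE = 4


def parse_grade(reply: str) -> Optional[dict]:
    """Extract the JSON object from a grading reply; None if it is unusable."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        record = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not EXPECTED_KEYS.issubset(record):
        return None
    return record


def parse_safety(reply: str) -> Optional[str]:
    """Map a one-word reply to "SAFE" or "TOXIC"; anything else gives None."""
    word = reply.strip().strip('".').upper()
    return word if word in {"SAFE", "TOXIC"} else None


def keep_row(grade_reply: str, safety_reply: str, min_grade: int = MIN_GRADE) -> bool:
    """Keep a row only if it is graded well enough and explicitly SAFE."""
    grade = parse_grade(grade_reply)
    if grade is None or parse_safety(safety_reply) != "SAFE":
        return False  # assumption: unparseable replies are dropped
    try:
        overall = int(grade["overall_grade"])
    except (TypeError, ValueError):
        return False
    return overall >= min_grade
```

Treating anything other than an explicit SAFE verdict as a rejection is the conservative choice here; a real pipeline might instead retry the request or log the malformed reply for inspection.
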
## Dataset Statistics

Row frequencies of the source repositories in the resulting sample:

```
allenai/WildChat-4.8M                                             773579
arcee-ai/The-Tome                                                 659978
microsoft/orca-agentinstruct-1M-v1                                638897
MaziyarPanahi/Llama-Nemotron-Post-Training-Dataset-v1-ShareGPT    501775
HuggingFaceTB/smoltalk                                            471575
CohereForAI/aya_collection_merged                                 348836
TIGER-Lab/WebInstruct-CFT                                         280907
allenai/tulu-3-sft-mixture                                        277537
nvidia/OpenScience                                                269113
HuggingFaceTB/smoltalk2                                           247314
HuggingFaceH4/ultrachat_200k                                      203824
teknium/OpenHermes-2.5                                            196936
prometheus-eval/Preference-Collection                             175898
argilla/magpie-ultra-v1.0                                         173497
NousResearch/Hermes-3-Dataset                                      94556
sequelbox/Celestia3-DeepSeek-R1-0528                               85190
ytz20/LMSYS-Chat-GPT-5-Chat-Response                               84297
umd-zhou-lab/Reflect_WizV2_All                                     56941
arcee-ai/EvolKit-75K                                               49603
sequelbox/Raiden-DeepSeek-R1                                       43918
umd-zhou-lab/Reflect_Wiz70_All                                     42851
prometheus-eval/Feedback-Collection                                32375
umd-zhou-lab/Reflect_Alpaca_All                                    30105
Open-Orca/SlimOrca-Dedup                                           19530
allenai/WildChat-1M                                                19354
CohereLabs/aya_dataset                                             12682
LDJnr/Capybara                                                      2872
grammarly/coedit                                                    2516
```

The top 20 most common categories in the dataset:

```
Physics, Mathematics                       42489
Mathematics, Problem Solving               33835
Mathematics, Geometry                      31335
Image Prompt Generation                    28147
Mathematics, Calculus                      27966
Geometry, Mathematics                      26560
Translation, Data Processing               24404
Mathematics, Education                     22649
Math Problem                               22482
Mathematics, Algebra                       22420
Biology, Genetics                          21576
Probability, Statistics                    21108
Mathematics, Data Analysis                 19610
Text Classification                        18927
Chemistry, Organic Chemistry               18349
Translation, Cooking                       16739
Data Analysis, Statistics                  16448
Creative Writing, Character Development    16314
Physics, Quantum Mechanics                 14819
Python, Data Analysis                      14521
```

The top 20 most common languages in the dataset:

```
English       4833842
French         118037
German          89594
Portuguese      87260
Italian         71539
Indonesian      61299
Spanish         56537
Russian         55561
español         51282
Māori           43083
Samoan          40630
Vietnamese      31580
Turkish         28118
Dutch           26592
Thai            22784
en              19565
fr              18946
Chinese         13571
Greek           11074
Japanese        10749
```

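
The language table lists some languages under more than one label (for example `Spanish` and `español`, or `English` and `en`). Anyone computing per-language totals may want to merge such variants first. The sketch below does this for the affected rows, with counts copied from the table. The `ALIASES` mapping and the function name are hypothetical and not part of the dataset.

```python
from collections import Counter

# Counts copied from the language table (only the rows with label variants).
RAW_COUNTS = {
    "English": 4833842,
    "French": 118037,
    "Spanish": 56537,
    "español": 51282,
    "en": 19565,
    "fr": 18946,
}

# Hypothetical alias map; extend it for other variants found in the data.
ALIASES = {"español": "Spanish", "en": "English", "fr": "French"}


def merge_language_labels(counts: dict, aliases: dict) -> Counter:
    """Sum row counts after mapping each label to its canonical name."""
    merged = Counter()
    for label, rows in counts.items():
        merged[aliases.get(label, label)] += rows
    return merged


merged = merge_language_labels(RAW_COUNTS, ALIASES)
for language, rows in merged.most_common():
    print(f"{language:<10}{rows:>10}")
```

With these inputs the merged totals are 4853407 for English, 136983 for French, and 107819 for Spanish.
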
Provider:
anttip