
togethercomputer/aurora

Hugging Face · Updated 2026-03-27 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/togethercomputer/aurora
Description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
- text2text-generation
language:
- en
tags:
- code
- math
- reasoning
- chat
- finance
pretty_name: Online SD Dataset
size_categories:
- 100K<n<1M
configs:
- config_name: chat
  data_files:
  - split: train
    path: chat/train/*.jsonl
  - split: test
    path: chat/test/*.jsonl
- config_name: code
  data_files:
  - split: train
    path: code/train/*.jsonl
  - split: test
    path: code/test/*.jsonl
  - split: eval
    path: code/eval/*.jsonl
- config_name: commonsense
  data_files:
  - split: train
    path: commonsense/train/*.jsonl
  - split: test
    path: commonsense/test/*.jsonl
  - split: validation
    path: commonsense/validation/*.jsonl
- config_name: finance
  data_files:
  - split: train
    path: finance/train/*.jsonl
  - split: eval
    path: finance/eval/*.jsonl
- config_name: math
  data_files:
  - split: train
    path: math/train/*.jsonl
  - split: test
    path: math/test/*.jsonl
- config_name: merged_chat
  data_files:
  - split: train
    path: merged/merged_chat_train_shuffled.jsonl
- config_name: merged_code
  data_files:
  - split: train
    path: merged/merged_code_train_shuffled.jsonl
- config_name: merged_commonsense
  data_files:
  - split: train
    path: merged/merged_commonsense_train_shuffled.jsonl
- config_name: merged_finance
  data_files:
  - split: train
    path: merged/merged_finance_train_shuffled.jsonl
- config_name: merged_math
  data_files:
  - split: train
    path: merged/merged_math_train_shuffled.jsonl
---

# Online SD Dataset

A comprehensive multi-domain training dataset with **619,177 samples** covering code generation, mathematical reasoning, conversational AI, commonsense reasoning, and financial QA.

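The `configs` block in the front matter declares a different set of splits for each configuration (for example, `code` has `eval` while `commonsense` has `validation`). As a minimal sketch, the mapping below transcribes those declarations into plain Python so the held-out splits can be looked up before loading anything. `CONFIG_SPLITS` and `held_out_splits` are hypothetical names introduced here for illustration; they are not part of the dataset or of the `datasets` library.

```python
# Splits per configuration, transcribed from the `configs` block
# in the front matter. The names below are illustrative only.
CONFIG_SPLITS = {
    "chat": ["train", "test"],
    "code": ["train", "test", "eval"],
    "commonsense": ["train", "test", "validation"],
    "finance": ["train", "eval"],
    "math": ["train", "test"],
    "merged_chat": ["train"],
    "merged_code": ["train"],
    "merged_commonsense": ["train"],
    "merged_finance": ["train"],
    "merged_math": ["train"],
}


def held_out_splits(config_name):
    """Return every non-train split declared for a configuration."""
    return [s for s in CONFIG_SPLITS[config_name] if s != "train"]


if __name__ == "__main__":
    for name in CONFIG_SPLITS:
        print(name, held_out_splits(name))
```

Note that the `merged_*` configurations only declare a `train` split, so evaluation data has to come from the per-domain configurations.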
## 🌟 Key Features

- **Multi-Domain Coverage**: 5 major domains with diverse tasks
- **Pre-Merged Files**: Ready-to-use merged files for each domain
- **Unified Format**: Consistent conversational structure across all datasets
- **High Quality**: Curated from well-known open-source datasets
- **Flexible Loading**: Load by domain, source, or custom combinations

## 📊 Dataset Overview

| Domain | Train Samples | Test/Val Samples | Sources |
|--------|--------------|------------------|---------|
| 💬 **Chat** | 100,000 | 200 | Chatbot Instructions |
| 💻 **Code** | 200,764 | 564 | CodeSearchNet, MBPP, Tiny-Codes, HumanEval |
| 🧠 **Commonsense** | 101,913 | 1,200 | WinoGrande, Social IQA, PIQA, CommonsenseQA, ARC |
| 💰 **Finance** | 68,712 | 200 | Finance Alpaca |
| 🔢 **Math** | 147,788 | 400 | GSM8K, Math Dataset, DeepScaleR |
| **Total** | **619,177** | **2,564** | 13 datasets |

## 📁 Dataset Structure

### Domain Organization

```
onlinesd/
├── chat/
│   ├── train/          # 100K chat/instruction samples
│   └── test/           # 200 test samples
├── code/
│   ├── train/          # 200K+ code generation samples
│   ├── test/           # Test samples
│   └── eval/           # Evaluation samples
├── commonsense/
│   ├── train/          # 100K+ commonsense reasoning
│   ├── test/
│   └── validation/
├── finance/
│   ├── train/          # 68K finance domain samples
│   └── eval/
├── math/
│   ├── train/          # 147K math problem-solving
│   └── test/
└── merged/             # Pre-merged and shuffled files by domain
    ├── merged_chat_train_shuffled.jsonl
    ├── merged_code_train_shuffled.jsonl
    ├── merged_commonsense_train_shuffled.jsonl
    ├── merged_finance_train_shuffled.jsonl
    └── merged_math_train_shuffled.jsonl
```

### 🎯 Merged Files (Recommended for Training)

The `merged/` directory contains pre-combined and **shuffled** files for each domain, saving you time on data preprocessing:

| File | Samples | Size | Description |
|------|---------|------|-------------|
| `merged_chat_train_shuffled.jsonl` | 100,000 | 14 MB | All chat & instruction-following data (shuffled) |
| `merged_code_train_shuffled.jsonl` | 200,764 | 82 MB | All code generation data from 3 sources (shuffled) |
| `merged_commonsense_train_shuffled.jsonl` | 101,913 | 24 MB | All commonsense reasoning from 5 datasets (shuffled) |
| `merged_finance_train_shuffled.jsonl` | 68,712 | 9 MB | All financial domain QA (shuffled) |
| `merged_math_train_shuffled.jsonl` | 147,788 | 27 MB | All math problem-solving from 3 sources (shuffled) |

**Benefits of using merged files:**

- ✅ No manual merging needed
- ✅ Consistent formatting
- ✅ Pre-shuffled for training (seed=42)
- ✅ Source diversity maintained
- ✅ Faster loading
- ✅ Easy domain mixing

## 🚀 Quick Start

### Installation

```bash
pip install datasets
```

### Load Merged Files (Recommended)

```python
from datasets import load_dataset

# Load a single domain (shuffled)
math_data = load_dataset(
    "zelc/onlinesd",
    data_files="merged/merged_math_train_shuffled.jsonl",
    split="train"
)
print(f"Math samples: {len(math_data)}")

# Load multiple domains
multi_domain = load_dataset(
    "zelc/onlinesd",
    data_files={
        "math": "merged/merged_math_train_shuffled.jsonl",
        "code": "merged/merged_code_train_shuffled.jsonl",
        "chat": "merged/merged_chat_train_shuffled.jsonl"
    }
)
print(multi_domain)
# DatasetDict({
#     math: Dataset
#     code: Dataset
#     chat: Dataset
# })
```

### Load by Configuration

```python
from datasets import load_dataset

# Load all math data (train + test splits)
math_dataset = load_dataset("zelc/onlinesd", "math")

# Load only training split
code_train = load_dataset("zelc/onlinesd", "code", split="train")

# Load using merged config
merged_math = load_dataset("zelc/onlinesd", "merged_math")
```

### Load Specific Source Files

```python
from datasets import load_dataset

# Load a specific source dataset
gsm8k = load_dataset(
    "zelc/onlinesd",
    data_files="math/train/gsm8k_train.jsonl"
)

# Load specific test set
arc_test = load_dataset(
    "zelc/onlinesd",
    data_files="commonsense/test/allenai_ai2_arc_test.jsonl"
)
```

## 📝 Data Format

All samples follow a unified conversational format:

```json
{
  "id": "dataset_source_index",
  "conversations": [
    {
      "role": "user",
      "content": "What is 25 * 4?"
    },
    {
      "role": "assistant",
      "content": "25 * 4 = 100"
    }
  ]
}
```

**Fields:**

- `id`: Unique identifier (format: `{dataset_name}_{index}`)
- `conversations`: List of conversation turns
  - `role`: Either "user" or "assistant" (some may include "system")
  - `content`: The message content

**Note:** Test/evaluation samples typically only include the user prompt (no assistant response).

## 📚 Detailed Domain Information

### 💬 Chat (100,000 samples)

**Purpose**: Instruction following and conversational AI training

**Sources:**

- `alespalla/chatbot_instruction_prompts` (100K samples)

**Use Cases**: General instruction following, task completion, dialogue systems

---

### 💻 Code (200,764 samples)

**Purpose**: Code generation and programming assistance

**Sources:**

- **CodeSearchNet** (100K, 49.81%): Function generation from docstrings
- **Tiny-Codes** (99.8K, 49.71%): Short code snippets
- **MBPP** (964, 0.48%): Python programming problems
- **HumanEval** (test only): Canonical code evaluation

**Languages**: Primarily Python, with some multi-language support

**Use Cases**: Code completion, docstring-to-code, programming problem solving

---

### 🧠 Commonsense (101,913 samples)

**Purpose**: Commonsense and social reasoning

**Sources:**

- **WinoGrande** (40.4K, 39.64%): Pronoun resolution requiring commonsense
- **Social IQA** (33.4K, 32.78%): Social situation reasoning
- **PIQA** (16.1K, 15.81%): Physical commonsense about everyday situations
- **CommonsenseQA** (9.7K, 9.56%): Multiple-choice commonsense QA
- **AI2 ARC** (2.3K, 2.21%): Science exam questions requiring reasoning

**Format**: Most are multiple-choice with context and options

**Use Cases**: Commonsense reasoning, social understanding, everyday situation prediction

---

### 💰 Finance (68,712 samples)

**Purpose**: Financial domain question answering and analysis

**Sources:**

- **Finance Alpaca** (68.7K, 100%): Financial instruction-following dataset

**Topics**: Investment, financial concepts, market analysis, financial advice

**Use Cases**: Financial QA systems, investment advisory, financial education

---

### 🔢 Math (147,788 samples)

**Purpose**: Mathematical problem solving and reasoning

**Sources:**

- **Math Dataset** (100K, 67.66%): Algebra and arithmetic problems
- **DeepScaleR** (40.3K, 27.28%): Advanced math reasoning
- **GSM8K** (7.5K, 5.06%): Grade school math word problems

**Difficulty**: Ranges from elementary arithmetic to advanced problem solving

**Use Cases**: Math tutoring, problem solving, step-by-step reasoning

## 💡 Usage Tips

### For Training

```python
from datasets import load_dataset, concatenate_datasets

# Mix multiple domains with custom ratios
math = load_dataset("zelc/onlinesd", data_files="merged/merged_math_train_shuffled.jsonl", split="train")
code = load_dataset("zelc/onlinesd", data_files="merged/merged_code_train_shuffled.jsonl", split="train")

# Sample and combine
math_sample = math.shuffle(seed=42).select(range(50000))
code_sample = code.shuffle(seed=42).select(range(50000))

mixed = concatenate_datasets([math_sample, code_sample]).shuffle(seed=42)
```

### For Evaluation

```python
from datasets import load_dataset

# Load test sets
math_test = load_dataset("zelc/onlinesd", "math", split="test")
commonsense_test = load_dataset("zelc/onlinesd", "commonsense", split="test")

# Evaluate on specific benchmarks
gsm8k_test = load_dataset(
    "zelc/onlinesd",
    data_files="math/test/gsm8k_test.jsonl",
    split="train"  # Note: a file passed via data_files is exposed as the "train" split
)
```

### Domain-Specific Training

```python
from datasets import load_dataset

# Train a math specialist
math_data = load_dataset("zelc/onlinesd", "merged_math", split="train")

# Train a code specialist
code_data = load_dataset("zelc/onlinesd", "merged_code", split="train")

# Train a reasoning specialist
reasoning_data = load_dataset(
    "zelc/onlinesd",
    data_files={
        "commonsense": "merged/merged_commonsense_train_shuffled.jsonl",
        "math": "merged/merged_math_train_shuffled.jsonl"
    }
)
```

## 📈 Dataset Statistics Summary

### Training Data Distribution

```
Code        ████████████████████ 200,764 (32.4%)
Math        ███████████████ 147,788 (23.9%)
Commonsense ████████████ 101,913 (16.5%)
Chat        ████████████ 100,000 (16.1%)
Finance     ████████ 68,712 (11.1%)
```

### Test/Validation Data

| Domain | Test | Validation | Eval | Total |
|--------|------|------------|------|-------|
| Commonsense | 400 | 800 | - | 1,200 |
| Code | 364 | - | 200 | 564 |
| Math | 400 | - | - | 400 |
| Finance | - | - | 200 | 200 |
| Chat | 200 | - | - | 200 |
| **Total** | **1,364** | **800** | **400** | **2,564** |

## 🔗 Source Datasets

This dataset combines and reformats the following open-source datasets:

- [CodeSearchNet](https://github.com/github/CodeSearchNet)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)
- [Tiny-Codes](https://huggingface.co/datasets/nampdn-ai/tiny-codes)
- [HumanEval](https://github.com/openai/human-eval)
- [GSM8K](https://github.com/openai/grade-school-math)
- [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset)
- [Math Dataset](https://github.com/deepmind/mathematics_dataset)
- [WinoGrande](https://winogrande.allenai.org/)
- [Social IQA](https://allenai.org/data/socialiqa)
- [PIQA](https://yonatanbisk.com/piqa/)
- [CommonsenseQA](https://www.tau-nlp.org/commonsenseqa)
- [AI2 ARC](https://allenai.org/data/arc)
- [Finance Alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)

## 📄 License

Apache 2.0

Please also respect the licenses of the original source datasets.

## 🙏 Citation

If you use this dataset in your research, please cite the original sources.
You can also cite this dataset as:

```bibtex
@dataset{onlinesd2024,
  title={Online SD Dataset: A Multi-Domain Training Collection},
  author={Online SD Team},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/zelc/onlinesd}
}
```

## 📧 Contact

For questions, suggestions, or issues:

- Open an issue in the [Discussion forum](https://huggingface.co/datasets/zelc/onlinesd/discussions)
- Report bugs via the Issues tab

## 🔄 Updates

- **2024-01**: Initial release with 619K training samples across 5 domains
  - Includes pre-merged files for convenient training

---

**Made with ❤️ for the open-source community**
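The unified conversational format described under Data Format can be flattened into prompt/response pairs for supervised training. The following is a minimal sketch, assuming single-turn records like the one shown in that section; `to_prompt_response` and the example `id` values are hypothetical and are not part of the dataset or any library. Records without an assistant turn, which the note says is typical for test and evaluation samples, yield `None` as the response.

```python
from typing import Optional, Tuple


def to_prompt_response(record: dict) -> Tuple[str, Optional[str]]:
    """Split one record into (prompt, response).

    The prompt joins every non-assistant turn (system and user) in order.
    The response is the last assistant turn, or None when the record has
    no assistant turn.
    """
    prompt_parts = []
    response = None
    for turn in record["conversations"]:
        if turn["role"] == "assistant":
            response = turn["content"]
        else:
            prompt_parts.append(turn["content"])
    return "\n".join(prompt_parts), response


# Hypothetical records that follow the documented schema.
train_example = {
    "id": "gsm8k_0",
    "conversations": [
        {"role": "user", "content": "What is 25 * 4?"},
        {"role": "assistant", "content": "25 * 4 = 100"},
    ],
}
test_example = {
    "id": "gsm8k_test_0",
    "conversations": [{"role": "user", "content": "What is 25 * 4?"}],
}

if __name__ == "__main__":
    print(to_prompt_response(train_example))
    print(to_prompt_response(test_example))
```

For multi-turn conversations this flattening discards the turn order between user and assistant messages, so a chat template would be a better fit there.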
Provider:
togethercomputer