recogna-nlp/Bode-mix-no-reasoning

Source: Hugging Face. Updated 2026-02-26; indexed 2026-04-05.

Download link:
https://hf-mirror.com/datasets/recogna-nlp/Bode-mix-no-reasoning
Resource description:
---
language:
- pt
pretty_name: Bode-Mix-No-Reasoning
size_categories:
- 1K<n<10K
task_categories:
- text-generation
---

# Bode-Mix-No-Reasoning

Bode-Mix-No-Reasoning is a complementary Portuguese-language dataset designed for direct question-answering fine-tuning of Large Language Models (LLMs) without intermediate reasoning traces. This dataset comprises 2,246 instances of open-ended questions and their corresponding answers, sourced from a general-purpose Portuguese instruction dataset and Brazilian university entrance examinations.

## Dataset Details

### Dataset Description

This dataset was created as a complementary component to the [Bode-Reasoning](https://huggingface.co/datasets/recogna-nlp/Bode-reasoning) dataset, providing direct question–answer pairs without intermediate reasoning steps. It combines general-purpose instruction-following instances from [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean) with open-ended university entrance exam questions from EduBench whose Gemini-2.5 PRO-generated reasoning traces scored below the 70% G-Eval similarity threshold. By excluding subpar reasoning traces while retaining high-quality expected answers, this dataset enables models to learn direct response generation alongside reasoning-augmented training.
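The routing rule described above can be sketched in a few lines. This is an illustration only, not the authors' pipeline: the `ScoredExample` class, its field names, the `route` helper, and the assumption that G-Eval scores are normalized to the range 0 to 1 are all hypothetical. Only the 70% threshold and the `id`/`input`/`output` record layout come from this card.

```python
from dataclasses import dataclass

# The 70% G-Eval similarity threshold stated in this card,
# expressed on an assumed 0-1 scale.
G_EVAL_THRESHOLD = 0.70


@dataclass
class ScoredExample:
    """Hypothetical record for one exam question after G-Eval scoring."""
    id: str
    question: str
    expected_answer: str
    reasoning_trace: str
    g_eval_score: float  # assumed normalized to [0, 1]


def to_no_reasoning_record(example: ScoredExample) -> dict:
    """Keep only the question and expected answer; the trace is dropped."""
    return {
        "id": example.id,
        "input": example.question,
        "output": example.expected_answer,
    }


def route(examples):
    """Split examples into (kept_with_reasoning, no_reasoning_records).

    Examples scoring below the threshold lose their trace and become
    direct question-answer records; the rest are left untouched as
    candidates for a reasoning dataset (an assumption of this sketch).
    """
    kept_with_reasoning, no_reasoning_records = [], []
    for example in examples:
        if example.g_eval_score < G_EVAL_THRESHOLD:
            no_reasoning_records.append(to_no_reasoning_record(example))
        else:
            kept_with_reasoning.append(example)
    return kept_with_reasoning, no_reasoning_records


# Made-up examples, for illustration only.
examples = [
    ScoredExample("demo-1", "Pergunta A?", "Resposta A.", "trace A", 0.55),
    ScoredExample("demo-2", "Pergunta B?", "Resposta B.", "trace B", 0.82),
]
kept, no_reasoning = route(examples)
print(no_reasoning)
```

Note that the comparison is strict: an example scoring exactly 0.70 is not routed to the no-reasoning set in this sketch, which matches the card's wording "below the 70% threshold".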
### Key Features

- Size: 2,246 instances (2,000 training instances and 246 test instances)
- Language: Brazilian Portuguese (PT-BR)
- Task Types: Open-ended questions, instruction-following
- Domains: General knowledge, education, diverse academic subjects
- Format: Direct question–answer pairs (no intermediate reasoning traces)

### Supported Tasks

- Direct Question Answering: Training models to produce concise, direct answers without intermediate reasoning steps
- Instruction Following: Fine-tuning LLMs for general-purpose Portuguese instruction-following tasks
- Complementary Training: Augmenting reasoning-focused datasets with non-reasoning instances to promote adaptive model behavior

## Dataset Structure

Each instance in the dataset contains:

- `id`: Unique instance identifier, which also indicates the original data source of the text
- `input`: The original question or instruction text (open-ended)
- `output`: The direct answer without intermediate reasoning traces

### Data Fields

```python
{
    "id": str,      # Unique identifier
    "input": str,   # Question or instruction text
    "output": str   # Direct answer (no reasoning traces)
}
```

### Data Splits

| Split | Instances |
|-------|-----------|
| Train | 2,000 |
| Test | 246 |

## Dataset Creation

### Source Data

#### Base Datasets

1. **[cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)**
   - Portuguese translation of the Open-Orca/1million-gpt-4 dataset
   - Filtered to exclude programming-related content and non-Latin characters
   - 1,014 instances (844 training and 170 test)
   - Covers diverse general-knowledge and instruction-following tasks
2. **EduBench (Proprietary Dataset, G-Eval < 70%)**
   - Open-ended questions from three Brazilian university entrance exams
   - Years: 2015–2023 (the 2023 edition is held out for testing)
   - 1,232 instances (1,156 training and 76 test)
   - Comprises questions whose Gemini-2.5 PRO-generated reasoning traces scored below the 70% G-Eval similarity threshold; only the questions and their expected answers were retained, and the subpar intermediate reasoning traces were discarded

### Data Collection and Processing

#### Instance Selection and Filtering

1. **GPT4-500k-Augmented-PTBR-Clean:**
   - Instances were randomly sampled from the original dataset
   - An 80/20 train/test split was applied, yielding 844 training and 170 test instances
2. **EduBench (G-Eval < 70%):**
   - Reasoning traces were initially generated using Gemini-2.5 PRO, selected based on its superior performance on the [Open Portuguese LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard)
   - Quality was assessed using the G-Eval metric; instances whose reasoning traces scored below the 70% similarity threshold were excluded from the reasoning dataset ([Bode-Reasoning](https://huggingface.co/datasets/recogna-nlp/Bode-reasoning))
   - Although the reasoning traces did not meet the minimum quality criteria, the expected answers remained valuable; therefore, only the question–answer pairs were retained, and the intermediate reasoning traces were discarded
   - Test instances correspond to questions from the 2023 exam edition
3. **Data Integration:**
   - Combined instances from both source datasets
   - Ensured format consistency across all sources (direct question–answer pairs without reasoning traces)

### Test Set Construction

A separate test set comprising 246 instances was created:

- GPT4-500k-Augmented-PTBR-Clean: 170 randomly sampled instances
- EduBench (G-Eval < 70%, 2023 edition held out): 76 questions

### Data Composition

**By Source:**

- GPT4-500k-Augmented-PTBR-Clean: ~45%
- EduBench (G-Eval < 70%): ~55%

**By Question Type:**

- Open-ended: 100%

## References

**Key Source Datasets:**

```bibtex
# GPT4-500k-Augmented-PTBR-Clean
@misc{moro_2025_gpt4500kaugmentedptbrclean,
  title        = {GPT4-500k-Augmented-PTBR-Clean},
  author       = {Moro, Carlo},
  year         = 2025,
  month        = {07},
  url          = {https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean},
  urldate      = {2025-10-17},
  organization = {Huggingface.co}
}
```

## Usage

```python
from datasets import load_dataset

# Load the full dataset (all splits)
dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning")

# Load Train split
train_dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning", split="train")

# Load Test split
test_dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning", split="test")
```

## Citation

This work was accepted at **PROPOR 2026**. The full citation will be made available after the official publication of the proceedings.

## Changelog

### Version 1.0 (Initial Release)

- 2,246 instances (2,000 train / 246 test)
- Composed of general-purpose Portuguese instruction data and university entrance exam questions without reasoning traces
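The counts reported in this card can be cross-checked with a little arithmetic. The sketch below uses only the per-source train and test figures stated above; the `SOURCES` table and the `summarize` helper are hypothetical names introduced for illustration and are not part of the dataset or its tooling.

```python
# Per-source (train, test) counts copied from this card.
SOURCES = {
    "GPT4-500k-Augmented-PTBR-Clean": (844, 170),
    "EduBench (G-Eval < 70%)": (1156, 76),
}


def summarize(sources):
    """Return total train, total test, and per-source statistics."""
    total_train = sum(train for train, _ in sources.values())
    total_test = sum(test for _, test in sources.values())
    total = total_train + total_test
    rows = {}
    for name, (train, test) in sources.items():
        size = train + test
        rows[name] = {
            "instances": size,
            "share_of_dataset": size / total,  # fraction of all instances
            "test_fraction": test / size,      # fraction held out for test
        }
    return total_train, total_test, rows


total_train, total_test, rows = summarize(SOURCES)

# The per-source counts should add up to the reported split sizes.
assert (total_train, total_test) == (2000, 246)

for name, row in rows.items():
    print(
        f"{name}: {row['instances']} instances, "
        f"{row['share_of_dataset']:.1%} of the dataset, "
        f"{row['test_fraction']:.1%} held out for test"
    )
```

With these figures the two sources account for roughly 45% and 55% of the 2,246 instances, in line with the composition listed in this card. The held-out share of the GPT4-500k-Augmented-PTBR-Clean sample comes out near 17%, so the 80/20 split reported for that source is best read as approximate.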
Provider: recogna-nlp