recogna-nlp/Bode-mix-no-reasoning
收藏Hugging Face2026-02-26 更新2026-04-05 收录
下载链接:
https://hf-mirror.com/datasets/recogna-nlp/Bode-mix-no-reasoning
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- pt
pretty_name: Bode-Mix-No-Reasoning
size_categories:
- 1K<n<10K
task_categories:
- text-generation
---
# Bode-Mix-No-Reasoning
Bode-Mix-No-Reasoning is a complementary Portuguese-language dataset designed for direct question-answering fine-tuning of Large Language Models (LLMs) without intermediate reasoning traces. This dataset comprises 2,246 instances of open-ended questions and their corresponding answers, sourced from a general-purpose Portuguese instruction dataset and Brazilian university entrance examinations.
## Dataset Details
### Dataset Description
This dataset was created as a complementary component to the [Bode-Reasoning](https://huggingface.co/datasets/recogna-nlp/Bode-reasoning) dataset, providing direct question–answer pairs without intermediate reasoning steps. It combines general-purpose instruction-following instances from [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean) with open-ended university entrance exam questions from EduBench whose Gemini-2.5 PRO-generated reasoning traces scored below the 70% G-Eval similarity threshold. By excluding subpar reasoning traces while retaining high-quality expected answers, this dataset enables models to learn direct response generation alongside reasoning-augmented training.
### Key Features:
- Size: 2,246 instances (2,000 training instances and 246 test instances)
- Language: Brazilian Portuguese (PT-BR)
- Task Types: Open-ended questions, instruction-following
- Domains: General knowledge, education, diverse academic subjects
- Format: Direct question–answer pairs (no intermediate reasoning traces)
### Supported Tasks
- Direct Question Answering: Training models to produce concise, direct answers without intermediate reasoning steps
- Instruction Following: Fine-tuning LLMs for general-purpose Portuguese instruction-following tasks
- Complementary Training: Augmenting reasoning-focused datasets with non-reasoning instances to promote adaptive model behavior
## Dataset Structure
### Each instance in the dataset contains:
- id: Unique instance identifier that also refers to the original data source of the text
- input: The original question or instruction text (open-ended)
- output: The direct answer without intermediate reasoning traces
### Data Fields
```python
{
"id" : str, # Unique identifier
"input": str, # Question or instruction text
"output": str # Direct answer (no reasoning traces)
}
```
### Data Splits
| Split | Instances |
|-------|-----------|
| Train | 2,000 |
| Test | 246 |
## Dataset Creation
### Source Data
#### Base Datasets
1. **[cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)**
- Portuguese translation of the Open-Orca/1million-gpt-4 dataset
- Filtered to exclude programming-related content and non-Latin characters
- 1,014 instances (844 training and 170 test)
- Covers diverse general-knowledge and instruction-following tasks
2. **EduBench (Proprietary Dataset — G-Eval < 70%)**
- Open-ended questions from three Brazilian university entrance exams
- Years: 2015–2023 (excluding 2023 for testing)
- 1,232 instances (1,156 training and 76 test)
- Comprises questions whose Gemini-2.5 PRO-generated reasoning traces scored below the 70% G-Eval similarity threshold; only questions and their expected answers were retained, discarding the subpar intermediate reasoning traces
### Data Collection and Processing
#### Instance Selection and Filtering
1. **GPT4-500k-Augmented-PTBR-Clean:**
- Instances were randomly sampled from the original dataset
- An 80/20 train/test split was applied, yielding 844 training and 170 test instances
2. **EduBench (G-Eval < 70%):**
- Reasoning traces were initially generated using Gemini-2.5 PRO, selected based on its superior performance on the [Open Portuguese LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard)
- Quality was assessed using the G-Eval metric; instances whose reasoning traces scored below the 70% similarity threshold were excluded from the reasoning dataset ([Bode-Reasoning](https://huggingface.co/datasets/recogna-nlp/Bode-reasoning))
- Although the reasoning traces did not meet minimum quality criteria, the expected answers remained valuable; therefore, only the question–answer pairs were retained, discarding the intermediate reasoning traces
- Test instances correspond to questions from the 2023 exam edition
3. **Data Integration:**
- Combined instances from both source datasets
- Ensured format consistency across all sources (direct question–answer pairs without reasoning traces)
### Test Set Construction
A separate test set comprising 246 instances was created:
- GPT4-500k-Augmented-PTBR-Clean: 170 randomly sampled instances
- EduBench (G-Eval < 70%, holding 2023): 76 questions
### Data Composition
**By Source:**
- GPT4-500k-Augmented-PTBR-Clean: ~45%
- EduBench (G-Eval < 70%): ~55%
**By Question Type:**
- Open-ended: 100%
## References
**Key Source Datasets:**
```bibtex
# GPT4-500k-Augmented-PTBR-Clean
@misc{moro_2025_gpt4500kaugmentedptbrclean,
title = {GPT4-500k-Augmented-PTBR-Clean},
author = {Moro, Carlo},
year = 2025,
month = {07},
url = {https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean},
urldate = {2025-10-17},
organization = {Huggingface.co}
}
```
## Usage
```python
from datasets import load_dataset
# Load the full dataset (all splits)
dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning")
# Load Train split
train_dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning", split="train")
# Load Test split
test_dataset = load_dataset("recogna-nlp/Bode-mix-no-reasoning", split="test")
```
## Citation
This work was accepted at **PROPOR 2026**. The full citation will be made available after the official publication of the proceedings.
## Changelog
### Version 1.0 (Initial Release)
- 2,246 instances (2,000 train / 246 test)
- Composed of general-purpose Portuguese instruction data and university entrance exam questions without reasoning traces
提供机构:
recogna-nlp



