essobi/trl_gair_lima_v19
收藏Hugging Face2026-04-01 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/essobi/trl_gair_lima_v19
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: metadata
dtype: json
splits:
- name: train
- name: test
download_size: unknown
dataset_size: unknown
license: cc-by-nc-4.0
---
# GAIR/LIMA TRL Dataset
This dataset is a transformed version of the [GAIR/LIMA](https://huggingface.co/datasets/GAIR/LIMA) dataset, specifically formatted for training with TRL (Transformers Reinforcement Learning).
## Dataset Description
- **Homepage:** [GAIR/LIMA](https://huggingface.co/datasets/GAIR/LIMA)
- **Repository:** [ProblemGenerationAgent](https://github.com/er-ads/ProblemGenerationAgent)
- **Paper:** [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206)
## Dataset Summary
This dataset contains conversations from the GAIR/LIMA dataset, transformed into a message-based format suitable for conversational AI training and reinforcement learning.
### Key Transformations
- Converted from Q&A format to multi-turn conversation format
- Applied message-level transformations (HTML tag removal, whitespace normalization)
- Filtered conversations based on quality criteria (length checks, content validation)
- Preserved original metadata for reference and traceability
## Dataset Structure
Each example contains:
- `messages`: List of conversation turns, each with:
- `role`: Either "user" or "assistant"
- `content`: Message text
- `metadata`: Original and derived metadata including:
- `original_id`: ID from original LIMA dataset
- `answers_count`: Number of answers in original format
- `original_data`: Preserved original fields
## Splits
- **train**: Training split (filtered LIMA train set)
- **test**: Test split (filtered LIMA test set)
## Creation Process
This dataset was created using the transformation pipeline defined in the ProblemGenerationAgent repository:
1. **Loading**: GAIR/LIMA dataset loaded from HuggingFace Hub
2. **Transformation**: Message-level transformations applied (HTML removal, whitespace normalization)
3. **Filtering**: Conversations filtered based on quality criteria:
- Minimum conversation length: 2 messages
- Minimum answer length: 50 characters
- Maximum answer length: 5000 characters
4. **Output**: Transformed conversations in message format
See `transformation_report.json` for detailed statistics about the transformation process.
See `rejection_log.json` for conversations that did not pass filtering criteria.
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("essobi/trl_gair_lima")
# Access examples
for example in dataset["train"]:
messages = example["messages"]
metadata = example["metadata"]
```
## License
This dataset follows the same license as the original GAIR/LIMA dataset: CC-BY-NC-4.0
## Citation
Original GAIR/LIMA dataset:
```
@article{zhou2023lima,
title={LIMA: Less Is More for Alignment},
author={Zhou, Chunting and Liu, Pengfei and Xu, Puxin and others},
journal={arXiv preprint arXiv:2305.11206},
year={2023}
}
```
## Disclaimer
This transformed dataset is provided as-is for research and educational purposes. The original LIMA dataset is subject to its own license and terms of use.
提供机构:
essobi



