lapa-llm/open-thoughts-114K-uk
收藏Hugging Face2025-11-02 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/lapa-llm/open-thoughts-114K-uk
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: system
dtype: string
- name: conversations
list:
- name: from
dtype: string
- name: value
dtype: string
- name: original
list:
- name: from
dtype: string
- name: value
dtype: string
splits:
- name: train
num_bytes: 6605487157
num_examples: 113941
download_size: 2605537935
dataset_size: 6605487157
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
language:
- uk
tags:
- math
- science
- code
- puzzles
- lapa
- synthetic
pretty_name: Ukrainian OpenThoughts 114K
---
# Dataset Card for Ukrainian OpenThoughts 114K
## Dataset Description
**Dataset Summary**
The translated version of [OpenThoughts 114K](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) to Ukrainian using [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it).
<!--[Provide a brief overview of your dataset - what it contains, its purpose, and why it was created. Example: "This dataset contains X examples of Ukrainian text collected from Y sources, designed to support the development of Ukrainian language models."] -->
**Languages**
- Ukrainian (uk)
<!-- **Dataset Structure** -->
<!-- The dataset is organized into the following splits:
| Split | Examples |
|-------|----------|
| Train | [number] |
| Validation | [number] |
| Test | [number] | -->
**Data Fields**
- `system`: original system prompt
- `conversation`: list of messages in a dialog (array of objects)
- `from`: normalized sender role — `user` or `assistant` (system messages are removed)
- `value`: message text
- `original`: original conversations from [OpenThoughts 114K](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)
## Dataset Creation
**Source Data**
- Base dataset: [OpenThoughts 114K](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) (Apache-2.0).
- Translation: inference with [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it).
## Considerations for Using the Data
**Intended Uses**
- Instruction/chat LLM training in Ukrainian
- Research on reasoning-heavy tasks
**Social Impact**
This dataset was created to support Ukrainian language AI development and improve language technology accessibility for Ukrainian speakers.
<!-- **Bias and Limitations**
[Discuss any known biases, limitations, or potential issues with the dataset. Be transparent about what the dataset may not be suitable for.] -->
## Citation
TBD
<!--
**BibTeX**
```bibtex
@dataset
{dataset_name,
author = {[Your Name/Organization]},
title = {[Dataset Name]},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/[your-org]/[dataset-name]}
}
```
-->
## Contact
<!-- For questions or feedback, please contact [your contact information] or open an issue on the dataset repository. -->
For questions or feedback, please open an issue on the dataset repository.
## License
CC-BY-SA-4.0
---
*This dataset is part of the "Lapa" - Ukrainian LLM initiative to advance natural language processing for the Ukrainian language.*
提供机构:
lapa-llm



