
Observingserver/OASST-DE

Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Observingserver/OASST-DE
Description:
---
dataset_info:
  features:
  - name: conversation
    list:
    - name: role
      dtype: string
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 8022604.792326268
    num_examples: 3721
  download_size: 4325950
  dataset_size: 8022604.792326268
license: apache-2.0
language:
- de
size_categories:
- 1K<n<10K
---

# German OpenAssistant Conversations Dataset (OASST-DE)

With the goal of advancing open-source, German-language LLM research, we present OASST-DE: a high-quality subset of a recent (25.08.23) dump from the [OpenAssistant website](https://www.open-assistant.io/), translated to German using the GPT-3.5 API. More details on how the dataset was filtered and translated can be found under [Dataset Creation Process](#dataset-creation-process). For more details on the OpenAssistant Project, look at the [first OASST dataset (OASST1)](https://huggingface.co/datasets/OpenAssistant/oasst1), [the Open-Assistant GitHub repo](https://github.com/LAION-AI/Open-Assistant) or [our paper](https://arxiv.org/abs/2304.07327).

This dataset was created as part of LAION's LeoLM (Linguistically Enhanced Open Language Model) project led by Björn Plüster. Check out the LeoLM-Chat models fine-tuned on OASST-DE ([7b](https://huggingface.co/LeoLM/leo-hessianai-7b-chat), [13b](https://huggingface.co/LeoLM/leo-hessianai-13b-chat)) and read [their blog post](https://laion.ai/blog/leo-lm/) for more info on LeoLM.

## Dataset Creation Process

This dataset was created from a recent OASST dump by following these steps:

- Filter for Top1 response trees with assistant response leaves
- Filter first prompt quality >= 0.5
- Filter total conversation length < 1900 tokens to fit in GPT3.5 context length
- Filter for `'lang' == 'de'` -> add to dataset
- Filter for `'lang' == 'en'` (other languages often result in failed translations)
- Translate using GPT-3.5-turbo API (total cost ~$15).

This results in around 3.7k samples of high-quality assistant conversations.
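The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes each input thread has already been reduced to its top-ranked (Top1) path, and the record keys `first_prompt_quality` and `lang`, the whitespace-based `count_tokens` helper, and the `translate` callback are all hypothetical stand-ins for whatever the real dump and GPT-3.5 translation call provide.

```python
MAX_TOKENS = 1900          # length limit stated in the steps above
MIN_PROMPT_QUALITY = 0.5   # first-prompt quality threshold stated above


def count_tokens(text):
    """Crude whitespace count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def select_conversations(threads, translate):
    """Filter Top1 threads and translate the English ones.

    `threads` is an iterable of dicts with hypothetical keys
    'lang', 'first_prompt_quality' and 'conversation'.
    `translate` is any callable mapping an English string to German.
    """
    kept = []
    for thread in threads:
        turns = thread["conversation"]
        # Keep only threads whose last turn (the leaf) is an assistant reply.
        if not turns or turns[-1]["role"] != "assistant":
            continue
        # Drop threads whose first prompt is rated below the threshold.
        if thread["first_prompt_quality"] < MIN_PROMPT_QUALITY:
            continue
        # Drop threads that are too long for the translation context window.
        if sum(count_tokens(t["text"]) for t in turns) >= MAX_TOKENS:
            continue
        if thread["lang"] == "de":
            # German threads are added unchanged.
            kept.append(turns)
        elif thread["lang"] == "en":
            # English threads are translated turn by turn.
            kept.append(
                [{"role": t["role"], "text": translate(t["text"])} for t in turns]
            )
        # All other languages are skipped.
    return kept
```

Passing the translator in as a callback keeps the filter logic testable without any API access; a real run would plug in a function that calls the translation model.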
## Dataset Structure

This dataset has only one `'conversation'` field. Each example is a list of alternating turns between `'prompter'` and `'assistant'`, where each entry is a dict with `'text'` and `'role'` fields:

```json
"conversation": [
  {"role": "prompter", "text": "Moin, wie geht's dir?"},
  {"role": "assistant", "text": "Moin Moin! Mir geht es gut, und dir?"},
  ...
]
```

## Usage with 🤗Datasets:

```python
from datasets import load_dataset

# Load the single "train" split.
ds = load_dataset("OpenAssistant/OASST-DE", split="train")
# Each example holds a list of {"role", "text"} turns.
print(ds[0]["conversation"])
```
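Many fine-tuning tools expect chat turns as `role`/`content` pairs with a `user` role rather than `prompter`. The sketch below shows one way to convert an example's `'conversation'` list into that shape. The helper name `to_chat_messages` and the target key and role names are assumptions for illustration, not part of this dataset.

```python
def to_chat_messages(conversation):
    """Map OASST-DE turns to {'role', 'content'} dicts (target naming is an assumption)."""
    role_map = {"prompter": "user", "assistant": "assistant"}
    return [
        {"role": role_map[turn["role"]], "content": turn["text"]}
        for turn in conversation
    ]


# The sample conversation shown in the structure section above.
example = [
    {"role": "prompter", "text": "Moin, wie geht's dir?"},
    {"role": "assistant", "text": "Moin Moin! Mir geht es gut, und dir?"},
]
messages = to_chat_messages(example)
print(messages[0]["role"])  # prints "user"
```

An unknown role raises a `KeyError` instead of being passed through silently, which makes unexpected data visible early.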
Provider:
Observingserver