skvarre/hopkok-v1
收藏Hugging Face2024-05-17 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/skvarre/hopkok-v1
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: conversations
list:
- name: from
dtype: string
- name: value
dtype: string
- name: weight
dtype: int64
splits:
- name: train
num_bytes: 61076054.0
num_examples: 41342
download_size: 31148485
dataset_size: 61076054.0
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/651d868de1416f6d4894ede9/i63-yQg9aAeOokEtJkLtP.webp" />
</p>
## Dataset Details
Hopkok is a Swedish instruction dataset, consisting of translated examples, synthetically generated examples and Q&A examples collected from the web. The datasets have been cleaned and curated for training a Swedish large language model.
This dataset will be further iterated upon, and it is by no means considered a well cleaned and curated dataset in its current state.
## Dataset Sources
### SlimOrca-SV-33K
[SlimOrca-SV-33K](https://huggingface.co/datasets/skvarre/SlimOrca-SV-33k) is a machine translated version of a subset of the [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset.
The dataset has been curated to include examples that translates well to swedish, meaning that examples that includes translating to different languages has been discarded. Further, the entire SlimOrca dataset was not translated. This might come to change in further versions.
The dataset was translated with a finetuned variant of the [AI-Sweden-Models/gpt-sw3-6.7b-v2-translator](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2-translator) model. This dataset is the only dataset in the mix that includes system prompts.
**Amount of Samples: 32972**
### Pure-Dove-SV
[Pure-Dove-SV](https://huggingface.co/datasets/skvarre/Pure-Dove-SV) is a machine translated version of the [LDJnr/Pure-Dove](https://huggingface.co/datasets/LDJnr/Pure-Dove) dataset. It has been curated and cleaned to fit a swedish context. It is included with the purpose of having more than one question/answer pair examples; meaning that one example can include a longer conversation between the human and the gpt.
This dataset was also translated with the same finetuned variant of the [AI-Sweden-Models/gpt-sw3-6.7b-v2-translator](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2-translator) model as mentioned above.
**Amount of Samples: 2848**
### Swedish-instruct-data-chatgpt4
[swedish-instruct-data-chatgpt4](https://huggingface.co/datasets/skvarre/swedish-instruct-data-chatgpt4) is a synthetic dataset generated using ChatGPT-4. Using few-shot prompts, the model was asked to generate instruction data based on different topics that revolved around Sweden and the Swedish language in different ways.
**Amount of Samples: 1363**
### BibblanSvarar-v0.1
[BibblanSvarar-v0.1](https://huggingface.co/datasets/skvarre/BibblanSvarar-v0.1) is a dataset collected from the website [bibblansvarar.se](https://bibblansvarar.se). Bibblansvarar was an initiative from the libraries of Malmö, Sweden, that from the assignment of Kungliga Biblioteket had a service where people could send in questions and get answers.
This Queston & Answer service was decided to be discontinued the spring of 2024. The dataset has been curated and cleaned to fit an instruction format. However, more careful curation and cleaning needs to be done (and will be done) with this dataset.
**Amount of Samples: 4092**
## Why hopkok?
Well, in english, it kinda means "concoction", meaning a hotchpotch of ingredients. Also, it sounds ridiculously inappropriate in english :)
提供机构:
skvarre
原始信息汇总
数据集详情
数据集信息
- 特征:
- conversations:
- from: 数据类型为字符串
- value: 数据类型为字符串
- weight: 数据类型为int64
- conversations:
- 分割:
- train:
- 字节数: 61076054.0
- 样本数: 41342
- train:
- 下载大小: 31148485
- 数据集大小: 61076054.0
配置
- 默认配置:
- 数据文件:
- train: 路径为
data/train-*
- train: 路径为
- 数据文件:
数据集来源
-
SlimOrca-SV-33K:
- 样本数量: 32972
- 描述: 机器翻译的Open-Orca/SlimOrca子集,使用AI-Sweden-Models/gpt-sw3-6.7b-v2-translator模型翻译。
-
Pure-Dove-SV:
- 样本数量: 2848
- 描述: 机器翻译的LDJnr/Pure-Dove数据集,使用相同的翻译模型。
-
Swedish-instruct-data-chatgpt4:
- 样本数量: 1363
- 描述: 使用ChatGPT-4生成的合成数据集,基于瑞典和瑞典语言的不同主题。
-
BibblanSvarar-v0.1:
- 样本数量: 4092
- 描述: 从bibblansvarar.se收集的数据集,由瑞典马尔默图书馆的问答服务产生。



