KOpen-HQ-Hermes-2.5-60K
ModelScope community · Updated 2025-12-04 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/KOpen-HQ-Hermes-2.5-60K
Description:
# MarkrAI/KOpen-HQ-Hermes-2.5-60K
<p align="center">
<img src="./Kopen.webp" alt="img" width="300" height="300"/>
</p>
The **`KOpen-HQ-Hermes-2.5-60K`** dataset has been released!
Anyone can use it under the MIT license, so feel free to take advantage of this **high-quality dataset**.
In building this dataset, we relied on our own know-how rather than manual human effort as much as possible, so some translation errors may remain.
Please keep this in mind when you cook with it.
## Dataset Info
- Creator: Markr AI
- Developer: Seungyoo Lee, Kyujin Han
- Data generation:
We applied the Near Dedup algorithm to the [Open Hermes dataset](https://huggingface.co/datasets/teknium/OpenHermes-2.5) to remove highly similar data (criterion: **Jaccard similarity `>= 0.8`**), and then translated the remaining data with the DeepL API, parallelized across 8 workers.
Afterward, we used SOTA LLMs (GPT-4 Turbo, Gemini, Wizard LM, Llama 3.1 405B) to score the data with Alpaca-format prompts.
We then evaluated the appropriateness of these prompts, and extracted and published the high-scoring data.
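
The deduplication and score-filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the word 3-gram shingling, the function names (`shingles`, `jaccard`, `near_dedup`, `filter_by_score`), the `scores` field, and the `min_mean` cutoff are all assumptions made for the example. Only the Jaccard threshold of 0.8 comes from the description above.

```python
import re
from statistics import mean


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams for a text (n=3 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short texts are represented by a single shingle of all their words.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(texts: list[str], threshold: float = 0.8, n: int = 3) -> list[str]:
    """Keep a text only if it is below the threshold against every kept text."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for text in texts:
        s = shingles(text, n)
        # A similarity >= threshold against any kept text marks a near-duplicate.
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


def filter_by_score(rows: list[dict], min_mean: float) -> list[dict]:
    """Keep rows whose mean judge score reaches min_mean (hypothetical cutoff)."""
    return [row for row in rows if mean(row["scores"]) >= min_mean]


if __name__ == "__main__":
    samples = [
        "Explain how photosynthesis works in green plants today",
        "Explain how photosynthesis works in green plants today please",
        "Write a haiku about autumn rain",
    ]
    print(near_dedup(samples))
```

The pairwise comparison here is quadratic in the number of texts, which is fine for a demonstration; at the scale of a full dataset, near-deduplication is usually approximated with techniques such as MinHash and locality-sensitive hashing.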
## Dataset's purpose
Our Markr AI research guild aims to make a small contribution to the Korean open-source community.
Through this effort, we hope to invigorate existing Korean LLMs and their ecosystem, fostering the growth of many excellent Korean language models within the expanding community.
This work is released under the MIT license, and you are welcome to use it. Our small wish, however, is that members do not merely use and benefit from this culture of community engagement and sharing, but also contribute to its development and, in doing so, help it evolve further.
Lastly, if you start cooking with this dataset, please press the like button to show your support.
Provider:
maas
Created:
2024-08-29



