
viet4all

ModelScope community. Updated: 2025-11-27. Indexed: 2025-11-03.
Download link:
https://modelscope.cn/datasets/ontocord/viet4all
Resource description:
# Viet4All: Enhancing Vietnamese Conversational AI

We are excited to introduce **Viet4All**, a subsample of other performant open-source conversational datasets. We start with a carefully curated subset of the **OpenHermes-2.5-Viet dataset**, co-created by **[@qnguyen3](https://twitter.com/stablequan)** and **[@teknium](https://twitter.com/teknium)**. This dataset is specifically designed to support the training and evaluation of multilingual language models, such as Vistral-7B-chat and VinaLlama-7B-chat, and is derived from our Supervised Fine-Tuning (SFT) data. We have included Vietnamese here, but will add more languages.

# Version

This is version 0.1.

# Dataset Description

For version 0.1, we stripped down our OpenHermes-2.5 data, subsampling its 1M data items to 25,000+ and translating them to Vietnamese. The subsampling was done by applying TopicBERT and clustering the embeddings by cosine similarity to slim down the original dataset.

This dataset includes the following topics:

- Role Playing
- Context-Aware Question Answering
- Agent Prompting
- Coding
- and more...

The majority of the answers in this dataset were generated by state-of-the-art language models, **GPT-4** and **Claude 2.1**, with the aim of providing high-quality and diverse responses.

We claim no rights to the output generated by other models, and merely redistribute a translation of the original OpenHermes-2.5 dataset. The dataset is intended for research into the multilingual abilities of AI models, in order to ensure fairness and equal access.

# License

Our own modifications to the OpenHermes-2.5 dataset are licensed under CC-0. However, the original OpenHermes-2.5 dataset is licensed as set forth [here](https://huggingface.co/datasets/teknium/OpenHermes-2.5). Moreover, you should look over the GPT-4 and Claude terms of use and consult your advisors on whether the data can be used for your purposes.

# Acknowledgement

We would like to express our sincere gratitude to the following individuals and organizations for their invaluable contributions:

- **@teknium**, **@autometa**, and other open-source dataset creators for the remarkable OpenHermes-2.5 Dataset, which serves as the foundation for ViHermes-25K.
- **@qnguyen3** and **@nampdn-ai** for their dedication in translating and regenerating the answers in Vietnamese, making this dataset accessible to the Vietnamese AI community.

We are committed to fostering collaboration and advancement in the field of natural language processing, and we believe that **Viet4All** will be a valuable resource for researchers and developers alike.

# Notice

Please be aware that you use this dataset at your own risk and we disclaim all liabilities with respect to the data, including any harmful or biased responses. This dataset has **NOT** been filtered for safety. Moreover, we disclaim all warranties, whether express or implied, and all liabilities with respect to infringement, fitness for a particular purpose, or otherwise.

```
@article{Viet4All2024,
  title={Viet4All: Enhancing Multilingual Conversational AI},
  author={Nguyen, Q., },
  journal={GitHub repository},
  year={2024},
  publisher={HuggingFace Datasets}
}
```
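As an illustration of the subsampling step mentioned under Dataset Description, the following is a minimal sketch of cosine-similarity-based clustering used to thin out near-duplicate prompts. It is not the authors' pipeline: the TopicBERT step is not reproduced, the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the greedy procedure, the `subsample` name, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def subsample(texts, threshold=0.6):
    """Greedy clustering: keep a text only if it is not too similar
    to any representative kept so far (hypothetical threshold)."""
    kept = []  # (text, vector) pairs, one representative per cluster
    for text in texts:
        vector = embed(text)
        if all(cosine_similarity(vector, rep) < threshold for _, rep in kept):
            kept.append((text, vector))
    return [text for text, _ in kept]


prompts = [
    "write a python function to sort a list",
    "write a python function to sort a list of numbers",
    "pretend you are a pirate and greet me",
    "answer the question using the given context",
]
selected = subsample(prompts)
print(selected)
```

In this toy run the second prompt is dropped because it is nearly identical to the first, while the role-playing and question-answering prompts are kept as representatives of their own clusters. With a real embedding model, the same loop would group items by semantic rather than lexical similarity.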

Provider:
maas
Created:
2025-10-11