
afrizalha/KamusOne-28M-Indonesian

Source: Hugging Face. Updated 2024-05-02. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/afrizalha/KamusOne-28M-Indonesian
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- id
size_categories:
- 100K<n<1M
---

<center>
<img src="https://imgur.com/3LOSEy4.png" alt="KamusOne" width="600" height="300">
<p><em>KamusOne (Kamus-1) is a synthetic Indonesian language dataset, generated by Mixtral 8x7B.</em></p>
</center>

**About**

This dataset was generated by Mixtral 8x7B. For the procedure, Mixtral is instructed that it will act as an Indonesian language dictionary, a native Indonesian speaker, etc., and that it will explain the meaning of a series of Indonesian words. Hence the name of the dataset ("Kamus", literally "dictionary").

The word list was constructed as follows. First, I extracted word frequency lists from the [Indo4B dataset](https://github.com/IndoNLP/indonlu). Then, because the resulting list is big (up to millions of entries) and has a lot of clutter, I filtered it against a [full list of Indonesian words](https://github.com/Hidayathamir/kata-kbbi-github).

In total, the dataset consists of 27,691,418 words.

**IMPORTANT disclaimer**

The goal of this dataset is research, particularly creating a fluent language model from a homogeneous and low-volume dataset. It is not intended to augment existing pre-trained models. Why? Because the strength of Mixtral in Indonesian lies mostly in its grammatical accuracy. However, it is not very good at tasks in the Indonesian language, at least in my humble experience. Crucially, Mixtral would hallucinate the meaning of low-frequency Indonesian words (although this may be the case with other models too, like GPT-3.5). So this dataset is not intended for production-ready models, but for research training purposes only.

Developers/researchers who want to make a semantically accurate model should use only the data points with 'freq'=='A' and perhaps 'freq'=='B' in the dataset. The 'freq' column describes the words' frequency, classified into A-D in descending order of frequency.

**Creator**

By: Afrizal Hasbi Azizy

Find me: [LinkedIn](https://www.linkedin.com/in/afrizal-hasbi-azizy-182722218/)
Provider:
afrizalha
Summary of original information

Dataset overview

Basic information

  • License: MIT
  • Task category: text generation
  • Language: Indonesian
  • Dataset size: 100K<n<1M

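Dataset description

  • Name: KamusOne (Kamus-1)
  • Generation method: generated by Mixtral 8x7B, which was instructed to act as an Indonesian language dictionary and a native Indonesian speaker and to explain the meaning of a series of Indonesian words.
  • Data construction: word frequency lists were first extracted from the Indo4B dataset, then filtered against a full list of Indonesian words to remove clutter.
  • Data scale: 27,691,418 words in total.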
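
The construction step described above (count word frequencies in a corpus, then keep only words that appear in a reference word list) can be sketched as follows. This is a minimal illustration and not the author's actual pipeline: the function name `build_word_list`, the whitespace tokenization, and the lowercasing are all assumptions, and the tiny corpus stands in for Indo4B.

```python
from collections import Counter

def build_word_list(corpus_lines, dictionary_words):
    """Count word frequencies, then keep only words found in the dictionary list.

    Tokenization by whitespace and lowercasing are illustrative assumptions.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.lower().split())
    lexicon = {word.lower() for word in dictionary_words}
    kept = {word: n for word, n in counts.items() if word in lexicon}
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical stand-ins for the corpus and the reference word list.
corpus = ["saya makan nasi", "saya minum air lol", "makan nasi goreng"]
dictionary = ["saya", "makan", "nasi", "minum", "air", "goreng"]

word_list = build_word_list(corpus, dictionary)
```

Here the out-of-dictionary token `lol` is dropped, which mirrors how the filter removes clutter from the raw frequency list.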

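Intended use

  • Target use: primarily research, in particular building a fluent language model from a homogeneous, low-volume dataset. It is not intended for augmenting existing pre-trained models.
  • Caveats: Mixtral's strength in Indonesian lies mainly in grammatical accuracy; according to the author, it does not perform well on Indonesian-language tasks. In addition, Mixtral may hallucinate the meaning of low-frequency Indonesian words.
  • Usage recommendation: developers or researchers who want a semantically accurate model should use only the data points whose freq column is A, and perhaps B. The freq column describes word frequency, classified into A-D in descending order of frequency.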
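
The usage recommendation above amounts to a simple row filter on the `freq` column. The sketch below shows one way to express it in plain Python; only the `freq` column is documented in the dataset card, so the `word` field and the helper name `keep_high_frequency` are hypothetical.

```python
# Frequency classes the dataset card recommends for semantically accurate models.
TRUSTED_FREQ = frozenset({"A", "B"})

def keep_high_frequency(rows, allowed=TRUSTED_FREQ):
    """Return only the rows whose 'freq' value is in `allowed`."""
    return [row for row in rows if row.get("freq") in allowed]

# Hypothetical rows: 'freq' comes from the card, 'word' is an assumed field.
sample = [
    {"word": "makan", "freq": "A"},
    {"word": "rumah", "freq": "B"},
    {"word": "langka", "freq": "D"},
]

filtered = keep_high_frequency(sample)
strict = keep_high_frequency(sample, allowed={"A"})
```

When working with the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `ds.filter(lambda row: row["freq"] in {"A", "B"})`.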

Creator

  • Author: Afrizal Hasbi Azizy