
NekoFi/alpaca-gpt4-indonesia-cleaned

Source: Hugging Face. Updated 2024-04-26. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/NekoFi/alpaca-gpt4-indonesia-cleaned
Resource description:
---
task_categories:
- text-generation
language:
- id
tags:
- alpaca
- indonesia
- instruction-finetuning
pretty_name: alpaca-gpt4-indonesia-clean
size_categories:
- 10K<n<100K
---

# Dataset Card for Alpaca-GPT4-Indonesian-Cleaned

- **Repository:** https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-indonesian

## Dataset Description

This is a reformatted and re-curated version of the original Alpaca-GPT4-Indonesian Dataset released by FreedomIntelligence. The original dataset had several issues which have been identified and addressed in this cleaned version:

1. **Hallucinations:** Many instructions in the original dataset were ambiguous or poorly defined, leading to irrelevant outputs. These have been refined or rewritten for clarity.

```
"instruction":"Ringkaskan artikel berikut dalam 200 kata.",
"input": "https://www.kompas.com/news/world-51461830",
"output": "Demonstrasi di Belarus merupakan bagian dari gelombang ketidakpuasan yang meningkat terhadap pemerintahan Alexander Lukashenko, yang telah memimpin sejak tahun 1994. Demonstrasi ini bertambah sejak Agustus, menyusul pemilu yang banyak dianggap curang. Kekerasan oleh polisi dan keamanan hanya membuat gerakan ini semakin besar, dengan ratusan ribu orang turun ke jalan menuntut kepemimpinan baru. Aktivis hak asasi manusia telah menuduh pemerintah menggunakan kekerasan ekstrem, yang memicu kecaman internasional dan sanksi dari Uni Eropa. Situasi di Belarus terus berkembang, dan komunitas internasional mengawasi dengan harapan akan tercapainya solusi damai."

"instruction": "Klasifikasikan bentuk berikut sesuai dengan geometri sederhana.",
"input": "<Gambar Terlampir>",
"output": "Segitiga"
```

2. **Merged Instructions:** Instructions that were merged inappropriately have been separated.

```
"instruction": "Apakah saya harus berinvestasi di saham?",
"input": "",
"output": "Tergantung pada situasi individu Anda. Berinvestasi di saham mempunyai risiko tetapi juga potensi keuntungan. Pastikan untuk mempelajari berbagai jenis saham, memahami risiko dan keuntungan, dan mengembangkan rencana yang cocok untuk Anda."
```

3. **Empty outputs:** All entries with empty outputs have been filled with appropriate responses.

```
"instruction": "Hitung luas bentuk berikut dalam sentimeter persegi.",
"input": "persegi panjang ukuran 4 cm x 5 cm",
"output": "Luasnya adalah 20 cm²."
```

4. **Consistency in Input Fields:** The original dataset's inconsistent input field usage has been standardized to "no input" when applicable.

```
"input":"no input"
```

5. **Corrected Mathematical Responses:** Mathematical inaccuracies found in the original dataset have been corrected.

```
"instruction": "Hitung median dari set data berikut.",
"input": "1, 2, 4, 5, 8, 9",
"output": "4.5"
```

6. **Clarified Instructions:** Vague or nonsensical instructions have been clarified or rewritten.

```
"instruction": "Tingkatkan ukuran font menjadi 12 poin.",
"input": "",
"output": "Ukuran font telah ditingkatkan menjadi 12 poin."
```

7. **Language Adjustments:** All dataset entries have been reviewed and adjusted for better understanding in Indonesian.

### Original Alpaca-GPT4-Indonesian Dataset Summary

Alpaca-GPT4-Indonesian is a dataset of instruction and demonstration generated for fine-tuning language models to better understand and execute instructions in Indonesian. This reformatted version aims to enhance the dataset's usability and accuracy.

### Supported Tasks and Leaderboards

The Alpaca-GPT4-Indonesian-Cleaned dataset is designed for instruction training of pre-trained language models, specifically tailored for Indonesian language tasks.

### Languages

The data in Alpaca-GPT4-Indonesian-Cleaned are exclusively in Indonesian (BCP-47 id).

## Dataset Structure

### Data Instances

An example of a "train" instance looks as follows:

```
{
  "instruction": "Buat tugas klasifikasi dengan mengelompokkan daftar item yang diberikan.",
  "input": "Apel, jeruk, pisang, stroberi, nanas",
  "output": "Kelas 1: Apel, Jeruk\nKelas 2: Pisang, Stroberi\nKelas 3: Nanas"
}
```

### Data Fields

* `instruction`: describes the task the model should perform.
* `input`: optional context or additional information for the task.
* `output`: the model's response to the instruction.

### Data Splits

|               | train |
|---------------|------:|
| alpaca        | 52002 |

## Dataset Creation

### Curation Rationale

This dataset was curated to address and correct the shortcomings of the original dataset, ensuring higher accuracy and usability for instruction-based tasks in Indonesian.

### Licensing Information

#Add Alter

### Citation Information

```
@misc{alpaca_id,
  author = {FreedomIntelligence, Ariel Fikru},
  year = {2024}
}
```
Provider:
NekoFi
Original information summary

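Dataset overview

  • Dataset name: Alpaca-GPT4-Indonesian-Cleaned
  • Task category: text generation
  • Language: Indonesian
  • Tags: alpaca, indonesia, instruction-finetuning
  • Pretty name: alpaca-gpt4-indonesia-clean
  • Size category: 10K<n<100K

Dataset description

Alpaca-GPT4-Indonesian-Cleaned is a reformatted and re-curated version of the original Alpaca-GPT4-Indonesian dataset. The original dataset had several issues, which this cleaned version addresses:

  1. Hallucinations: many instructions in the original dataset were ambiguous or poorly defined, leading to irrelevant outputs. These instructions have been refined or rewritten for clarity.
  2. Merged instructions: instructions that were merged inappropriately have been separated.
  3. Empty outputs: all entries with empty outputs have been filled with appropriate responses.
  4. Input field consistency: the inconsistent use of the input field in the original dataset has been standardized.
  5. Corrected mathematical responses: mathematical inaccuracies in the original dataset have been corrected.
  6. Clarified instructions: vague or nonsensical instructions have been clarified or rewritten.
  7. Language adjustments: all dataset entries have been reviewed and adjusted for better understanding in Indonesian.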
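
The kinds of problems listed above can be illustrated with a few simple record checks. The following is a minimal sketch only, not the curators' actual cleaning procedure, which the card does not describe. The names `find_issues` and `normalize_input` are hypothetical, and the heuristics (a bare URL or an angle-bracket placeholder as input) are assumptions drawn from the card's two hallucination examples.

```python
import re
import statistics

# Sentinel the dataset card says is used when a record has no input.
NO_INPUT = "no input"


def find_issues(record):
    """Return labels for problems of the kind the card describes (heuristic)."""
    issues = []
    if not record.get("output", "").strip():
        issues.append("empty_output")
    text = record.get("input", "").strip()
    if not text:
        # An empty input has not yet been standardized to the sentinel.
        issues.append("input_not_normalized")
    elif re.fullmatch(r"https?://\S+", text) or re.fullmatch(r"<[^<>]*>", text):
        # A bare URL or a placeholder such as "<Gambar Terlampir>" points at
        # content the model cannot see, which invites hallucinated output.
        issues.append("unresolvable_input")
    return issues


def normalize_input(record):
    """Return a copy of the record with an empty input replaced by NO_INPUT."""
    fixed = dict(record)
    if not fixed.get("input", "").strip():
        fixed["input"] = NO_INPUT
    return fixed


# Re-checking the card's corrected median example.
median_value = statistics.median([1, 2, 4, 5, 8, 9])  # 4.5
```

Checks like these only flag candidates. Deciding how to rewrite an ambiguous instruction or split a merged one still needs manual review.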

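Original dataset summary

Alpaca-GPT4-Indonesian is a dataset of instructions and demonstrations generated for fine-tuning language models to better understand and execute instructions in Indonesian. This reformatted version aims to improve the dataset's usability and accuracy.

Supported tasks and leaderboards

The Alpaca-GPT4-Indonesian-Cleaned dataset is designed for instruction training of pre-trained language models, specifically for Indonesian language tasks.

Languages

The data in Alpaca-GPT4-Indonesian-Cleaned are exclusively in Indonesian (BCP-47 id).

Dataset structure

Data instances

An example of a "train" instance looks as follows: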

{
  "instruction": "Buat tugas klasifikasi dengan mengelompokkan daftar item yang diberikan.",
  "input": "Apel, jeruk, pisang, stroberi, nanas",
  "output": "Kelas 1: Apel, Jeruk\nKelas 2: Pisang, Stroberi\nKelas 3: Nanas"
}

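Data fields

  • instruction: describes the task the model should perform.
  • input: context or additional information for the task (optional).
  • output: the model's response to the instruction.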
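
To show how the three fields might be combined for instruction tuning, here is a minimal sketch that renders one record as a prompt string. The card does not specify a prompt template, so the Alpaca-style section headers and the name `build_prompt` are assumptions for illustration. The "no input" sentinel is taken from the card's note on input field consistency.

```python
# Sentinel the dataset card says marks a record without input.
NO_INPUT = "no input"


def build_prompt(record):
    """Render a record as a prompt (assumed Alpaca-style layout)."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    # Include the input section only when real context is present.
    if context and context.lower() != NO_INPUT:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


example = {
    "instruction": "Hitung median dari set data berikut.",
    "input": "1, 2, 4, 5, 8, 9",
    "output": "4.5",
}
prompt = build_prompt(example)
```

During training, the `output` field would typically be appended after the response header as the target text. Treating "no input" the same as an empty string keeps the sentinel out of the prompt.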

Data splits

          train
alpaca    52002