
umarzein/databricks-dolly-15k-id

Source: Hugging Face. Updated: 2023-05-07. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/umarzein/databricks-dolly-15k-id
Description:
---
license: cc-by-sa-3.0
---

status: incomplete (needs further adjustments)

This dataset was created by translating "databricks-dolly-15k.jsonl" from English into Indonesian using facebook/m2m100_418M and then applying further adjustments.

Further adjustments include:

1. fixing words that are still in English
2. adjusting responses that start with stopwords, e.g. "oleh", "di", "dengan"
3. fixing repetitions that occur in multi-line text ("Everything Everything Everything Everything ...")

This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

## Caveats

The current Databricks dolly 15k dataset may not completely match this one.

Row indices that contain repetition errors (207):

96 112 262 273 369 376 389 410 415 432 581 586 597 685 870 886 936 957 964 979 985 1025 1120 1216 1223 1246 1251 1262 1316 1495 1552 1614 1684 1697 1733 1756 1808 1878 1893 2060 2118 2152 2168 2464 2474 2615 2663 2712 2829 2971 3046 3068 3123 3154 3178 3289 3336 3340 3401 3545 3574 3593 3599 3629 3745 3883 3889 3896 3967 3978 3993 4181 4186 4220 4232 4338 4358 4460 4497 4516 4614 4645 4689 4757 4809 4826 4865 5107 5232 5266 5296 5418 5493 5754 5791 5797 5819 5852 5968 6354 6409 6481 6499 6553 6555 6580 6659 6866 6911 6944 7020 7074 7116 7169 7390 7599 7777 7787 7846 7870 7894 8036 8051 8090 8144 8188 8294 8349 8406 8471 8527 8546 8552 8777 8836 8852 9026 9133 9136 9186 9287 9329 9335 9365 9475 9508 9509 9607 9630 9701 9731 9790 9822 9855 10214 10251 10308 10475 10536 10546 10683 10776 10803 10972 11069 11085 11199 11334 11350 11407 11421 11540 11570 11658 11758 11774 12004 12064 12374 12380 12519 12591 12623 12764 12844 12849 12923 12926 12953 13099 13225 13231 13352 13428 13602 13634 13810 13833 13851 13893 14021 14097 14145 14234 14240 14826 14884
Provider:
umarzein
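Summary of the original information

Dataset overview

This dataset was created by translating "databricks-dolly-15k.jsonl" from English into Indonesian with the facebook/m2m100_418M model and then applying further adjustments.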
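The translation step can be pictured with the sketch below. It is not the author's actual script, which the card does not show. It assumes the Hugging Face `transformers` package is installed, and `batched`, `translate_en_to_id`, and `batch_size` are hypothetical names chosen for illustration. The import is deferred so the batching helper can be used without loading the model.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_en_to_id(texts: List[str], batch_size: int = 8) -> List[str]:
    """Translate English strings into Indonesian with facebook/m2m100_418M.

    Hypothetical helper: the batch size and truncation are assumptions,
    not settings documented by the dataset card.
    """
    # Deferred import: transformers (and the model download) are only
    # needed when a translation is actually requested.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    name = "facebook/m2m100_418M"
    tokenizer = M2M100Tokenizer.from_pretrained(name)
    model = M2M100ForConditionalGeneration.from_pretrained(name)
    tokenizer.src_lang = "en"  # source language: English

    translated: List[str] = []
    for batch in batched(texts, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(
            **encoded,
            # Force the decoder to start in Indonesian ("id").
            forced_bos_token_id=tokenizer.get_lang_id("id"),
        )
        translated.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translated
```

Long multi-line inputs may be cut off by `truncation=True`, so a real pipeline would probably need to split such text before translating it.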

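Further adjustments include:

  1. Fixing words that are still in English.
  2. Adjusting responses that start with stopwords (e.g. "oleh", "di", "dengan").
  3. Fixing repetitions in multi-line text (e.g. "Everything Everything Everything Everything ...").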
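Two of the adjustments above lend themselves to simple automated checks. The sketch below shows one possible way to flag candidate rows. It is not the method the author used, which the card does not describe. The threshold of three or more consecutive identical words and the names `has_repetition` and `starts_with_stopword` are assumptions made for illustration. The stopword tuple holds only the three examples given above.

```python
import re

# Only the three example stopwords named in the card; a real list would be longer.
LEADING_STOPWORDS = ("oleh", "di", "dengan")

# A word followed by at least two more copies of itself (assumed threshold).
REPEAT_RUN = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)


def has_repetition(text: str) -> bool:
    """Return True if some word occurs three or more times in a row."""
    return REPEAT_RUN.search(text) is not None


def starts_with_stopword(text: str) -> bool:
    """Return True if the first whitespace-separated word is a listed stopword."""
    words = text.strip().split(maxsplit=1)
    return bool(words) and words[0].lower() in LEADING_STOPWORDS
```

Both checks only flag rows for review. Deciding how to rewrite a flagged response still needs a human or a separate correction step.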

License

This dataset can be used for any purpose, academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Caveats

  • The current Databricks dolly 15k dataset may not completely match this one.
  • Row indices that contain repetition errors (207 rows in total):
    • 96, 112, 262, 273, 369, 376, 389, 410, 415, 432, 581, 586, 597, 685, 870, 886, 936, 957, 964, 979, 985, 1025, 1120, 1216, 1223, 1246, 1251, 1262, 1316, 1495, 1552, 1614, 1684, 1697, 1733, 1756, 1808, 1878, 1893, 2060, 2118, 2152, 2168, 2464, 2474, 2615, 2663, 2712, 2829, 2971, 3046, 3068, 3123, 3154, 3178, 3289, 3336, 3340, 3401, 3545, 3574, 3593, 3599, 3629, 3745, 3883, 3889, 3896, 3967, 3978, 3993, 4181, 4186, 4220, 4232, 4338, 4358, 4460, 4497, 4516, 4614, 4645, 4689, 4757, 4809, 4826, 4865, 5107, 5232, 5266, 5296, 5418, 5493, 5754, 5791, 5797, 5819, 5852, 5968, 6354, 6409, 6481, 6499, 6553, 6555, 6580, 6659, 6866, 6911, 6944, 7020, 7074, 7116, 7169, 7390, 7599, 7777, 7787, 7846, 7870, 7894, 8036, 8051, 8090, 8144, 8188, 8294, 8349, 8406, 8471, 8527, 8546, 8552, 8777, 8836, 8852, 9026, 9133, 9136, 9186, 9287, 9329, 9335, 9365, 9475, 9508, 9509, 9607, 9630, 9701, 9731, 9790, 9822, 9855, 10214, 10251, 10308, 10475, 10536, 10546, 10683, 10776, 10803, 10972, 11069, 11085, 11199, 11334, 11350, 11407, 11421, 11540, 11570, 11658, 11758, 11774, 12004, 12064, 12374, 12380, 12519, 12591, 12623, 12764, 12844, 12849, 12923, 12926, 12953, 13099, 13225, 13231, 13352, 13428, 13602, 13634, 13810, 13833, 13851, 13893, 14021, 14097, 14145, 14234, 14240, 14826, 14884
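Readers who want to exclude the flagged rows could do so along the following lines. This is a minimal sketch that assumes the indices above are zero-based positions in the dataset file, which the card does not state. `parse_indices` and `drop_flagged` are hypothetical helper names, and `SAMPLE` holds only the first ten indices from the list above.

```python
import re
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")

# First ten flagged indices copied from the list above; paste the full list in practice.
SAMPLE = "96, 112, 262, 273, 369, 376, 389, 410, 415, 432"


def parse_indices(text: str) -> Set[int]:
    """Extract every integer from a comma- or space-separated index list."""
    return {int(token) for token in re.findall(r"\d+", text)}


def drop_flagged(rows: Sequence[T], flagged: Set[int]) -> List[T]:
    """Keep only rows whose position is not flagged (assumes zero-based indices)."""
    return [row for position, row in enumerate(rows) if position not in flagged]
```

If the indices turn out to be one-based, subtracting one from each parsed value before filtering would be the only change needed.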