umarzein/databricks-dolly-15k-id
Hugging Face · updated 2023-05-07 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/umarzein/databricks-dolly-15k-id
Resource description:
---
license: cc-by-sa-3.0
---
status: incomplete (need further adjustments)
This dataset was created by translating "databricks-dolly-15k.jsonl" from English into Indonesian with facebook/m2m100_418M and then applying further adjustments.
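The translation step could look roughly like the sketch below. It is not the author's script: the batch size, the `batched` and `translate_en_to_id` helper names, and the use of the `transformers` classes `M2M100Tokenizer` and `M2M100ForConditionalGeneration` are assumptions made for illustration. Only the model name comes from the description above.

```python
from typing import Iterator, List

# Model named in the dataset description.
MODEL_NAME = "facebook/m2m100_418M"


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_en_to_id(texts: List[str], batch_size: int = 8) -> List[str]:
    """Translate English strings into Indonesian (hypothetical helper)."""
    # Imported lazily so that `batched` stays usable without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(MODEL_NAME)
    model = M2M100ForConditionalGeneration.from_pretrained(MODEL_NAME)
    tokenizer.src_lang = "en"  # source language: English

    translated: List[str] = []
    for batch in batched(texts, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # Force the decoder to start with the Indonesian language token ("id").
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id("id")
        )
        translated.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translated
```

Calling `translate_en_to_id(["What is a polygon?"])` downloads the model on first use. Long passages may be cut off by `truncation=True`, so a real pipeline would probably need to split them first.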
Further adjustments include:
1. fixing words which are still in English
2. adjusting responses which start with stopwords e.g.: "oleh", "di", "dengan"
3. fixing repetitions which occur in multi-line text ("Everything Everything Everything Everything ...")
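Adjustments 2 and 3 can be approximated with a few regular-expression helpers. This is a minimal sketch under stated assumptions, not the author's procedure. The three-or-more threshold for a "repetition", the function names, and the three-word stopword tuple (taken only from the examples above) are all illustrative.

```python
import re

# Stopwords named in the list above; the full list the author used is not given.
LEADING_STOPWORDS = ("oleh", "di", "dengan")

# A word followed by two or more further copies of itself (three or more in a row).
# `\s+` also matches newlines, so runs that span several lines are caught.
_REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", flags=re.IGNORECASE)


def has_repetition(text: str) -> bool:
    """Return True if some word occurs three or more times in a row."""
    return _REPEAT_RE.search(text) is not None


def collapse_repetition(text: str) -> str:
    """Replace each run of a repeated word with a single copy of that word."""
    return _REPEAT_RE.sub(r"\1", text)


def starts_with_stopword(text: str) -> bool:
    """Return True if the first word of `text` is one of LEADING_STOPWORDS."""
    words = text.strip().split(maxsplit=1)
    return bool(words) and words[0].lower() in LEADING_STOPWORDS
```

For example, `collapse_repetition("Everything Everything Everything Everything ...")` yields `"Everything ..."`. A response flagged by `starts_with_stopword` would still need manual or rule-based rewriting, which this sketch does not attempt.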
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
## Caveats
The current version of Databricks' dolly 15k dataset may not completely match this one.
Row indices that contain repetition errors (207):
96
112
262
273
369
376
389
410
415
432
581
586
597
685
870
886
936
957
964
979
985
1025
1120
1216
1223
1246
1251
1262
1316
1495
1552
1614
1684
1697
1733
1756
1808
1878
1893
2060
2118
2152
2168
2464
2474
2615
2663
2712
2829
2971
3046
3068
3123
3154
3178
3289
3336
3340
3401
3545
3574
3593
3599
3629
3745
3883
3889
3896
3967
3978
3993
4181
4186
4220
4232
4338
4358
4460
4497
4516
4614
4645
4689
4757
4809
4826
4865
5107
5232
5266
5296
5418
5493
5754
5791
5797
5819
5852
5968
6354
6409
6481
6499
6553
6555
6580
6659
6866
6911
6944
7020
7074
7116
7169
7390
7599
7777
7787
7846
7870
7894
8036
8051
8090
8144
8188
8294
8349
8406
8471
8527
8546
8552
8777
8836
8852
9026
9133
9136
9186
9287
9329
9335
9365
9475
9508
9509
9607
9630
9701
9731
9790
9822
9855
10214
10251
10308
10475
10536
10546
10683
10776
10803
10972
11069
11085
11199
11334
11350
11407
11421
11540
11570
11658
11758
11774
12004
12064
12374
12380
12519
12591
12623
12764
12844
12849
12923
12926
12953
13099
13225
13231
13352
13428
13602
13634
13810
13833
13851
13893
14021
14097
14145
14234
14240
14826
14884
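One way to use the row indices listed above is to drop the flagged rows before training. The sketch below assumes the indices are zero-based row positions, which the card does not state, and works on a plain list of dictionaries. `drop_flagged_rows` and `FLAGGED_SAMPLE` are hypothetical names, and the sample holds only the first three indices from the list.

```python
from typing import Dict, Iterable, List, Sequence

# Illustrative subset: the first three flagged indices from the list above.
FLAGGED_SAMPLE = [96, 112, 262]


def drop_flagged_rows(rows: Sequence[Dict], flagged: Iterable[int]) -> List[Dict]:
    """Return the rows whose position is not in `flagged`.

    Assumes `flagged` holds zero-based positions into `rows`.
    """
    flagged_set = set(flagged)
    return [row for position, row in enumerate(rows) if position not in flagged_set]


# Toy records standing in for the real dataset rows.
toy_rows = [{"row": i} for i in range(300)]
clean_rows = drop_flagged_rows(toy_rows, FLAGGED_SAMPLE)
```

If the indices turn out to be one-based, subtract one from each before filtering.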
Provider:
umarzein
Summary of original information
Dataset overview
This dataset was created by translating "databricks-dolly-15k.jsonl" from English into Indonesian with the facebook/m2m100_418M model and then applying further adjustments.
Further adjustments include:
- Fixing words that are still in English.
- Adjusting responses that start with stopwords (e.g. "oleh", "di", "dengan").
- Fixing repetitions in multi-line text (e.g. "Everything Everything Everything Everything ...").
License
This dataset can be used for any purpose, academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Notes
- The current databricks dolly 15k dataset may not completely match this one.
- Row indices that contain repetition errors (207 rows in total): the same 207 indices listed under Caveats above.



