Granite Finnish Ngrams
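Granite Finnish Ngrams Dataset Overview
Dataset Description
This dataset contains character-based Finnish ngrams (unigrams, bigrams, trigrams). It was used to develop the Granite Layout and is compatible with the Keyboard Layout Optimizer. Atypical characters were removed from the corpus before the ngrams were generated.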
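To make "character-based ngrams" concrete, the following minimal sketch counts overlapping character sequences of length 1, 2, and 3 and converts the counts to percentages. It is an illustration only, not the tooling used to build this dataset: the function names `char_ngrams` and `to_percent` and the sample sentence are assumptions, and details such as how line breaks are handled may differ in the real pipeline.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_percent(counts: Counter) -> dict:
    """Convert raw counts to percentages of the total n-gram count."""
    total = sum(counts.values())
    return {gram: 100.0 * count / total for gram, count in counts.items()}


# Hypothetical sample text, not taken from the corpus.
sample = "tämä on testi"

unigrams = char_ngrams(sample, 1)
bigrams = char_ngrams(sample, 2)
trigrams = char_ngrams(sample, 3)

# Print the three most frequent bigrams with their share of all bigrams.
bigram_percent = to_percent(bigrams)
for gram, count in bigrams.most_common(3):
    print(repr(gram), round(bigram_percent[gram], 2))
```

A sliding window with step 1 means a text of length N yields N - n + 1 ngrams of length n, so spaces and punctuation take part in bigrams and trigrams just like letters.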
Corpus Sources
The corpus is a mix of the following datasets:
- 33.333% Finnish OpenSubtitles 2017 corpus (opensub-fi-2017-src)
- 66.666% Finnish Wikipedia corpus (wikipedia-fi-2017-src)
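The source does not say how the two corpora were combined. One plausible way to obtain such a mix, sketched below, is to normalize each source's ngram frequencies and then take a weighted sum with weights of one third and two thirds. The function name `mix_frequencies` and the toy counts are hypothetical; only the source names and the proportions come from the list above.

```python
def mix_frequencies(sources: dict, weights: dict) -> dict:
    """Blend per-source n-gram frequencies into one relative distribution.

    Each source is normalized to sum to 1 first, so a larger corpus does
    not dominate merely because it contains more text.
    """
    weight_total = sum(weights.values())
    mixed = {}
    for name, freqs in sources.items():
        source_total = sum(freqs.values())
        share = weights[name] / weight_total
        for gram, value in freqs.items():
            mixed[gram] = mixed.get(gram, 0.0) + share * value / source_total
    return mixed


# Toy counts for illustration only; they are not taken from the real corpora.
sources = {
    "opensub-fi-2017-src": {"a": 1, "b": 1},
    "wikipedia-fi-2017-src": {"a": 3, "b": 1},
}
weights = {"opensub-fi-2017-src": 1 / 3, "wikipedia-fi-2017-src": 2 / 3}

mixed = mix_frequencies(sources, weights)
print({gram: round(value, 4) for gram, value in mixed.items()})
```

Normalizing before weighting keeps the stated proportions independent of how large each raw corpus happens to be.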
Related Datasets
Most Common Ngrams
Most Common Unigrams
──────────────────── finnish ─────────────────────
1: ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.96
2: a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.82
3: i ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.92
4: n ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.41
5: t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.39
6: e ▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.73
7: s ▇▇▇▇▇▇▇▇▇▇▇▇ 6.22
8: l ▇▇▇▇▇▇▇▇▇ 4.83
9: o ▇▇▇▇▇▇▇▇▇ 4.65
10: u ▇▇▇▇▇▇▇▇ 4.11
11: k ▇▇▇▇▇▇▇▇ 4.10
12: ä ▇▇▇▇▇▇▇ 3.45
13: m ▇▇▇▇▇ 2.70
14: r ▇▇▇▇ 2.25
15: v ▇▇▇▇ 1.83
16: h ▇▇▇ 1.72
17: . ▇▇▇ 1.65
18: p ▇▇▇ 1.54
19: j ▇▇▇ 1.54
20: y ▇▇▇ 1.39
21: d ▇▇ 0.87
22: , ▇ 0.69
23: ö ▇ 0.36
24: 1 ▇ 0.34
25: 0 ▇ 0.30
26: g ▇ 0.30
27: - ▇ 0.27
28: c 0.24
29: ⏎ 0.24
30: b 0.23
31: 9 0.21
32: ? 0.20
33: " 0.20
34: 2 0.19
35: f 0.15
36: 8 0.09
37: 5 0.09
38: 3 0.08
39: w 0.08
40: 4 0.08
41: 7 0.08
42: 6 0.08
43: ! 0.07
44: ) 0.07
45: ( 0.07
46: : 0.06
47: z 0.04
48: x 0.02
49:  0.01
50: q 0.01
51: / 0.01
52: ; 0.00
53: % 0.00
54: + 0.00
55: = 0.00
56: & 0.00
57: [ 0.00
58: ] 0.00
59: * 0.00
60: _ 0.00
61: # 0.00
62: | 0.00
63: > 0.00
64: $ 0.00
65: < 0.00
66: ~ 0.00
67: € 0.00
68: @ 0.00
69: } 0.00
70: { 0.00
71: ` 0.00
72: ^ 0.00
73:  0.00
Most Common Bigrams
──────────────────── finnish ─────────────────────
1: n␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.07
2: a␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.10
3: en ▇▇▇▇▇▇▇▇▇▇▇▇ 1.66
4: in ▇▇▇▇▇▇▇▇▇▇▇▇ 1.64
5: ta ▇▇▇▇▇▇▇▇▇▇▇ 1.50
6: .␣ ▇▇▇▇▇▇▇▇▇▇ 1.38
7: is ▇▇▇▇▇▇▇▇▇ 1.31
8: an ▇▇▇▇▇▇▇▇▇ 1.29
9: si ▇▇▇▇▇▇▇▇▇ 1.21
10: ␣k ▇▇▇▇▇▇▇▇▇ 1.19
11: st ▇▇▇▇▇▇▇▇ 1.11
12: ␣s ▇▇▇▇▇▇▇▇ 1.10
13: i␣ ▇▇▇▇▇▇▇▇ 1.07
14: ␣t ▇▇▇▇▇▇▇ 1.00
15: tt ▇▇▇▇▇▇▇ 0.96
16: ␣o ▇▇▇▇▇▇▇ 0.94
17: it ▇▇▇▇▇▇▇ 0.94
18: ␣m ▇▇▇▇▇▇▇ 0.93
19: aa ▇▇▇▇▇▇▇ 0.93
20: ä␣ ▇▇▇▇▇▇ 0.90
21: ka ▇▇▇▇▇▇ 0.89
22: ll ▇▇▇▇▇▇ 0.88
23: se ▇▇▇▇▇▇ 0.86
24: sa ▇▇▇▇▇▇ 0.86
25: ␣j ▇▇▇▇▇▇ 0.86
26: on ▇▇▇▇▇▇ 0.83
27: al ▇▇▇▇▇▇ 0.81
28: li ▇▇▇▇▇▇ 0.81
29: te ▇▇▇▇▇▇ 0.81
30: ai ▇▇▇▇▇▇ 0.79
31: tä ▇▇▇▇▇▇ 0.78
32: ti ▇▇▇▇▇ 0.76
33: ␣v ▇▇▇▇▇ 0.75
34: la ▇▇▇▇▇ 0.74
35: ja ▇▇▇▇▇ 0.73
36: va ▇▇▇▇▇ 0.72
37: ␣p ▇▇▇▇▇ 0.72
38: el ▇▇▇▇▇ 0.72
39: ␣h ▇▇▇▇▇ 0.68
40: et ▇▇▇▇▇ 0.67
Most Common Bigrams (Ignoring Spaces)
──────────────────── finnish ─────────────────────
1: en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.20
2: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.17
3: ta ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.99
4: is ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.74
5: an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.71
6: si ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.60
7: st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.47
8: tt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.27
9: it ▇▇▇▇▇▇▇▇▇▇▇▇ 1.24
10: aa ▇▇▇▇▇▇▇▇▇▇▇▇ 1.23
11: ka ▇▇▇▇▇▇▇▇▇▇▇▇ 1.18
12: ll ▇▇▇▇▇▇▇▇▇▇▇▇ 1.17
13: se ▇▇▇▇▇▇▇▇▇▇▇ 1.14
14: sa ▇▇▇▇▇▇▇▇▇▇▇ 1.13
15: on ▇▇▇▇▇▇▇▇▇▇▇ 1.10
16: al ▇▇▇▇▇▇▇▇▇▇▇ 1.08
17: li ▇▇▇▇▇▇▇▇▇▇▇ 1.07
18: te ▇▇▇▇▇▇▇▇▇▇▇ 1.07
19: ai ▇▇▇▇▇▇▇▇▇▇ 1.04
20: tä ▇▇▇▇▇▇▇▇▇▇ 1.04
21: ti ▇▇▇▇▇▇▇▇▇▇ 1.00
22: la ▇▇▇▇▇▇▇▇▇▇ 0.98
23: ja ▇▇▇▇▇▇▇▇▇▇ 0.96
24: va ▇▇▇▇▇▇▇▇▇▇ 0.96
25: el ▇▇▇▇▇▇▇▇▇▇ 0.96
26: et ▇▇▇▇▇▇▇▇▇ 0.89
27: mi ▇▇▇▇▇▇▇▇▇ 0.87
28: ol ▇▇▇▇▇▇▇▇ 0.84
29: le ▇▇▇▇▇▇▇▇ 0.83
30: oi ▇▇▇▇▇▇▇▇ 0.81
31: ne ▇▇▇▇▇▇▇▇ 0.80
32: ss ▇▇▇▇▇▇▇▇ 0.79
33: tu ▇▇▇▇▇▇▇▇ 0.76
34: ma ▇▇▇▇▇▇▇▇ 0.75
35: as ▇▇▇▇▇▇▇ 0.74
36: än ▇▇▇▇▇▇▇ 0.74
37: ku ▇▇▇▇▇▇▇ 0.73
38: ko ▇▇▇▇▇▇▇ 0.70
39: ii ▇▇▇▇▇▇
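The space-free values are larger than their counterparts in the full bigram list: "en" rises from 1.66 to 2.20 and "in" from 1.64 to 2.17, a ratio of roughly 0.75 in both cases. That pattern is consistent with dropping every bigram that contains a space and renormalizing the remainder to 100%, although the source does not state the exact procedure. The sketch below shows that interpretation; the function name `drop_space_bigrams` and the toy counts are assumptions.

```python
def drop_space_bigrams(freqs: dict, space: str = "␣") -> dict:
    """Remove bigrams containing the space marker, then renormalize to 100%."""
    kept = {gram: value for gram, value in freqs.items() if space not in gram}
    total = sum(kept.values())
    return {gram: 100.0 * value / total for gram, value in kept.items()}


# Toy counts for illustration only; "␣" marks a space, as in the charts above.
toy_bigrams = {"n␣": 30, "en": 20, "in": 20, "␣k": 10, "ta": 20}

without_spaces = drop_space_bigrams(toy_bigrams)
for gram, percent in sorted(without_spaces.items(), key=lambda kv: -kv[1]):
    print(gram, round(percent, 2))
```

Because the denominator shrinks once space-containing bigrams are removed, every remaining bigram's share grows by the same factor, which matches the near-constant ratio seen between the two charts.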




