eson/cc100-samples
Dataset Overview
Dataset name: CC100
Alias: cc100
Languages: multilingual, including but not limited to af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, ff, fi, fr, fy, ga, gd, gl, gn, gu, ha, he, hi, hr, ht, hu, hy, id, ig, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lg, li, ln, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, ns, om, or, pa, pl, ps, pt, qu, rm, ro, ru, sa, sc, sd, si, sk, sl, so, sq, sr, ss, su, sv, sw, ta, te, th, tl, tn, tr, ug, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu
BCP 47 language codes: bn-Latn, hi-Latn, my-x-zawgyi, ta-Latn, te-Latn, ur-Latn, zh-Hans, zh-Hant
License: unknown
Multilinguality: multilingual
Size category: 1K<n<10K
Source datasets: original
Task categories:
- text-generation
- fill-mask
Task IDs:
- language-modeling
- masked-language-modeling
Paperswithcode ID: cc100
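Dataset Structure
Data Instances
Each data point contains the following fields:
- id: the ID of the example
- text: the text content
Example data point (from the am configuration):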
{id: 0, text: ተለዋዋጭ የግድግዳ አንግል ሙቅ አንቀሳቅሷል ቲ-አሞሌ አጥቅሼ ... }
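The two-field record layout can be illustrated with a small sketch. It assumes, without confirmation from the dataset card, that each line of a plain-text data file becomes one example and that `id` is simply the zero-based line index. The function name `lines_to_records` is hypothetical and is not part of any library.

```python
from typing import Dict, Iterable, List, Union

Record = Dict[str, Union[int, str]]


def lines_to_records(lines: Iterable[str]) -> List[Record]:
    """Build {id, text} records from raw text lines.

    Assumption: one example per line, with id equal to the line index.
    """
    records: List[Record] = []
    for index, line in enumerate(lines):
        # Strip only the trailing newline so the text itself is unchanged.
        records.append({"id": index, "text": line.rstrip("\n")})
    return records


# Hypothetical sample lines standing in for the contents of a data file.
sample_lines = ["first example\n", "second example\n"]
for record in lines_to_records(sample_lines):
    print(record["id"], record["text"])
```

In practice the records would more likely be obtained through the Hugging Face `datasets` library, for example with `load_dataset("eson/cc100-samples", "am", split="train")`, which requires network access and is therefore left out of the sketch.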
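Data File Configurations
The dataset contains one configuration per language, and each configuration maps to a single data file with a train split, for example: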
- config_name: am, split: train, path: data/am.txt
- config_name: ar, split: train, path: data/ar.txt
- config_name: as, split: train, path: data/as.txt
- config_name: az, split: train, path: data/az.txt
- config_name: be, split: train, path: data/be.txt
- config_name: bg, split: train, path: data/bg.txt
- config_name: bn, split: train, path: data/bn.txt
- config_name: bn_rom, split: train, path: data/bn_rom.txt
- config_name: br, split: train, path: data/br.txt
- config_name: bs, split: train, path: data/bs.txt
- config_name: ca, split: train, path: data/ca.txt
- config_name: cs, split: train, path: data/cs.txt
- config_name: cy, split: train, path: data/cy.txt
- config_name: da, split: train, path: data/da.txt
- config_name: de, split: train, path: data/de.txt
- config_name: el, split: train, path: data/el.txt
- config_name: en, split: train, path: data/en.txt
- config_name: eo, split: train, path: data/eo.txt
- config_name: es, split: train, path: data/es.txt
- config_name: et, split: train, path: data/et.txt
- config_name: eu, split: train, path: data/eu.txt
- config_name: fa, split: train, path: data/fa.txt
- config_name: ff, split: train, path: data/ff.txt
- config_name: fi, split: train, path: data/fi.txt
- config_name: fr, split: train, path: data/fr.txt
- config_name: fy, split: train, path: data/fy.txt
- config_name: ga, split: train, path: data/ga.txt
- config_name: gd, split: train, path: data/gd.txt
- config_name: gl, split: train, path: data/gl.txt
- config_name: gn, split: train, path: data/gn.txt
- config_name: gu, split: train, path: data/gu.txt
- config_name: ha, split: train, path: data/ha.txt
- config_name: he, split: train, path: data/he.txt
- config_name: hi, split: train, path: data/hi.txt
- config_name: hi_rom, split: train, path: data/hi_rom.txt
- config_name: hr, split: train, path: data/hr.txt
- config_name: ht, split: train, path: data/ht.txt
- config_name: hu, split: train, path: data/hu.txt
- config_name: hy, split: train, path: data/hy.txt
- config_name: id, split: train, path: data/id.txt
- config_name: ig, split: train, path: data/ig.txt
- config_name: is, split: train, path: data/is.txt
- config_name: it, split: train, path: data/it.txt
- config_name: ja, split: train, path: data/ja.txt
- config_name: jv, split: train, path: data/jv.txt
- config_name: ka, split: train, path: data/ka.txt
- config_name: kk, split: train, path: data/kk.txt
- config_name: km, split: train, path: data/km.txt
- config_name: kn, split: train, path: data/kn.txt
- config_name: ko, split: train, path: data/ko.txt
- config_name: ku, split: train, path: data/ku.txt
- config_name: ky, split: train, path: data/ky.txt
- config_name: la, split: train, path: data/la.txt
- config_name: lg, split: train, path: data/lg.txt
- config_name: li, split: train, path: data/li.txt
- config_name: ln, split: train, path: data/ln.txt
- config_name: lo, split: train, path: data/lo.txt
- config_name: lt, split: train, path: data/lt.txt
- config_name: lv, split: train, path: data/lv.txt
- config_name: mg, split: train, path: data/mg.txt
- config_name: mk, split: train, path: data/mk.txt
- config_name: ml, split: train, path: data/ml.txt
- config_name: mn, split: train, path: data/mn.txt
- config_name: mr, split: train, path: data/mr.txt
- config_name: ms, split: train, path: data/ms.txt
- config_name: my, split: train, path: data/my.txt
- config_name: my_zaw, split: train, path: data/my_zaw.txt
- config_name: ne, split: train, path: data/ne.txt
- config_name: nl, split: train, path: data/nl.txt
- config_name: no, split: train, path: data/no.txt
- config_name: ns, split: train, path: data/ns.txt
- config_name: om, split: train, path: data/om.txt
- config_name: or, split: train, path: data/or.txt
- config_name: pa, split: train, path: data/pa.txt
- config_name: pl, split: train, path: data/pl.txt
- config_name: ps, split: train, path: data/ps.txt
- config_name: pt, split: train, path: data/pt.txt
- config_name: qu, split: train, path: data/qu.txt
- config_name: rm, split: train, path: data/rm.txt
- config_name: ro, split: train, path: data/ro.txt
- config_name: ru, split: train, path: data/ru.txt
- config_name: sa, split: train, path: data/sa.txt
- config_name: si, split: train, path: data/si.txt
- config_name: sc, split: train, path: data/sc.txt
- config_name: sd, split: train, path: data/sd.txt
- config_name: sk, split: train, path: data/sk.txt
- config_name: sl, split: train, path: data/sl.txt
- config_name: so, split: train, path: data/so.txt
- config_name: sq, split: train, path: data/sq.txt
- config_name: sr, split: train, path: data/sr.txt
- config_name: ss, split: train, path: data/ss.txt
- config_name: su, split: train, path: data/su.txt
- config_name: sv, split: train, path: data/sv.txt
- config_name: sw, split: train, path: data/sw.txt
- config_name: ta, split: train, path: data/ta.txt
- config_name: ta_rom, split: train, path: data/ta_rom.txt
- config_name: te, split: train, path: data/te.txt
- config_name: te_rom, split: train, path: data/te_rom.txt
- config_name: th, split: train, path: data/th.txt
- config_name: tl, split: train, path: data/tl.txt
- config_name: tn, split: train, path: data/tn.txt
- config_name: tr, split: train, path: data/tr.txt
- config_name: ug, split: train, path: data/ug.txt
- config_name: uk, split: train, path: data/uk.txt
- config_name: ur, split: train, path: data/ur.txt
- config_name: ur_rom, split: train, path: data/ur_rom.txt
- config_name: uz, split: train, path: data/uz.txt
- config_name: vi, split: train, path: data/vi.txt
- config_name: wo, split: train, path: data/wo.txt
- config_name: xh, split: train, path: data/xh.txt
- config_name: yi, split: train, path: data/yi.txt
- config_name: yo, split: train, path: data/yo.txt
- config_name: zh-Hans, split: train, path: data/zh-Hans.txt
- config_name: zh-Hant, split: train, path: data/zh-Hant.txt
- config_name: zu, split: train, path: data/zu.txt
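The configuration entries above follow a regular pattern: every config name maps to a train split stored at data/<config_name>.txt. The sketch below reproduces that pattern and also shows one plausible way the config names could line up with the BCP 47 codes listed in the overview. The helper names `build_config` and `to_bcp47` are hypothetical, and the `_rom` to `-Latn` and `my_zaw` to `my-x-zawgyi` correspondences are inferred from the two lists rather than stated by the dataset card.

```python
from typing import Dict, List


def build_config(config_name: str) -> Dict[str, object]:
    """Return one config entry following the data/<config_name>.txt pattern."""
    return {
        "config_name": config_name,
        "data_files": [{"split": "train", "path": f"data/{config_name}.txt"}],
    }


def to_bcp47(config_name: str) -> str:
    """Map a config name to a BCP 47 code (inferred mapping, an assumption)."""
    if config_name == "my_zaw":
        return "my-x-zawgyi"
    if config_name.endswith("_rom"):
        # Romanized variants are assumed to correspond to the Latin-script tags.
        return config_name[: -len("_rom")] + "-Latn"
    return config_name


# A few config names taken from the list above.
names: List[str] = ["am", "bn_rom", "my_zaw", "zh-Hans"]
for name in names:
    entry = build_config(name)
    print(entry["config_name"], entry["data_files"], to_bcp47(name))
```

Generating the entries from a list of names keeps the paths consistent with the config names and avoids maintaining more than a hundred near-identical entries by hand.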




