prakod/gcm_enhi_filtred_1200000
Source: Hugging Face. Updated 2024-08-16; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/prakod/gcm_enhi_filtred_1200000
Description:
This dataset includes several feature fields, such as idx (index), L1 (first language), L2 (second language), CM_candidates (candidate content), CM_candidates_transliterated_indictrans (candidate content transliterated with IndicTrans), and CMI_unicode_based_LID (a Unicode-based language identification score). The fields may be useful for natural language processing tasks such as text conversion or language identification. The dataset has one training split of 1,161,294 samples, totaling 348,252,581 bytes.
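The fields and sizes listed above can be checked with a short script. The following is a minimal sketch, not an official loader: it assumes the dataset is reachable on the Hugging Face Hub under the ID in the title, that the Hugging Face `datasets` package is installed, and that records expose the field names exactly as written in the description. The helper names (`missing_fields`, `average_bytes_per_sample`, `stream_train_split`) are hypothetical and introduced only for illustration.

```python
from typing import Any, List, Mapping

# Dataset ID taken from the listing title (assumed to match the Hub ID).
DATASET_ID = "prakod/gcm_enhi_filtred_1200000"

# Field names as given in the description above.
EXPECTED_FIELDS = (
    "idx",
    "L1",
    "L2",
    "CM_candidates",
    "CM_candidates_transliterated_indictrans",
    "CMI_unicode_based_LID",
)

# Figures quoted in the description.
NUM_TRAIN_SAMPLES = 1_161_294
TOTAL_BYTES = 348_252_581


def missing_fields(record: Mapping[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def average_bytes_per_sample() -> float:
    """Average size of one sample, derived from the quoted totals."""
    return TOTAL_BYTES / NUM_TRAIN_SAMPLES


def stream_train_split():
    """Open the training split in streaming mode (requires network access).

    The import is done lazily so the helpers above work without `datasets`.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=True)
```

Calling `next(iter(stream_train_split()))` should yield one record without downloading the full split, and passing that record to `missing_fields` shows whether the listed fields are all present. From the quoted totals, `average_bytes_per_sample()` comes to roughly 300 bytes per sample.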
Provided by:
prakod



