prakod/gcm_enhi_filtred_500000
Source: Hugging Face. Updated 2024-08-16. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/prakod/gcm_enhi_filtred_500000
Description:
The dataset has six feature fields: idx (index), L1 (possibly the first language), L2 (possibly the second language), CM_candidates (candidate translations or transliterations), CM_candidates_transliterated_indictrans (the same candidates transliterated with the IndicTrans tool), and CMI_unicode_based_LID (a Unicode-based language identification score). These fields suggest the dataset is intended for translation, transliteration, or language identification tasks. It has a single training split of 1,166,201 samples, 349,822,384 bytes in total (about 350 MB).
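A minimal sketch of how the training split might be loaded and checked against the field names listed above. It assumes the third-party `datasets` library is installed and that the repository id matches the title of this entry. The helper names `load_train_split` and `missing_fields` are hypothetical, introduced only for illustration. Field types are not stated in the description, so the check covers names only.

```python
# Field names as listed in the description above.
EXPECTED_FIELDS = (
    "idx",
    "L1",
    "L2",
    "CM_candidates",
    "CM_candidates_transliterated_indictrans",
    "CMI_unicode_based_LID",
)


def missing_fields(record):
    """Return the expected field names that are absent from one record (a dict)."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def load_train_split(repo_id="prakod/gcm_enhi_filtred_500000", streaming=True):
    """Open the train split; streaming avoids downloading all ~350 MB up front.

    Assumption: repo_id is taken from the title of this entry.
    """
    # Imported lazily so the schema helper above works without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=streaming)
```

To try it, call `first = next(iter(load_train_split()))` and then `missing_fields(first)`, which returns an empty list if the first record carries every listed field. When going through a mirror such as the one in the download link, setting the `HF_ENDPOINT` environment variable to the mirror host before the import is the usual approach, though whether that is needed depends on your network.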
Provided by:
prakod



