prakod/gcm_enhi_filtred_500000

Name: prakod/gcm_enhi_filtred_500000
Creator: prakod
Published: 2024-08-16 17:18:16
License: No description available

Source: Hugging Face. Updated: 2024-08-16. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/prakod/gcm_enhi_filtred_500000

Description:

The dataset has several feature fields, including idx (index), L1 (possibly the first language), L2 (possibly the second language), CM_candidates (candidate translations or transliterations), CM_candidates_transliterated_indictrans (the candidates transliterated with the IndicTrans tool), and CMI_unicode_based_LID (a Unicode-based language identification score). These fields suggest the dataset is intended for translation, transliteration, or language identification tasks. It has a single training split of 1,166,201 samples, totaling 349,822,384 bytes.
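The listing does not say how the CMI_unicode_based_LID score is computed. As a rough illustration only, the sketch below assumes that "CMI" stands for the Code-Mixing Index and that "enhi" in the dataset name means English-Hindi, and it labels each token by the Unicode block of its characters. The function names `token_language` and `code_mixing_index` are hypothetical and are not taken from the dataset or from any published tool.

```python
def token_language(token: str) -> str:
    """Label a token 'hi', 'en', or 'other' from the Unicode ranges of its characters."""
    hi = en = 0
    for ch in token:
        if "\u0900" <= ch <= "\u097F":        # Devanagari block (assumed to mean Hindi)
            hi += 1
        elif ch.isascii() and ch.isalpha():    # basic Latin letters (assumed to mean English)
            en += 1
    if hi == 0 and en == 0:
        return "other"                         # digits, punctuation, other scripts
    return "hi" if hi >= en else "en"


def code_mixing_index(sentence: str) -> float:
    """Code-Mixing Index: 100 * (1 - max_lang_tokens / (n - u)), or 0 when n == u.

    n is the number of whitespace-separated tokens and u is the number of
    tokens with no language label. This is an assumed formulation, not
    necessarily the one used to build the dataset.
    """
    labels = [token_language(t) for t in sentence.split()]
    n = len(labels)
    u = labels.count("other")
    if n == u:
        return 0.0
    dominant = max(labels.count("hi"), labels.count("en"))
    return 100.0 * (1 - dominant / (n - u))


if __name__ == "__main__":
    print(code_mixing_index("I am going home"))       # monolingual English
    print(code_mixing_index("मैं school जा रहा हूँ"))    # one English token among Hindi
```

Under these assumptions a monolingual sentence scores 0 and the score rises as the two languages become more evenly mixed. Romanized Hindi would be counted as English by a purely script-based labeler, which may be why the dataset also carries a transliterated field.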

Provider:

prakod
