werty1248/sentence-transformer-parallel-En-Ko-with-Similarity
Source: Hugging Face. Updated: 2024-07-10. Indexed: 2024-07-22.
Download link:
https://hf-mirror.com/datasets/werty1248/sentence-transformer-parallel-En-Ko-with-Similarity
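Description:
This dataset is the En-Ko subset of the Parallel Sentences Datasets, intended mainly for translation tasks between English and Korean. It is preprocessed with embedding similarity to address the mismatched pairs that occur in large-scale translation-pair data. It comprises several subsets, such as parallel-sentences-global-voices and parallel-sentences-muse, each with its own preprocessing steps and characteristics. Similarity is measured with the BAAI/BGE-m3 model, and using pairs with a similarity above roughly 0.65 to 0.7 is recommended. Some of the data may be subject to licenses that restrict commercial use.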
This dataset contains English-Korean parallel sentence data, primarily for translation tasks. Each pair carries an embedding similarity computed during preprocessing, so that mismatched translation pairs can be filtered out. The dataset includes multiple subsets, such as global news articles from Global Voices and word translation data, each with specific characteristics and preprocessing steps.
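The recommended similarity cutoff can be applied with a simple filter. The following is a minimal sketch, not the dataset author's code: the field names `english`, `korean`, and `similarity` are assumptions about the schema, and the sample rows are made-up toy values used only to show the behavior.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def filter_by_similarity(rows: List[Row], threshold: float = 0.65,
                         key: str = "similarity") -> List[Row]:
    """Keep only the pairs whose similarity score is at or above the threshold.

    `key` is a hypothetical column name; adjust it to the real schema.
    """
    return [row for row in rows if float(row[key]) >= threshold]


# Toy rows for illustration only (not taken from the dataset).
sample_rows: List[Row] = [
    {"english": "Good morning.", "korean": "좋은 아침입니다.", "similarity": 0.91},
    {"english": "Thank you.", "korean": "감사합니다.", "similarity": 0.65},
    {"english": "See the table below.", "korean": "날씨가 좋네요.", "similarity": 0.42},
]

# 0.65 is the lower end of the recommended 0.65 to 0.7 range.
kept = filter_by_similarity(sample_rows, threshold=0.65)
for row in kept:
    print(row["english"], "|", row["korean"], "|", row["similarity"])
```

Raising the threshold toward 0.7 trades coverage for cleaner pairs. When loading the data with the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`, again assuming the similarity column is named as above.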
Provider:
werty1248



