SLPG/Biomedical_EN_FA_Corpus
Source: Hugging Face. Updated 2024-11-09. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/SLPG/Biomedical_EN_FA_Corpus
Description:
This is an English-French parallel corpus for machine translation in the biomedical domain. The data was scraped from Wikipedia for the French-English language pair and passed through two filtering stages. The first stage extracted parallel sentences at three similarity thresholds (90, 85, 80). The second stage applied a biomedical-domain filter, based on proximity to biomedical domain data, at three thresholds (20, 10, 0) to keep the data relevant to the domain. The dataset contains 6.3 million sentences in total and includes the Medline 20 test set; detailed sentence counts and test corpus information are provided with the dataset. It is intended to support research and development in biomedical machine translation, and can be used to train new models or improve existing ones for high-quality domain-specific translation.
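The two-stage filtering described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `SentencePair` class, its field names, the `filter_pairs` helper, and the toy data are all hypothetical. The sketch also assumes that both scores are "higher is better" and that a pair is kept when its score is greater than or equal to the threshold; the description does not state either point.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SentencePair:
    """One candidate pair. Field names are hypothetical, not the corpus schema."""
    en: str
    fr: str
    similarity: float    # assumed cross-lingual similarity score (0-100)
    domain_score: float  # assumed proximity to biomedical data (higher = closer)


def filter_pairs(
    pairs: Iterable[SentencePair],
    min_similarity: float,
    min_domain_score: float,
) -> List[SentencePair]:
    """Keep pairs that pass both thresholds (inclusive comparison is an assumption)."""
    return [
        p
        for p in pairs
        if p.similarity >= min_similarity and p.domain_score >= min_domain_score
    ]


# Toy data, invented purely for illustration.
toy_pairs = [
    SentencePair("The patient received insulin.",
                 "Le patient a reçu de l'insuline.", 93.0, 25.0),
    SentencePair("The tumor was benign.",
                 "La tumeur était bénigne.", 86.0, 12.0),
    SentencePair("The city lies on a river.",
                 "La ville se trouve sur un fleuve.", 91.0, 0.0),
    SentencePair("Aspirin reduces fever.",
                 "L'aspirine réduit la fièvre.", 81.0, 5.0),
]

# Enumerate the threshold combinations named in the description.
for sim in (90, 85, 80):
    for dom in (20, 10, 0):
        kept = filter_pairs(toy_pairs, sim, dom)
        print(f"similarity>={sim}, domain>={dom}: {len(kept)} pairs kept")
```

Stricter thresholds trade corpus size for cleaner, more domain-specific pairs, which is presumably why several threshold combinations are reported.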
Provider:
SLPG



