Multilingual Multi-domain Text Corpus
Shenzhen Data Property Rights Registration Service Center · Indexed 2025-09-06
Download link:
https://datareg.szdex.com/trade-frame-reg/#/trade-regportal/registration/detail?id=02024110416175043000000101001140
Resource description:
Shenyi Technology's Multilingual Domain-specific Bilingual Parallel Corpus is a Chinese-centered bilingual parallel dataset: every record is a one-to-one sentence pair between Chinese and a foreign language. It covers 57 languages in pairs such as Chinese-English, Chinese-Portuguese, and Chinese-Thai, and totals 4.6 billion sentence pairs. The domains covered include medical, legal, e-commerce, cultural tourism, finance, security, spoken language, and technology. On its own, the corpus can be used to develop models and products for machine translation, automatic classification, LLM alignment, and LLM tuning, improving their accuracy. Combined with technologies such as speech recognition and image recognition, it can also support multi-scenario applications such as speech translation and image translation, providing convenient functionality for work and daily life.
Provider:
深译信息科技(珠海)有限公司
Created:
2024-11-20
Collected summary
Dataset introduction

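Background and challenges
Background overview
The dataset contains 4.6 billion Chinese-centered bilingual parallel sentence pairs across 57 languages, covering 8 major domains such as medical and legal. It is suited to AI research and development needs such as machine translation and large-model training, and it supports cross-modal translation scenarios.
The summary above was collected and generated by 遇见数据集.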



