
agentlans/tatoeba-english-translations

Source: Hugging Face. Updated: 2024-10-12. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/agentlans/tatoeba-english-translations
Description:
This dataset is derived from the Tatoeba database and focuses on English sentences and their translations. The English sentences are scored with text quality, sentiment, and readability models, and the dataset is intended for tasks in multilingual text quality, readability, and sentiment analysis.

Each record contains a unique identifier, the English sentence, the identifier of the translation, the ISO code of the translation language, the translated sentence, and text quality, readability, and sentiment scores. The data is divided into multiple CSV files covering, respectively, the English and translated text, the text quality scores, the readability scores, and the sentiment scores. Each quality, readability, and sentiment file contains at most 50,000 rows, drawn from 25,000 pairs (2,500 from each of 10 bins) that are split into English and non-English entries.

The dataset was built from user-contributed translations in the Tatoeba database and annotated with specific models. Sentences may contain personal names and locations. The dataset can be used to improve machine translation systems and text analysis tools, which may help reduce language barriers and support cross-cultural communication. It may reflect biases present in the original Tatoeba database and in the assessment models. The quality, readability, and sentiment scores are model-generated and may not always match human judgments.
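The row counts described above (10 bins, 2,500 pairs per bin, 25,000 pairs, at most 50,000 rows) can be illustrated with a minimal Python sketch. This is not the dataset author's actual pipeline: the description does not say how the bins are defined, so equal-width score bins are an assumption here, and the field names (`english_id`, `translation_id`, `language`, `english`, `translation`, `score`) and both function names are hypothetical. Giving the non-English row the same score as its English sentence is also an assumption.

```python
import random
from collections import defaultdict


def sample_pairs_by_bin(pairs, n_bins=10, per_bin=2500, seed=0):
    """Pick up to `per_bin` pairs from each of `n_bins` equal-width score bins.

    Equal-width binning is an assumption; the source does not define the bins.
    """
    scores = [p["score"] for p in pairs]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all scores are equal
    bins = defaultdict(list)
    for p in pairs:
        # Clamp so the maximum score lands in the last bin.
        idx = min(int((p["score"] - lo) / width), n_bins - 1)
        bins[idx].append(p)
    rng = random.Random(seed)
    sampled = []
    for idx in sorted(bins):
        members = bins[idx]
        sampled.extend(rng.sample(members, min(per_bin, len(members))))
    return sampled


def split_pairs(sampled):
    """Turn each sampled pair into one English row and one non-English row."""
    rows = []
    for p in sampled:
        rows.append({"id": p["english_id"], "language": "eng",
                     "text": p["english"], "score": p["score"]})
        # Assumption: the translation row carries the English sentence's score.
        rows.append({"id": p["translation_id"], "language": p["language"],
                     "text": p["translation"], "score": p["score"]})
    return rows
```

With the default settings, 10 full bins yield 10 × 2,500 = 25,000 pairs and therefore 50,000 rows after splitting. A file ends up with fewer rows whenever a bin holds fewer than 2,500 pairs, which is consistent with the "at most 50,000 rows" wording.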
Provider:
agentlans