Stylometric Bias and Machine-Readable Voice in AI Speech Translation

DataONE | Updated 2025-05-22 | Indexed 2025-11-01
Download link:
https://search.dataone.org/view/sha256:b87810a60a52704644ae86a6ddb9213d4e1785764b4b31bf2baf3b6a118d5ec1
Description:
This study examines how stylistic regularities emerge in AI-generated speech-to-text translation, using the English output of the CoVoST corpus, a large-scale multilingual speech translation dataset. Without relying on human reference translations, we adopt a stylometric and machine learning approach to identify internal consistency bias (reduced lexical diversity, flattened syntactic structures, and repetitive discourse patterns) in English translations of Chinese spoken input. Stylometric features such as mean type-token ratio (MTTR), parse tree depth, modal verb frequency, and discourse marker usage are extracted from 16,899 English sentences produced by an automatic speech translation system. Using unsupervised clustering and SHAP-enhanced classifiers, we uncover latent stylistic archetypes and domain-invariant regularities that signal a translationese specific to speech-based neural MT systems. The results show that CoVoST's AI outputs exhibit a distinctive translation style: one that prioritizes simplicity and syntactic regularity at the expense of natural variation. The study contributes to Translation Studies by reorienting the concept of translationese away from human-versus-machine binaries and toward intra-AI stylometric drift. Our findings offer both empirical and theoretical insight into how speech-to-text AI systems reconfigure interlingual stylistic norms.
Created:
2025-10-29
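
The description names several surface-level stylometric features (mean type-token ratio, modal verb frequency, discourse marker usage). As a rough illustration of what such features look like in practice, the following is a minimal sketch using only the Python standard library. It is not the study's pipeline: the tokenizer, the per-sentence averaging used for the type-token ratio, the word lists `MODALS` and `DISCOURSE_MARKERS`, and the function names are all assumptions made for this example, since the description does not specify them. Parse tree depth is omitted because it requires a syntactic parser.

```python
import re
from statistics import mean

# Assumed word lists; the description does not say which items were counted.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
DISCOURSE_MARKERS = {"however", "therefore", "moreover", "so", "but", "then", "well"}


def tokenize(sentence):
    """Split a sentence into lowercase word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z']+", sentence.lower())


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def stylometric_features(sentences):
    """Compute three illustrative features over a list of English sentences.

    mean_ttr    : type-token ratio averaged over sentences (one possible
                  reading of "mean type-token ratio"; an assumption here)
    modal_rate  : share of all tokens that are modal verbs
    marker_rate : share of all tokens that are discourse markers
    """
    tokenized = [toks for toks in (tokenize(s) for s in sentences) if toks]
    all_tokens = [tok for toks in tokenized for tok in toks]
    total = len(all_tokens)
    if total == 0:
        return {"mean_ttr": 0.0, "modal_rate": 0.0, "marker_rate": 0.0}
    return {
        "mean_ttr": mean(type_token_ratio(toks) for toks in tokenized),
        "modal_rate": sum(tok in MODALS for tok in all_tokens) / total,
        "marker_rate": sum(tok in DISCOURSE_MARKERS for tok in all_tokens) / total,
    }


if __name__ == "__main__":
    sample = ["I can go", "But I can go go"]
    print(stylometric_features(sample))
```

Feature vectors of this kind, computed per sentence or per document, are the sort of input that clustering or classification steps like those mentioned in the description would typically consume.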