Stylometric Patterns in Human Translation of Spoken Chinese to Written English: A Corpus Analysis of the CoVoST Dataset

DataONE | Updated 2026-01-25 | Indexed 2026-02-07

Download link:

https://search.dataone.org/view/sha256:a030ebeefe1c31f0b9d625fbb1eb0063263ec8d1981a75634dab413e7ae148a0

Resource description:

This study explores how stylistic regularities emerge in AI-generated speech-to-text translations by examining the English output of the CoVoST corpus, a large-scale multilingual dataset of voice-translated speech. Without relying on human reference translations, we adopt a stylometric and machine learning approach to identify internal consistency bias—manifested as reduced lexical diversity, flattened syntactic structures, and repetitive discourse patterns—in English translations of Chinese spoken input. Stylometric features such as mean type-token ratio (MTTR), parse tree depth, modal verb frequency, and discourse marker usage are extracted from 16,899 English sentences produced by an automatic speech translation system. Using unsupervised clustering and SHAP-enhanced classifiers, we uncover latent stylistic archetypes and domain-invariant regularities that signal translationese unique to speech-based neural MT systems. The results demonstrate that CoVoST's AI outputs exhibit a distinctive translation style: one that prioritizes simplicity and syntactic regularity at the expense of natural variation. This study contributes to Translation Studies by reorienting the concept of translationese away from human-vs-machine binaries, toward intra-AI stylometric drift. Our findings offer both empirical and theoretical insight into how speech-to-text AI systems reconfigure interlingual stylistic norms.
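As a rough illustration of the kind of surface features the description names, the following Python sketch computes a windowed mean type-token ratio, a modal verb rate, and a discourse marker rate for a list of English sentences. It is a minimal sketch under stated assumptions, not the study's pipeline: the word lists, the window size, the reading of "mean type-token ratio" as an average over fixed-size token windows, and all function names are hypothetical. Parse tree depth is omitted because it requires a syntactic parser.

```python
import re

# Illustrative word lists (assumptions, not taken from the study).
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
DISCOURSE_MARKERS = {"however", "so", "well", "but", "then", "therefore", "anyway"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_ttr(tokens, window=50):
    """Average TTR over consecutive, non-overlapping windows of tokens.

    The final window may be shorter than `window`; it is still included.
    """
    if not tokens:
        return 0.0
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


def stylometric_features(sentences, window=50):
    """Return a small dict of surface features for a list of sentences."""
    tokens = [tok for sentence in sentences for tok in tokenize(sentence)]
    total = len(tokens)
    return {
        "mean_ttr": mean_ttr(tokens, window),
        "modal_rate": sum(t in MODALS for t in tokens) / total if total else 0.0,
        "marker_rate": sum(t in DISCOURSE_MARKERS for t in tokens) / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = ["You should go now.", "However, you can stay."]
    print(stylometric_features(sample))
```

Feature dictionaries of this shape could then be assembled into a matrix for clustering or classification; which tools the study used for those steps beyond what the description states is not known here.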

Created:

2026-01-27