Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension

Name: Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension
Creator: Mendeley Data
License: No description available

Source: doi.org, indexed 2025-01-16

Download link:

http://doi.org/10.17632/ygsyczp8vr.1

Description:

The file webresults_cloze_publication.xlsx contains two types of data for the same stimuli: a) transcripts of think-aloud protocols and b) responses collected in a web-based intercomprehension experiment.

Part a) Three Polish stimulus sentences were presented to pairs of Czech native speakers in an experimental setting where both participants saw the stimulus sentence on their computer screens. Placed in different rooms, they were asked to communicate over Skype and work together to come up with a good Czech translation of the sentence. The output of the experiment therefore consists of audio recordings of the two participants trying to decode the stimuli, together with the written translations they entered during the experiment. The transcripts are in sheets 1, 3, and 5 of the .xlsx file.

Part b) Czech readers (n=23) were asked to translate certain words or phrases within Polish sentences (those that turned out to be problematic in part a) into Czech in a web-based translation experiment with a cloze-task design, run on the website http://intercomprehension.coli.uni-saarland.de/en/. The responses of part b) and the corresponding sociodemographic data are in sheets 2, 4, and 6 of the .xlsx file. The responses were checked manually for correctness. Responses with typos were counted as correct, because the main interest was to find out whether respondents had understood the stimuli. The column "Total Time Spent (ms)" is the time respondents spent entering their response into the gaps in the cloze test, up to the moment they pressed Enter.

The file surprisal_scores_CS_LM.txt contains surprisal scores obtained from a statistical trigram language model with Kneser-Ney smoothing trained on a Czech corpus (Czech part of InterCorp merged with the Czech part of the Russian National Corpus, size: 175,190 words).
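
To illustrate what a trigram surprisal score expresses, here is a minimal Python sketch. It is not the model used to produce surprisal_scores_CS_LM.txt: it swaps in add-one smoothing for Kneser-Ney smoothing to stay short, it assumes surprisal is measured in bits (log base 2), which the description does not state, and the toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over padded token lists."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts, vocab


def surprisal(word, context, trigrams, contexts, vocab):
    """Return -log2 P(word | context) under add-one smoothing.

    Assumption: add-one smoothing stands in for Kneser-Ney here.
    """
    prob = (trigrams[context + (word,)] + 1) / (contexts[context] + len(vocab))
    return -math.log2(prob)


# Hypothetical toy corpus; the real model was trained on a Czech corpus.
toy_corpus = [
    ["to", "je", "dobrý", "den"],
    ["to", "je", "dobrý", "nápad"],
]
trigrams, contexts, vocab = train_trigram_counts(toy_corpus)

seen = surprisal("den", ("je", "dobrý"), trigrams, contexts, vocab)
unseen = surprisal("to", ("je", "dobrý"), trigrams, contexts, vocab)
print(f"seen continuation:   {seen:.2f} bits")
print(f"unseen continuation: {unseen:.2f} bits")
```

A word that the model finds less predictable in its context receives a higher surprisal, so the unseen continuation scores higher than the seen one.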

Provider:

Mendeley Data
