Spontaneous Dialogues in L1 English
Source: DataCite Commons. Updated: 2026-03-11. Indexed: 2026-05-04.
Download link:
https://www.ortolang.fr/market/item/cles-en_corpus/v2
Description:
This corpus of spontaneous L1 English comprises recordings of US university students engaging in a 10-minute role-play in which two candidates hold an argumentative discussion on a contentious topic. Each candidate was assigned a specific role, arguing either for (role A) or against (role B) the subject. Up to 10 minutes of preparation were allowed before the talk. Participants could take notes, but reading during the conversation was prohibited. Their objective was to negotiate, exchange viewpoints, and eventually work towards a compromise. This oral interaction task is inspired by that of the CLES English certification exam (CLES B2).

A similar corpus of L2 English from French university students and another corpus of L2 English from Japanese university students are also available.

### Corpus Access
This corpus is available for academic research purposes only. Please request access by contacting the authors.

### Contents
The corpus currently comprises 2 hours of speech (14 speakers, 7 recordings). In each recording, the role A student is on the right of the microphone and the role B student is on the left. See the recordings.csv and speakers.csv metadata files for more information.

### Date and place of recording
February 2024 at Doshisha University, Kyoto, Japan

### Recording info
- Recorder: Zoom Handy Recorder H2n
- Sampling rate: 44.1 kHz
- Format: 16-bit PCM, stereo
- Bit rate: 1411 kb/s

### Automatic Annotations of the Corpus
Each recording comes with a TextGrid file containing the speaker segmentation. This annotation was produced automatically with the Pyannote Speaker Diarization Toolkit, then manually checked.

In addition, the speech segments from the TextGrid file were automatically annotated using the Pause and Lexical Stress Processing Pipeline (PLSPP), which includes:
- automated speech recognition and word-level alignment (WhisperX)
- syllable nuclei detection (De Jong et al. 2021)
- part-of-speech tagging (spaCy)
- constituency analysis (Berkeley Neural Parser)
- pause position analysis
- lexical stress annotation of polysyllabic words

### Preprocessing
Sound calibration: the mean square energy was harmonized across recordings by raising the volume while avoiding clipping.

### Authors
- Sylvain Coulange (Grenoble Alpes University)
- Takayuki Konishi (Waseda University)
- Tsuneo Kato (Doshisha University)
- Mariko Sugahara (Doshisha University)
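The sound calibration step mentioned under Preprocessing is described only in one line, so the following is a minimal sketch of what such a step could look like, not the authors' actual procedure. It assumes samples are floats in the range [-1.0, 1.0]; the function names, the `target_ms` level, and the `peak_limit` parameter are all hypothetical. The gain that would bring a recording to the target mean square energy is capped so that the loudest sample never exceeds the peak limit.

```python
import math


def mean_square(samples):
    """Mean square energy of a sequence of float samples."""
    return sum(s * s for s in samples) / len(samples)


def calibration_gain(samples, target_ms, peak_limit=1.0):
    """Gain that moves the mean square energy towards target_ms.

    The gain is capped so that the largest absolute sample stays at or
    below peak_limit (hypothetical parameter), which avoids clipping.
    """
    ms = mean_square(samples)
    if ms == 0.0:
        return 1.0  # silent input: nothing to scale
    wanted = math.sqrt(target_ms / ms)      # gain needed to reach the target
    peak = max(abs(s) for s in samples)
    ceiling = peak_limit / peak             # largest gain that does not clip
    return min(wanted, ceiling)


def calibrate(samples, target_ms, peak_limit=1.0):
    """Return a scaled copy of samples (assumed floats in [-1.0, 1.0])."""
    gain = calibration_gain(samples, target_ms, peak_limit)
    return [s * gain for s in samples]


# Usage: a quiet signal is raised to the (hypothetical) target level.
quiet = [0.1, -0.2, 0.1, -0.1]
louder = calibrate(quiet, target_ms=0.04)
```

When the target cannot be reached without clipping, this sketch settles for the largest safe gain, so a recording with a strong isolated peak may end up below the shared target level.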
Provider:
ORTOLANG (Open Resources and TOols for LANGuage) - www.ortolang.fr
Created:
2026-03-11



