ETS Corpus of Non-Native Written English

Name: ETS Corpus of Non-Native Written English
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:26:20
License: No description available

DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2014T06

Resource description:

<h3>Introduction</h3><br>
<p>ETS Corpus of Non-Native Written English was developed by <a href="https://www.ets.org/" rel="nofollow">Educational Testing Service</a> and comprises 12,100 English essays written by native speakers of 11 non-English languages as part of <a href="http://www.ets.org/toefl/ibt/about" rel="nofollow">TOEFL</a> (Test of English as a Foreign Language), an international test of academic English proficiency. The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages, sampled from eight topics, with the score level (low/medium/high) given for each essay.</p><br>
<p>The corpus was developed with the specific task of native language identification in mind, but it is also likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, as well as a broad range of research in natural language processing and corpus linguistics. For native language identification, the following division is recommended: 82% as training data, 9% as development data, and 9% as test data, split according to the file IDs accompanying the data set.</p><br>
<h3>Data</h3><br>
<p>The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The essays are provided in both original (raw) and tokenized forms as UTF-8 encoded text files. Also included are the prompts (topics) for the essays and metadata about the test takers' proficiency level.</p><br>
<h3>Samples</h3><br>
<p>Please view these <a href="desc/addenda/LDC2014T06.orig.txt" rel="nofollow">original</a> and <a href="desc/addenda/LDC2014T06.token.txt" rel="nofollow">tokenized</a> samples.</p><br>
<h3>Updates</h3><br>
<p>In July 2014, 1,100 files were added to the corpus, bringing the total number of tokenized and original files to 12,100. All copies distributed after that date contain the full data set.</p><br>
Portions © 2014 Educational Testing Service, © 2014 Trustees of the University of Pennsylvania
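
The corpus size and the recommended division for native language identification can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the `split_of` mapping (file ID to split name) are assumptions made for this example and are not part of the LDC release, whose actual file-ID format is not described here. The 82/9/9 figures are rounded percentages, so the computed sizes are nominal; the file IDs shipped with the data set remain the authoritative definition of each split.

```python
from __future__ import annotations

from collections import defaultdict

# The 11 native languages listed in the corpus description.
LANGUAGES = [
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
]
ESSAYS_PER_LANGUAGE = 1_100


def expected_total() -> int:
    """Total essay count implied by 1,100 essays for each of 11 languages."""
    return len(LANGUAGES) * ESSAYS_PER_LANGUAGE


def nominal_split_sizes(total: int) -> dict[str, int]:
    """Sizes implied by the rounded 82/9/9 percentages (integer arithmetic).

    These are indicative only; the real split is defined by file IDs.
    """
    return {
        "train": total * 82 // 100,
        "dev": total * 9 // 100,
        "test": total * 9 // 100,
    }


def group_by_split(split_of: dict[str, str]) -> dict[str, list[str]]:
    """Group file IDs by split name.

    `split_of` is a hypothetical mapping from file ID to split name that the
    caller builds from the ID lists accompanying the data set.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for file_id, split in sorted(split_of.items()):
        groups[split].append(file_id)
    return dict(groups)
```

Comparing the lengths of the groups returned by `group_by_split` with `nominal_split_sizes(expected_total())` gives a quick check that a local copy contains the full data set and that the ID lists were read correctly.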

Provider:

Linguistic Data Consortium

Created:

2020-11-30

Compiled Summary

Dataset Introduction