bea2019st/wi_locness
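Dataset Card for Cambridge English Write & Improve + LOCNESS Dataset
Dataset Description
Dataset Summary
Write & Improve (Yannakoudakis et al., 2018) is an online web platform that helps non-native English students improve their writing. Specifically, students from around the world submit letters, stories, articles and essays on a variety of topics, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned each a CEFR level.
The LOCNESS corpus (Granger, 1998) consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Since native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subset of LOCNESS so that researchers can test the effectiveness of their systems across the full range of English levels and abilities.
Supported Tasks and Leaderboards
Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used not only to help language learners improve their writing skills, but also to alert native speakers to accidental mistakes or typos.
The aim of this dataset is to correct all types of errors in written text, including grammatical, lexical and orthographic errors.
The latest leaderboard and submission information can be found at the following Codalab competition: https://competitions.codalab.org/competitions/20228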
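The shared task paper cited in this card reports system scores as ERRANT F0.5, which weights precision twice as heavily as recall. The sketch below only shows the arithmetic of that score; the function name `f_beta` is hypothetical, and the true-positive, false-positive and false-negative edit counts are assumed to come from an external scorer such as ERRANT, which is not reimplemented here.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from edit counts (hypothetical helper, not the official scorer).

    tp: proposed edits that match a reference edit
    fp: proposed edits that match no reference edit
    fn: reference edits the system missed
    """
    # Assumption: with no proposed (or no reference) edits, precision
    # (or recall) is treated as 1.0 rather than left undefined.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# 8 correct edits, 2 spurious edits, 8 missed edits:
# precision = 0.8, recall = 0.5
print(round(f_beta(8, 2, 8), 4))  # 0.7143
```

With `beta=0.5`, a system that proposes few but mostly correct edits scores higher than one that finds the same number of errors while also making many spurious changes.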
Languages
The dataset is in English.
Dataset Structure
Data Instances
An example from the wi configuration:
```json
{
  "id": "1-140178",
  "userid": "21251",
  "cefr": "A2.i",
  "text": "My town is a medium size city with eighty thousand inhabitants. It has a high density population because its small territory. Despite of it is an industrial city, there are many shops and department stores. I recommend visiting the artificial lake in the certer of the city which is surrounded by a park. Pasteries are very common and most of them offer the special dessert from the city. There are a comercial zone along the widest street of the city where you can find all kind of establishments: banks, bars, chemists, cinemas, pet shops, restaurants, fast food restaurants, groceries, travel agencies, supermarkets and others. Most of the shops have sales and offers at least three months of the year: January, June and August. The quality of the products and services are quite good, because there are a huge competition, however I suggest you taking care about some fakes or cheats.",
  "edits": {
    "start": [13, 77, 104, 126, 134, 256, 306, 375, 396, 402, 476, 484, 579, 671, 774, 804, 808, 826, 838, 850, 857, 862, 868],
    "end": [24, 78, 104, 133, 136, 262, 315, 379, 399, 411, 480, 498, 588, 671, 777, 807, 810, 835, 845, 856, 861, 867, 873],
    "text": ["medium-sized", "-", " of", "Although", "", "center", null, "of", "is", "commercial", "kinds", "businesses", "grocers", " in", "is", "is", "", ". However,", "recommend", "be", "careful", "of", ""]
  }
}
```
An example from the locness configuration:
```json
{
  "id": "7-5819177",
  "cefr": "N",
  "text": "Boxing is a common, well known and well loved sport amongst most countries in the world however it is also punishing, dangerous and disliked to the extent that many people want it banned, possibly with good reason. Boxing is a dangerous sport, there are relatively common deaths, tragic injuries and even disease. All professional boxers are at risk from being killed in his next fight. If not killed then more likely paralysed. There have been a number of cases in the last ten years of the top few boxers having tragic losses throughout their ranks. This is just from the elite few, and theres more from those below them. More deaths would occur through boxing if it were banned. The sport would go underground, there would be no safety measures like gloves, a doctor, paramedics or early stopping of the fight if someone looked unable to continue. With this going on the people taking part will be dangerous, and on the streets. Dangerous dogs who were trained to kill and maim in similar underound dog fights have already proved deadly to innocent people, the new boxers could be even more at risk. Once boxing is banned and no-one grows up knowing it as acceptable there will be no interest in boxing and hopefully less all round interest in violence making towns and cities much safer places to live in, there will be less fighting outside pubs and clubs and less violent attacks with little or no reason. change the rules of boxing slightly would much improve the safety risks of the sport and not detract form the entertainment. There are all sorts of proposals, lighter and more cushioning gloves could be worn, ban punches to the head, headguards worn or make fights shorter, as most of the serious injuries occur in the latter rounds, these would all show off the boxers skill and tallent and still be entertaining to watch. 
Even if a boxer is a success and manages not to be seriously hurt he still faces serious consequences in later life diseases that attack the brains have been known to set in as a direct result of boxing, even Muhamed Ali, who was infamous(?) both for his boxing and his quick-witted intelligence now has Alzheimer disease and can no longer do many everyday acts. Many other sports are more dangerous than boxing, motor sports and even mountaineering has risks that are real. Boxers chose to box, just as racing drivers drive.",
  "edits": {
    "start": [24, 39, 52, 87, 242, 371, 400, 528, 589, 713, 869, 992, 1058, 1169, 1209, 1219, 1255, 1308, 1386, 1412, 1513, 1569, 1661, 1731, 1744, 1781, 1792, 1901, 1951, 2038, 2131, 2149, 2247, 2286],
    "end": [25, 40, 59, 95, 249, 374, 400, 538, 595, 713, 869, 1001, 1063, 1169, 1209, 1219, 1255, 1315, 1390, 1418, 1517, 1570, 1661, 1737, 1751, 1781, 1799, 1901, 1960, 2044, 2131, 2149, 2248, 2289],
    "text": ["-", "-", "in", ". However,", ". There", "their", ",", "among", "theres", " and", ",", "underground", ". The", ",", ",", ",", ",", ". There", "for", "Changing", "from", ";", ",", "later", ". These", "", "talent", ",", ". Diseases", ". Even", ",", "s", ";", "have"]
  }
}
```
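To make the offset layout in the examples above concrete, the following sketch pairs each edit with the span of the original text it covers. It assumes, based on the examples, that `start` and `end` are character offsets into `text` with `end` exclusive; the helper name `edit_spans` is hypothetical and not part of the dataset or any library.

```python
from typing import Any, Dict, List, Optional, Tuple


def edit_spans(example: Dict[str, Any]) -> List[Tuple[str, Optional[str]]]:
    """Return (original span, correction) pairs for one dataset example.

    Assumes edits["start"], edits["end"] and edits["text"] are parallel
    lists and that offsets index characters of example["text"].
    """
    text = example["text"]
    edits = example["edits"]
    return [
        (text[start:end], correction)
        for start, end, correction in zip(edits["start"], edits["end"], edits["text"])
    ]


# The first two sentences of the wi example, with their first three edits.
sample = {
    "text": (
        "My town is a medium size city with eighty thousand inhabitants. "
        "It has a high density population because its small territory."
    ),
    "edits": {
        "start": [13, 77, 104],
        "end": [24, 78, 104],
        "text": ["medium-sized", "-", " of"],
    },
}

for original, correction in edit_spans(sample):
    print(repr(original), "->", repr(correction))
# 'medium size' -> 'medium-sized'
# ' ' -> '-'
# '' -> ' of'
```

An edit whose `start` equals its `end` covers an empty span, so it is a pure insertion, as in the third pair.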
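Data Fields
The fields of the dataset are:
- id: the id of the text, as a string
- cefr: the CEFR level of the text, as a string
- userid: the id of the user
- text: the text of the submission, as a string
- edits: the edits from W&I:
  - start: the start index of each edit, as a list of integers
  - end: the end index of each edit, as a list of integers
  - text: the text content of each edit, as a list of strings
  - from: the original text of each edit, as a list of strings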
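Given these fields, a corrected version of a submission can be produced by splicing each edit into the text. The sketch below is one way to do that; the function name `apply_edits` is hypothetical, and it makes two assumptions that the card does not state: edits are sorted and non-overlapping, and an edit whose text is missing (`None`) carries no correction, so the original span is kept.

```python
from typing import List, Optional


def apply_edits(
    text: str,
    starts: List[int],
    ends: List[int],
    corrections: List[Optional[str]],
) -> str:
    """Replace each [start, end) character span of `text` with its correction."""
    pieces = []
    cursor = 0  # position in `text` up to which output has been produced
    for start, end, correction in zip(starts, ends, corrections):
        if start < cursor or end < start:
            raise ValueError("edits must be sorted and non-overlapping")
        pieces.append(text[cursor:start])  # untouched text before the edit
        if correction is None:
            # Assumption: no correction supplied, so keep the original span.
            pieces.append(text[start:end])
        else:
            pieces.append(correction)
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last edit
    return "".join(pieces)


source = (
    "My town is a medium size city with eighty thousand inhabitants. "
    "It has a high density population because its small territory."
)
print(apply_edits(source, [13, 77, 104], [24, 78, 104], ["medium-sized", "-", " of"]))
# My town is a medium-sized city with eighty thousand inhabitants. It has a high-density population because of its small territory.
```

Walking the text once with a cursor avoids the offset drift that would occur if each replacement were applied to an already modified string.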
Data Splits
| name | train | validation |
|---|---|---|
| wi | 3000 | 300 |
| locness | N/A | 50 |
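Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
Write & Improve License: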
Cambridge English Write & Improve (CEWI) Dataset Licence Agreement
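1. By downloading this dataset and licence, this licence agreement is entered into, effective as of the date of download, between you, the Licensee, and the University of Cambridge, the Licensor.
2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
3. The Licensor hereby grants the Licensee a non-exclusive, non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
4. Non-commercial purposes exclude any use of the licensed dataset, or of information derived from the dataset, for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
5. The Licensee shall acknowledge use of the licensed dataset in all publications based on it, through citation of the following publication:
Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education.
6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
7. The Licensor grants the Licensee this right to use the licensed dataset "as is". The Licensor makes no express or implied warranties, representations or endorsements of any kind.
8. This Agreement shall be governed by and construed in accordance with the laws of England, and the English courts shall have exclusive jurisdiction.
LOCNESS License: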
LOCNESS Dataset Licence Agreement
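1. The corpus is to be used for non-commercial purposes only.
2. All publications on research partly or wholly based on the corpus should give credit to the Centre for English Corpus Linguistics (CECL), University of Louvain. A scanned copy or offprint of the publication should also be sent to sylviane.granger@uclouvain.be.
3. No part of the corpus is to be distributed to a third party without specific authorization from CECL. The corpus can only be used by the person agreeing to the licence terms, and by researchers working in close collaboration with that person or students under that person's supervision, attached to the same institution, within the framework of the research project.
Citation Information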
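@inproceedings{bryant-etal-2019-bea,
    title = "The {BEA}-2019 Shared Task on Grammatical Error Correction",
    author = "Bryant, Christopher and Felice, Mariano and Andersen, {\O}istein E. and Briscoe, Ted",
    booktitle = "Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4406",
    doi = "10.18653/v1/W19-4406",
    pages = "52--75",
    abstract = "This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC). As with the CoNLL-2014 shared task, participants are required to correct all types of errors in test data. One of the main contributions of the BEA-2019 shared task is the introduction of a new dataset, the Write{\&}Improve+LOCNESS corpus, which represents a wider range of English levels and abilities. Another contribution is the introduction of tracks, which control the amount of annotated data available to participants. Systems are evaluated in terms of ERRANT F{\_}0.5, which allows us to report a much wider range of performance statistics. The competition was hosted on Codalab and remains open for further submissions on the blind test set.",
}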
Contributions
Thanks to @aseifert for adding this dataset.




