bea2019st/wi_locness

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-06-15.
Download link:
https://hf-mirror.com/datasets/bea2019st/wi_locness
Resource overview:
The Cambridge English Write & Improve + LOCNESS dataset is an English-language dataset for grammatical error correction. Write & Improve is an online platform that helps non-native English-speaking students improve their writing: students submit their texts, the system returns instant feedback, and human annotators have manually annotated some of the submissions and assigned them CEFR levels. The LOCNESS corpus contains essays written by native English-speaking students; W&I annotators annotated part of it so that researchers can test their systems across different English levels and abilities. The dataset supports correcting grammatical, lexical, and spelling errors in text. It has two configurations, `wi` and `locness`, one for each data source.
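Provider:
bea2019st
Original information summary

Dataset card: Cambridge English Write & Improve + LOCNESS dataset

Dataset description

Dataset summary

Write & Improve (Yannakoudakis et al., 2018) is an online web platform that helps non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles, and essays on a variety of topics, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level.

The LOCNESS corpus (Granger, 1998) consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Because native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subset of LOCNESS so that researchers can test the effectiveness of their systems across the full range of English levels and abilities.

Supported tasks and leaderboards

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text, for example [I follows his advices -> I followed his advice]. It can be used not only to help language learners improve their writing skills, but also to alert native speakers to accidental mistakes or typos.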
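To make the [I follows his advices -> I followed his advice] example concrete, the sketch below recovers the changed tokens from a source sentence and its correction. It uses Python's standard `difflib` as a stand-in aligner; this is a simplification for illustration, not the alignment used by the shared task's tooling, and the function name `token_edits` is hypothetical.

```python
from difflib import SequenceMatcher


def token_edits(source: str, corrected: str) -> list[tuple[str, str]]:
    """Return (original, replacement) token spans that differ between two sentences."""
    src, tgt = source.split(), corrected.split()
    # Align the two whitespace-tokenized sentences.
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted spans
            edits.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


print(token_edits("I follows his advices", "I followed his advice"))
# [('follows', 'followed'), ('advices', 'advice')]
```

An empty string on the left of a pair indicates an insertion, and an empty string on the right indicates a deletion.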

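The aim of this dataset is to correct all types of errors in written text, including grammatical, lexical, and spelling errors.

The latest leaderboard and submission information can be found at the following Codalab competition: https://competitions.codalab.org/competitions/20228

Languages

The dataset is in English.

Dataset structure

Data instances

An example from the wi configuration: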

{ "id": "1-140178", "userid": "21251", "cefr": "A2.i", "text": "My town is a medium size city with eighty thousand inhabitants. It has a high density population because its small territory. Despite of it is an industrial city, there are many shops and department stores. I recommend visiting the artificial lake in the certer of the city which is surrounded by a park. Pasteries are very common and most of them offer the special dessert from the city. There are a comercial zone along the widest street of the city where you can find all kind of establishments: banks, bars, chemists, cinemas, pet shops, restaurants, fast food restaurants, groceries, travel agencies, supermarkets and others. Most of the shops have sales and offers at least three months of the year: January, June and August. The quality of the products and services are quite good, because there are a huge competition, however I suggest you taking care about some fakes or cheats.", "edits": { "start": [13, 77, 104, 126, 134, 256, 306, 375, 396, 402, 476, 484, 579, 671, 774, 804, 808, 826, 838, 850, 857, 862, 868], "end": [24, 78, 104, 133, 136, 262, 315, 379, 399, 411, 480, 498, 588, 671, 777, 807, 810, 835, 845, 856, 861, 867, 873], "text": ["medium-sized", "-", " of", "Although", "", "center", null, "of", "is", "commercial", "kinds", "businesses", "grocers", " in", "is", "is", "", ". However,", "recommend", "be", "careful", "of", ""] } }

An example from the locness configuration:

{ "id": "7-5819177", "cefr": "N", "text": "Boxing is a common, well known and well loved sport amongst most countries in the world however it is also punishing, dangerous and disliked to the extent that many people want it banned, possibly with good reason. Boxing is a dangerous sport, there are relatively common deaths, tragic injuries and even disease. All professional boxers are at risk from being killed in his next fight. If not killed then more likely paralysed. There have been a number of cases in the last ten years of the top few boxers having tragic losses throughout their ranks. This is just from the elite few, and theres more from those below them. More deaths would occur through boxing if it were banned. The sport would go underground, there would be no safety measures like gloves, a doctor, paramedics or early stopping of the fight if someone looked unable to continue. With this going on the people taking part will be dangerous, and on the streets. Dangerous dogs who were trained to kill and maim in similar underound dog fights have already proved deadly to innocent people, the new boxers could be even more at risk. Once boxing is banned and no-one grows up knowing it as acceptable there will be no interest in boxing and hopefully less all round interest in violence making towns and cities much safer places to live in, there will be less fighting outside pubs and clubs and less violent attacks with little or no reason. change the rules of boxing slightly would much improve the safety risks of the sport and not detract form the entertainment. There are all sorts of proposals, lighter and more cushioning gloves could be worn, ban punches to the head, headguards worn or make fights shorter, as most of the serious injuries occur in the latter rounds, these would all show off the boxers skill and tallent and still be entertaining to watch. 
Even if a boxer is a success and manages not to be seriously hurt he still faces serious consequences in later life diseases that attack the brains have been known to set in as a direct result of boxing, even Muhamed Ali, who was infamous(?) both for his boxing and his quick-witted intelligence now has Alzheimer disease and can no longer do many everyday acts. Many other sports are more dangerous than boxing, motor sports and even mountaineering has risks that are real. Boxers chose to box, just as racing drivers drive.", "edits": { "start": [24, 39, 52, 87, 242, 371, 400, 528, 589, 713, 869, 992, 1058, 1169, 1209, 1219, 1255, 1308, 1386, 1412, 1513, 1569, 1661, 1731, 1744, 1781, 1792, 1901, 1951, 2038, 2131, 2149, 2247, 2286], "end": [25, 40, 59, 95, 249, 374, 400, 538, 595, 713, 869, 1001, 1063, 1169, 1209, 1219, 1255, 1315, 1390, 1418, 1517, 1570, 1661, 1737, 1751, 1781, 1799, 1901, 1960, 2044, 2131, 2149, 2248, 2289], "text": ["-", "-", "in", ". However,", ". There", "their", ",", "among", "theres", " and", ",", "underground", ". The", ",", ",", ",", ",", ". There", "for", "Changing", "from", ";", ",", "later", ". These", "", "talent", ",", ". Diseases", ". Even", ",", "s", ";", "have"] } }

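Data fields

The fields of the dataset are:

  • id: the ID of the text, as a string
  • cefr: the CEFR level of the text, as a string
  • userid: the ID of the user
  • text: the text of the submission, as a string
  • edits: the edits from W&I:
    • start: the start index of each edit, as a list of integers
    • end: the end index of each edit, as a list of integers
    • text: the text content of each edit, as a list of strings
    • from: the original text of each edit, as a list of strings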
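The edits fields can be applied to a text to obtain its corrected version. The sketch below is one way to do that, under two assumptions: `start` and `end` are character offsets with an exclusive end (in the wi example, `text[13:24]` is "medium size"), and an edit whose replacement is missing (`None`) carries no correction and is skipped. The function name `apply_edits` is hypothetical, and the sample is an excerpt of the first two sentences of the wi example with its first three edits.

```python
def apply_edits(text, starts, ends, replacements):
    """Apply character-span edits to text and return the corrected string."""
    # Work from right to left so earlier offsets stay valid after each change.
    edits = sorted(zip(starts, ends, replacements),
                   key=lambda e: (e[0], e[1]), reverse=True)
    for start, end, replacement in edits:
        if replacement is None:
            continue  # assumption: no correction was supplied for this span
        text = text[:start] + replacement + text[end:]
    return text


sample = {
    "text": "My town is a medium size city with eighty thousand inhabitants. "
            "It has a high density population because its small territory.",
    "edits": {"start": [13, 77, 104], "end": [24, 78, 104],
              "text": ["medium-sized", "-", " of"]},
}

corrected = apply_edits(sample["text"], sample["edits"]["start"],
                        sample["edits"]["end"], sample["edits"]["text"])
print(corrected)
# My town is a medium-sized city with eighty thousand inhabitants.
# It has a high-density population because of its small territory.
# (printed as a single line)
```

An edit with `start == end`, such as the third one here, is a pure insertion at that position.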

Data splits

Name      Train   Validation
wi        3000    300
locness   N/A     50

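Dataset creation

Curation rationale

[More Information Needed]

Source data

Initial data collection and normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and sensitive information

[More Information Needed]

Considerations for using the data

Social impact of the dataset

[More Information Needed]

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]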

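Licensing information

Write & Improve licence:

Cambridge English Write & Improve (CEWI) Dataset Licence Agreement

  1. By downloading this dataset and licence, this licence agreement is entered into, effective on the date of download, between you, the Licensee, and the University of Cambridge, the Licensor.

  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.

  3. The Licensor hereby grants the Licensee a non-exclusive, non-transferable right to use the licensed dataset for non-commercial research and educational purposes.

  4. Non-commercial purposes exclude any use of the dataset, or of information derived from the dataset, for or as part of a product or service that is sold, offered for sale, licensed, leased, or rented.

  5. The Licensee shall acknowledge use of the licensed dataset in all publications based on it by citing the following publication:

    Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education.

  6. The Licensee may publish excerpts of fewer than 100 words from the dataset pursuant to clause 3.

  7. The Licensor grants the Licensee this right to use the licensed dataset "as is". The Licensor makes no express or implied warranties, representations, or endorsements of any kind.

  8. This Agreement shall be governed by and construed in accordance with the laws of England, and the English courts shall have exclusive jurisdiction.

LOCNESS licence:

LOCNESS Dataset Licence Agreement

  1. The corpus is to be used for non-commercial purposes only.

  2. All publications of research based partly or wholly on the corpus should give credit to the Centre for English Corpus Linguistics (CECL), University of Louvain. A scanned copy or offprint of the publication should also be sent to sylviane.granger@uclouvain.be

  3. No part of the corpus may be distributed to a third party without specific authorization from CECL. The corpus may be used only by the person agreeing to the licence terms, and by researchers working in close collaboration with that person or students under that person's supervision, all attached to the same institution and within the framework of the research project.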

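Citation information

@inproceedings{bryant-etal-2019-bea,
    title = "The {BEA}-2019 Shared Task on Grammatical Error Correction",
    author = "Bryant, Christopher and Felice, Mariano and Andersen, {\O}istein E. and Briscoe, Ted",
    booktitle = "Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4406",
    doi = "10.18653/v1/W19-4406",
    pages = "52--75",
    abstract = "This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC). As with the CoNLL-2014 shared task, participants are required to correct all types of errors in test data. One of the main contributions of the BEA-2019 shared task is the introduction of a new dataset, the Write{\&}Improve+LOCNESS corpus, which represents a wider range of English levels and abilities. Another contribution is the introduction of tracks, which control the amount of annotated data available to participants. Systems are evaluated in terms of ERRANT F{\_}0.5, which allows us to report a wider range of performance statistics. The competition was hosted on Codalab and remains open for submissions on the blind test set.",
}

Contributions

Thanks to @aseifert for adding this dataset.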

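Collected summary
Dataset introduction
Construction
This dataset combines data from the Cambridge English Write & Improve platform with the LOCNESS corpus. The Write & Improve platform collects writing submitted by non-native English students around the world; annotators manually annotated part of it and assigned CEFR levels. The LOCNESS corpus contains essays by native English students, a subset of which was annotated by W&I annotators, so that the dataset covers a wide range of English levels and abilities, from non-native to native.
Features
The dataset's main strengths are its broad coverage of proficiency levels, with writing samples from both non-native and native students, and its detailed error annotations. Each sample contains the original text together with its edits: the start and end position of each edit and its replacement text. The CEFR level labels also let researchers evaluate and compare error types and correction quality across proficiency levels.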
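As an illustration of the edit information described above, the following sketch pairs each edit with the span it replaces and gives it a coarse label. The labels (insertion, deletion, replacement, unspecified) and the function name `describe_edits` are illustrative assumptions, not categories defined by the dataset; offsets are assumed to be character offsets with an exclusive end. The sample is an excerpt of the wi example shown earlier on this page.

```python
def describe_edits(text, edits):
    """Return (kind, original_span, replacement) for every edit of one sample."""
    rows = []
    for start, end, replacement in zip(edits["start"], edits["end"], edits["text"]):
        original = text[start:end]  # the span the edit replaces
        if replacement is None:
            kind = "unspecified"    # assumption: span flagged without a correction
        elif start == end:
            kind = "insertion"      # empty span: text is added at this position
        elif replacement == "":
            kind = "deletion"       # span is removed
        else:
            kind = "replacement"
        rows.append((kind, original, replacement))
    return rows


text = ("My town is a medium size city with eighty thousand inhabitants. "
        "It has a high density population because its small territory.")
edits = {"start": [13, 77, 104], "end": [24, 78, 104],
         "text": ["medium-sized", "-", " of"]}

for row in describe_edits(text, edits):
    print(row)
```

The sliced `original` span plays the role of the original text of each edit, so it can be recomputed from `text`, `start`, and `end` when it is not stored.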
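Usage
The dataset is suited to the grammatical error correction (GEC) task, for both model training and evaluation. Choose the configuration you need (wi or locness) and use the provided training and validation splits for model development; note that locness provides a validation split only. The detailed edit annotations give models span-level information for detecting and correcting errors. The CEFR level labels can also be used to analyze and compare error patterns across proficiency levels.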
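A minimal sketch of the CEFR-based analysis mentioned above: it totals the number of edits per CEFR band. It assumes the band is the first character of the `cefr` label (for example "A" for "A2.i" and "N" for native texts), which matches the two labels shown on this page but is not confirmed for every label. The records are toy placeholders with shortened edit lists, not real dataset statistics, and `edits_per_band` is a hypothetical name.

```python
from collections import Counter


def edits_per_band(records):
    """Sum the number of edits per CEFR band across records."""
    totals = Counter()
    for record in records:
        band = record["cefr"][0]  # assumption: first character identifies the band
        totals[band] += len(record["edits"]["start"])
    return dict(totals)


# Toy records: labels taken from the examples on this page, edit lists shortened.
records = [
    {"cefr": "A2.i", "edits": {"start": [13, 77, 104]}},
    {"cefr": "N", "edits": {"start": [24, 39]}},
    {"cefr": "A2.i", "edits": {"start": [5]}},
]

print(edits_per_band(records))
# {'A': 4, 'N': 2}
```

Dividing each total by the number of texts in the band, or by their length, would give a rate that is easier to compare across bands.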
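Background and challenges
Background overview
The Cambridge English Write & Improve + LOCNESS dataset was developed by the University of Cambridge in connection with writing assistance for non-native learners of English. Its core research question is how to identify and correct grammatical errors in English writing effectively. Since the Write & Improve platform went live in 2014, annotators have manually annotated some of the texts submitted by students and assigned them CEFR levels, which produced this dataset. The LOCNESS corpus was compiled by researchers at the University of Louvain in Belgium and contains essays by native English students. By combining the two corpora, researchers can test the effectiveness of their systems across different English levels and abilities.
Current challenges
The challenges associated with this dataset centre on the grammatical error correction (GEC) task. First, accurately identifying and correcting every type of error, including grammatical, lexical, and spelling errors, is a complex problem. Second, ensuring accurate and consistent annotation during construction, and handling the diverse error types produced by students from different language backgrounds, is also difficult. Finally, the dataset is restricted to non-commercial use, which may limit its adoption in practical applications.
Common scenarios
Classic use cases
In language learning and education, the bea2019st/wi_locness dataset is widely used for the grammatical error correction (GEC) task. It combines writing by non-native English students from the Write & Improve platform with writing by native students from the LOCNESS corpus, providing text samples together with their error annotations. Researchers use the data to train and evaluate GEC models, with the aim of making automated writing assistance tools more accurate and more useful.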
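The shared-task abstract in the citation information states that systems are evaluated in terms of ERRANT F0.5. As a rough sketch of what that score means, the function below computes a general F-beta value from counts of true positive, false positive, and false negative edits; with beta = 0.5, precision is weighted more heavily than recall. The function name and the handling of empty counts are assumptions made for illustration, and the code does not perform the edit matching that a full evaluation requires.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute F-beta from edit-level counts (beta < 1 favours precision)."""
    # Assumption: with nothing proposed (or nothing to find), score that side as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy counts: 6 correct edits, 2 spurious edits, 4 missed edits.
print(round(f_beta(tp=6, fp=2, fn=4), 4))
# 0.7143
```

With these counts precision is 0.75 and recall is 0.6, so the F0.5 score of about 0.714 sits closer to precision than the balanced F1 score (about 0.667) does.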
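Practical applications
In practice, the bea2019st/wi_locness dataset supports the development and tuning of online writing assistance tools such as the Write & Improve platform. Such tools give English learners around the world instant, personalized feedback on their writing and help them improve their writing skills. The dataset can also be used to build educational software and applications that support classroom teaching and self-study.
Derived related work
Work building on the bea2019st/wi_locness dataset includes, among other things, improved models for grammatical error detection and correction, exploration of techniques for handling grammatical errors in multiple languages, and better human-computer interaction in writing assistance systems. This research has improved the performance of existing techniques and suggests directions for future educational technology.
The content above was collected and summarized by 遇见数据集.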