Rule-based Synthetic Data for Japanese GEC

Indexed by NIAID Data Ecosystem on 2026-03-12

Download link:

https://zenodo.org/record/4276129


Description:

Title: Rule-based Synthetic Data for Japanese GEC

Dataset Contents:

This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:

### Synthetic Corpus - *synthesized_data.tsv*

This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence, and the second is the corresponding correction.

These paired sentences are derived from data scraped from a keyword-lookup site. The data in this file is primarily intended to serve as, or to augment, a training set for a Japanese GEC model. Overall, the sentences cover a broad array of primarily simple Japanese grammatical errors.

### Teacher Corpus - *teacher_data.tsv*

This corpus file contains 6,345 parallel sentence pairs created via what we call the "teacher-sourcing" project [2]. The corpus sentences were created by Japanese language teachers, and this "teacher-sourcing" was funded by the Japan Foundation, Los Angeles.

The overall format of the file is similar to that of *synthesized_data.tsv*, with each line containing an erroneous sentence and a corresponding correction separated by a comma. In addition, each erroneous sentence and each correction sentence contains a pair of characters that delimit the specific location within the sentence where the error or correction occurs. For the erroneous sentence, these characters are `<` and `>`, while for the correction sentence, they are `(` and `)`.

For example, consider the following sentence pair:

- Error: <汚れる服>をあらいました。
- Correction: (汚れた服)をあらいました。

The delimiter characters indicate that the error phrase is `汚れる服`, while the corresponding correction is `汚れた服`.

These paired sentences were written to mimic common grammatical errors produced by Japanese language learners; thus this file's data is primarily intended to serve as an evaluation set for Japanese GEC models.

______

In addition, the dataset contains the rule file used to generate the synthetic data within *synthesized_data.tsv*:

### Rule File - *rule_set.tsv*

This file contains the 400 "syntactic rules" used to generate the data within *synthesized_data.tsv*. Each line contains a single rule, with different attributes delimited by tabs. Consult pages 41-66 of [1] for a more detailed analysis of these "syntactic rules" and the manner in which they are used to produce the synthetic data.
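The span markers described for *teacher_data.tsv* can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the helper names `split_pair`, `extract_spans`, and `strip_markers` are hypothetical, the markers are assumed to be the ASCII characters `<`, `>`, `(`, and `)`, and the field separator is left as a parameter because the description mentions a tab for one file and a comma for the other.

```python
import re

# Assumption: the span markers are plain ASCII characters, as shown in the
# description. < > wrap the error phrase, ( ) wrap the corrected phrase.
ERROR_SPAN = re.compile(r"<(.+?)>")
CORRECTION_SPAN = re.compile(r"\((.+?)\)")


def split_pair(line, sep="\t"):
    """Split one corpus line into (erroneous, corrected).

    `sep` is a parameter because the actual separator should be checked
    against the downloaded file before use.
    """
    erroneous, corrected = line.rstrip("\n").split(sep, 1)
    return erroneous, corrected


def extract_spans(erroneous, corrected):
    """Return (error phrase, correction phrase), or None where no marker is found."""
    err = ERROR_SPAN.search(erroneous)
    cor = CORRECTION_SPAN.search(corrected)
    return (err.group(1) if err else None, cor.group(1) if cor else None)


def strip_markers(erroneous, corrected):
    """Remove the span markers, leaving plain sentences for model input."""
    return (
        ERROR_SPAN.sub(r"\1", erroneous),
        CORRECTION_SPAN.sub(r"\1", corrected),
    )


# The example pair from the description, joined here with a tab for illustration.
line = "<汚れる服>をあらいました。\t(汚れた服)をあらいました。"
erroneous, corrected = split_pair(line)
spans = extract_spans(erroneous, corrected)
plain = strip_markers(erroneous, corrected)
print(spans)
print(plain)
```

Keeping the marked spans separate from the stripped sentences lets an evaluation script score both whole-sentence corrections and the specific edited phrase.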

Created:

2020-12-31
