JunSotohigashi/JWTD_misusing
Hugging Face · Updated 2025-12-10 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/JunSotohigashi/JWTD_misusing
Description:
---
dataset_info:
- config_name: default
  features:
  - name: page
    dtype: string
  - name: title
    dtype: string
  - name: pre_str
    dtype: string
  - name: post_str
    dtype: string
  - name: pre_bart_likelihood
    dtype: float64
  - name: post_bart_likelihood
    dtype: float64
  - name: category
    dtype: string
  - name: text_head
    dtype: string
  - name: text_tail
    dtype: string
  - name: typo_type
    dtype: string
  - name: hash
    dtype: string
  splits:
  - name: train
    num_bytes: 104880633
    num_examples: 291954
  download_size: 65618677
  dataset_size: 104880633
- config_name: filtered
  features:
  - name: page
    dtype: string
  - name: title
    dtype: string
  - name: pre_str
    dtype: string
  - name: post_str
    dtype: string
  - name: pre_bart_likelihood
    dtype: float64
  - name: post_bart_likelihood
    dtype: float64
  - name: category
    dtype: string
  - name: text_head
    dtype: string
  - name: text_tail
    dtype: string
  - name: typo_type
    dtype: string
  - name: hash
    dtype: string
  splits:
  - name: train
    num_bytes: 20498772.684210528
    num_examples: 57062
  - name: valid
    num_bytes: 4392748.105263158
    num_examples: 12228
  - name: test
    num_bytes: 4392388.868421053
    num_examples: 12227
  download_size: 22282280
  dataset_size: 29283909.657894738
- config_name: filtered_surveyed
  features:
  - name: page
    dtype: string
  - name: title
    dtype: string
  - name: pre_str
    dtype: string
  - name: post_str
    dtype: string
  - name: pre_bart_likelihood
    dtype: float64
  - name: post_bart_likelihood
    dtype: float64
  - name: category
    dtype: string
  - name: text_head
    dtype: string
  - name: text_tail
    dtype: string
  - name: typo_type
    dtype: string
  - name: hash
    dtype: string
  splits:
  - name: train
    num_bytes: 917490.8947368421
    num_examples: 2554
  - name: valid
    num_bytes: 194706.36842105264
    num_examples: 542
  - name: test
    num_bytes: 194347.13157894736
    num_examples: 541
  download_size: 1002808
  dataset_size: 1306544.3947368423
- config_name: post_processed
  features:
  - name: page
    dtype: string
  - name: title
    dtype: string
  - name: pre_str
    dtype: string
  - name: post_str
    dtype: string
  - name: pre_bart_likelihood
    dtype: float64
  - name: post_bart_likelihood
    dtype: float64
  - name: category
    dtype: string
  - name: text_head
    dtype: string
  - name: text_tail
    dtype: string
  - name: typo_type
    dtype: string
  - name: hash
    dtype: string
  splits:
  - name: train
    num_bytes: 29283909.657894738
    num_examples: 81517
  download_size: 20565161
  dataset_size: 29283909.657894738
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: filtered
  data_files:
  - split: train
    path: filtered/train-*
  - split: valid
    path: filtered/valid-*
  - split: test
    path: filtered/test-*
- config_name: filtered_surveyed
  data_files:
  - split: train
    path: filtered_surveyed/train-*
  - split: valid
    path: filtered_surveyed/valid-*
  - split: test
    path: filtered_surveyed/test-*
- config_name: post_processed
  data_files:
  - split: train
    path: post_processed/train-*
license: cc-by-sa-3.0
language:
- ja
size_categories:
- 10K<n<100K
---
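# Japanese Misuse Dataset
## Overview
This dataset collects examples of misused Japanese words.
It is based on the [Japanese Wikipedia Typo Dataset (v2)](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9EWikipedia%E5%85%A5%E5%8A%9B%E8%AA%A4%E3%82%8A%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) (JWTD), published by the Language Media Processing Lab at Kyoto University, with additional filtering applied.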
## About each configuration
### default
Entries whose category is kanji-conversion_a or kanji-conversion_b were extracted from JWTD, and [tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3) was used to tag each entry as either CognitiveError (an error caused by a misunderstanding) or KeystrokeError (an error caused by an input operation).
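
The category selection described above can be sketched as a simple filter. This is a minimal illustration rather than the author's actual pipeline: the function name `select_kanji_conversion` and the toy records are hypothetical, and the LLM tagging step is not reproduced because the card does not give the prompt.

```python
from typing import Iterable, Iterator

# The two category labels named in the description above.
KANJI_CONVERSION_CATEGORIES = {"kanji-conversion_a", "kanji-conversion_b"}


def select_kanji_conversion(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records whose `category` is a kanji-conversion label."""
    for record in records:
        if record.get("category") in KANJI_CONVERSION_CATEGORIES:
            yield record


# Hypothetical toy records (not taken from JWTD); only `category` matters here.
sample = [
    {"category": "kanji-conversion_a", "pre_str": "以外", "post_str": "意外"},
    {"category": "some_other_category", "pre_str": "x", "post_str": "xy"},
    {"category": "kanji-conversion_b", "pre_str": "保証", "post_str": "保障"},
]
selected = list(select_kanji_conversion(sample))
print(len(selected))  # 2
```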
### post_processed
Filtered with the following conditions:
- typo_type is CognitiveError
- pre_str or post_str is not contained in text_head
- pre_str and post_str are not proper nouns (determined with MeCab + UniDic)
- text_head is at least 20 characters long
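
The four conditions above could be combined into a single predicate, sketched below under stated assumptions. The containment condition is read here as "neither `pre_str` nor `post_str` occurs in `text_head`", which is only one possible reading of the card. The proper-noun check is passed in as a callable so the sketch stays self-contained; `toy_is_proper_noun` is a hypothetical stand-in for a MeCab + UniDic lookup, and `keep_record` is likewise an illustrative name. The length check uses `text_head`, the field name that appears in the schema.

```python
from typing import Callable


def keep_record(record: dict, is_proper_noun: Callable[[str], bool]) -> bool:
    """Return True if a record passes all four post_processed conditions."""
    pre, post, head = record["pre_str"], record["post_str"], record["text_head"]
    if record["typo_type"] != "CognitiveError":
        return False
    # Assumed reading: neither string may already occur in the preceding context.
    if pre in head or post in head:
        return False
    if is_proper_noun(pre) or is_proper_noun(post):
        return False
    # len() counts Unicode code points, i.e. characters for typical Japanese text.
    return len(head) >= 20


def toy_is_proper_noun(word: str) -> bool:
    """Hypothetical stand-in for a MeCab + UniDic part-of-speech lookup."""
    return word in {"東京"}


# Hypothetical toy record; the context is 20 filler characters.
base = {
    "typo_type": "CognitiveError",
    "pre_str": "以外",
    "post_str": "意外",
    "text_head": "あ" * 20,
}
print(keep_record(base, toy_is_proper_noun))  # True
print(keep_record({**base, "text_head": "あ" * 19}, toy_is_proper_noun))  # False
```

In a real pipeline the callable would wrap a morphological analyzer and check whether the token's part of speech is a proper noun.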
### filtered
post_processed split into train:valid:test = 70:15:15.
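
As a sanity check on the ratio, a 70:15:15 split of the 81,517 post_processed examples can be sketched as follows. The rounding rule, the shuffle, and the seed are assumptions, since the card does not say how the split was drawn; the names `split_sizes` and `split_70_15_15` are illustrative. With this rounding rule, the sizes match the split counts listed in the metadata.

```python
import random


def split_sizes(n: int) -> tuple[int, int, int]:
    """Return (train, valid, test) sizes for a 70:15:15 split of n items."""
    train = (n * 70 + 50) // 100  # round half up using integer arithmetic
    valid = (n * 15 + 50) // 100
    test = n - train - valid      # remainder, so the sizes always sum to n
    return train, valid, test


def split_70_15_15(items: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle a copy of `items` and cut it into train/valid/test parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    train, valid, _ = split_sizes(len(shuffled))
    return (
        shuffled[:train],
        shuffled[train:train + valid],
        shuffled[train + valid:],
    )


print(split_sizes(81517))  # (57062, 12228, 12227)
```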
### filtered_surveyed
A subset drawn from filtered, on which a questionnaire survey of human respondents was conducted.
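
The configurations described above can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch: `load_split` and `CONFIG_SPLITS` are hypothetical helper names, the repository ID is the one in this card's title, and the config and split names are copied from the card's metadata. The library is imported lazily, so nothing is downloaded until `load_split` is called.

```python
# Config name -> split names, as listed in the card's `configs` metadata.
CONFIG_SPLITS = {
    "default": ["train"],
    "filtered": ["train", "valid", "test"],
    "filtered_surveyed": ["train", "valid", "test"],
    "post_processed": ["train"],
}


def load_split(config: str, split: str):
    """Load one split of one configuration from the Hugging Face Hub."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily; requires `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset("JunSotohigashi/JWTD_misusing", config, split=split)
```

For example, `load_split("filtered", "test")` would fetch the test split of the filtered configuration, while `load_split("default", "test")` raises `ValueError` because default has only a train split.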
Provider:
JunSotohigashi



