bennexx/WJTSentDiL

Name: bennexx/WJTSentDiL
Creator: bennexx
Published: 2024-08-18 14:02:41
License: cc-by-sa-4.0

Source: Hugging Face. Updated: 2024-08-18. Indexed: 2024-07-22.

Download link:

https://hf-mirror.com/datasets/bennexx/WJTSentDiL

Description:

The WJTSentDiL dataset (a corpus of Wikipedia, JpWaC, and Tatoeba Sentences with Difficulty Level) contains Japanese sentences collected from various online sources and processed to make them more suitable as example sentences for second-language (L2) learners of Japanese. It is distributed in several configurations, each with its own data fields: `main_data` holds the Japanese sentences and their corresponding JLPT levels, `tokenized_data` holds the tokenized and lemmatized sentences, `sentences_only` holds only the Japanese sentences, and `sources.csv` records the source of each sentence.

Processing includes, among other steps, removing duplicates and limiting the proportion of punctuation and digits in a sentence. According to the dataset statistics, 97% of the sentences come from Japanese Wikipedia, the average sentence length is 26 tokens, and the average kanji ratio is 37%. The dataset is licensed under cc-by-sa-4.0 and requires citing a specific research article.
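The processing mentioned in the description (removing duplicates and limiting the share of punctuation and digits) can be illustrated with a minimal Python sketch. This is not the dataset author's pipeline: the function names, the 0.3 threshold, and the character-level definition of the kanji ratio are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical threshold; the value actually used for the dataset is not stated.
MAX_PUNCT_DIGIT_RATIO = 0.3


def punct_digit_ratio(sentence: str) -> float:
    """Share of characters that are digits or Unicode punctuation."""
    if not sentence:
        return 0.0
    flagged = sum(
        1
        for ch in sentence
        if ch.isdigit() or unicodedata.category(ch).startswith("P")
    )
    return flagged / len(sentence)


def kanji_ratio(sentence: str) -> float:
    """Share of characters in the CJK Unified Ideographs block (assumed definition)."""
    if not sentence:
        return 0.0
    kanji = sum(1 for ch in sentence if "\u4e00" <= ch <= "\u9fff")
    return kanji / len(sentence)


def filter_sentences(sentences, max_ratio=MAX_PUNCT_DIGIT_RATIO):
    """Drop exact duplicates and sentences dominated by punctuation or digits."""
    seen = set()
    kept = []
    for raw in sentences:
        sentence = raw.strip()
        if not sentence or sentence in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(sentence)
        if punct_digit_ratio(sentence) > max_ratio:
            continue  # too much punctuation or too many digits
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = ["猫が好きです。", "猫が好きです。", "12345、。!!"]
    for s in filter_sentences(sample):
        print(s, round(kanji_ratio(s), 2))
```

The sketch deduplicates on exact string match only; near-duplicate detection or token-based ratios would need a tokenizer and are left out here.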

Provided by:

bennexx
