
bennexx/WJTSentDiL

Hugging Face | Updated 2024-08-18 | Indexed 2024-07-22
Download link:
https://hf-mirror.com/datasets/bennexx/WJTSentDiL
Description:
WJTSentDiL (a corpus of Wikipedia, JpWaC, and Tatoeba Sentences with Difficulty Level) contains Japanese sentences collected from various online sources and processed to make them more suitable as example sentences for second-language (L2) learners of Japanese. The dataset is split into several configurations, each with its own data fields: `main_data` holds the Japanese sentences and their corresponding JLPT levels, `tokenized_data` holds the tokenized and lemmatized sentences, `sentences_only` holds only the Japanese sentences, and `sources.csv` records the sources of the sentences. Processing includes removing duplicates and limiting the proportion of punctuation and digits, among other steps. According to the dataset statistics, 97% of the sentences come from Japanese Wikipedia, the average sentence length is 26 tokens, and the average kanji ratio is 37%. The dataset is licensed under cc-by-sa-4.0 and requires citing a specific research article.
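The description mentions duplicate removal, limits on punctuation and digits, and a kanji ratio, but does not give the exact procedure. The following is a minimal Python sketch of what such a filter could look like. The function names, the 0.2 threshold, and the per-character definition of both ratios are assumptions made for illustration, not the dataset authors' actual pipeline.

```python
import unicodedata


def kanji_ratio(sentence: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block (U+4E00-U+9FFF).

    Assumption: the ratio is computed per character; the dataset may define it differently.
    """
    if not sentence:
        return 0.0
    kanji = sum(1 for ch in sentence if "\u4e00" <= ch <= "\u9fff")
    return kanji / len(sentence)


def punct_digit_ratio(sentence: str) -> float:
    """Fraction of characters whose Unicode category is punctuation (P*) or number (N*)."""
    if not sentence:
        return 0.0
    count = sum(1 for ch in sentence if unicodedata.category(ch)[0] in ("P", "N"))
    return count / len(sentence)


def filter_sentences(sentences, max_punct_digit_ratio=0.2):
    """Drop exact duplicates (keeping first occurrences) and sentences that are
    mostly punctuation or digits. The threshold is a hypothetical value."""
    seen = set()
    kept = []
    for s in sentences:
        if s in seen:
            continue
        seen.add(s)
        if punct_digit_ratio(s) <= max_punct_digit_ratio:
            kept.append(s)
    return kept


if __name__ == "__main__":
    sample = ["猫が好きです。", "猫が好きです。", "123、456、789。"]
    print(filter_sentences(sample))
    print(round(kanji_ratio("猫が好きです。"), 3))
```

In this sketch the duplicate sentence and the digit-heavy line are both dropped, and the remaining sentence has two kanji out of seven characters.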
Provider:
bennexx