BW/spellcheck_benchmark_actualized
Hugging Face | Updated 2026-04-29 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/BW/spellcheck_benchmark_actualized
Description:
The Russian Spellcheck Benchmark comprises four sub-datasets. Each consists of sentence pairs in Russian: a source sentence that may contain spelling errors and its corrected counterpart. The sentences were gathered from a range of sources and domains, including social networks, internet blogs, GitHub commits, medical anamneses, literature, news, and reviews. Every sub-dataset went through a two-stage manual annotation and validation pipeline on the Toloka crowdsourcing platform. The benchmark targets automatic spelling correction and comes with evaluation metrics for that task. Its documentation describes the data instances, data fields, and data splits, and discusses social impact, biases, and known limitations. The authors plan to extend the benchmark to other languages. The dataset is released under the MIT License.
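To make the pair structure concrete, here is a minimal sketch of how a spelling corrector might be scored against (source, correction) pairs. It uses sentence-level exact-match accuracy as a simple stand-in; this is not necessarily one of the metrics the benchmark itself provides. The sample pairs, the function names, and the toy correctors are all hypothetical and are not taken from the dataset.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical (source, correction) pairs for illustration only;
# they are NOT drawn from the benchmark.
PAIRS = [
    ("превет мир", "привет мир"),
    ("как дела", "как дела"),
    ("спосибо большое", "спасибо большое"),
]


def exact_match_accuracy(
    pairs: Iterable[Tuple[str, str]],
    corrector: Callable[[str], str],
) -> float:
    """Fraction of pairs where the corrector output equals the reference."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for source, target in pairs if corrector(source) == target)
    return hits / len(pairs)


def identity_corrector(sentence: str) -> str:
    """Baseline that changes nothing; it only scores on error-free sources."""
    return sentence


def toy_dictionary_corrector(sentence: str) -> str:
    """Toy corrector with a single hard-coded fix (an assumption, not a real model)."""
    fixes = {"превет": "привет"}
    return " ".join(fixes.get(word, word) for word in sentence.split())


if __name__ == "__main__":
    print(exact_match_accuracy(PAIRS, identity_corrector))
    print(exact_match_accuracy(PAIRS, toy_dictionary_corrector))
```

Because a source sentence may already be correct, the identity baseline scores above zero, which is worth keeping in mind when reading any accuracy figure on data of this shape.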
Provided by:
BW



