techiaith/finepdfs-cy-errors
Source: Hugging Face. Updated 2026-01-30; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/techiaith/finepdfs-cy-errors
Description:
This dataset contains Welsh-language text extracted from PDFs using rolmOCR, with spelling and grammar errors annotated automatically by Cysill (the Welsh spell checker). It is derived from the Welsh (cym_Latn) subset of HuggingFaceFW/finepdfs and is filtered to include only documents processed with the rolmOCR extractor. Its fields include the text content, the list of errors, and the error count. It can be used to evaluate OCR quality, to analyze common errors in Welsh, and to train or evaluate spelling and grammar checking models.
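As a rough illustration of the OCR-quality use case, the sketch below computes an error rate per 1,000 words from records that carry a text field and an error-count field. The field names `text` and `error_count` are assumptions made for this example; the listing does not give the actual column names, so check the dataset's schema and adjust them before use. The sample records are invented placeholders, not rows from the dataset.

```python
from typing import Iterable, Mapping


def errors_per_1000_words(
    records: Iterable[Mapping],
    text_field: str = "text",          # assumed column name, not confirmed by the listing
    count_field: str = "error_count",  # assumed column name, not confirmed by the listing
) -> float:
    """Return the number of flagged errors per 1,000 whitespace-separated words."""
    total_words = 0
    total_errors = 0
    for record in records:
        total_words += len(record[text_field].split())
        total_errors += int(record[count_field])
    if total_words == 0:
        return 0.0
    return 1000.0 * total_errors / total_words


# Invented placeholder records, used only to show the expected input shape.
sample = [
    {"text": "Mae hwn yn enghraifft o destun Cymraeg", "error_count": 1},
    {"text": "Dyma frawddeg arall", "error_count": 0},
]

rate = errors_per_1000_words(sample)
print(f"{rate:.1f} errors per 1,000 words")
```

The same function should work on rows loaded with the Hugging Face `datasets` library, provided the two field-name arguments are set to the dataset's real column names.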
Provider:
techiaith



