
openai/welsh-texts

Source: Hugging Face. Updated 2024-09-23. Indexed 2025-04-08.
Download link:
https://hf-mirror.com/datasets/openai/welsh-texts
Description:
This dataset contains a variety of printed and handwritten material from Welsh sources, primarily in the Welsh language. It includes Drych y Prif Oesoedd (a book on the early history of Wales, published in 1716); Enwogion Cymreig (a book cataloging prominent figures in Welsh history, published in 1907); Cronicl Elis Gruffudd (a handwritten manuscript on history, published in 1552); bibliographic index cards (bitonal digitizations from microfiche, in a "blue" set of lower legibility and a "green" set of better legibility); emigrant letters from Henry Jones, who migrated to Holland Patent, New York, in 1850 (handwritten, mostly in Welsh); and American Civil War letters from members of the 23rd and 24th Regiments of the Wisconsin Volunteers (Union Army). The dataset is packaged in Parquet format, with each source formatted as a collection of image files (JPEG2000 or PNG). For a subset of the material (Drych y Prif Oesoedd, Enwogion Cymreig, and the bibliographic catalog index cards), we also provide OCR transcriptions from GPT-4o. These transcriptions are known to contain some errors, but they may be helpful for text searching or as a baseline for developing other OCR systems.
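
Since the supplied transcriptions are described as a possible baseline for other OCR systems, one common way to score such a baseline is the character error rate (CER) against a hand-corrected reference. The sketch below is only an illustration under stated assumptions: the function names and the example strings are hypothetical, and nothing here is taken from the dataset's own tooling or schema.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions to turn a into b."""
    # Only the previous row of the dynamic-programming table is kept in memory.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the two strings, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical strings for illustration; they are not rows from the dataset.
reference = "Drych y Prif Oesoedd"   # assumed hand-corrected text
hypothesis = "Drych y Prif Oesoeld"  # assumed OCR output with one wrong character
print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

A lower CER means the OCR output is closer to the reference. In practice, both strings are often normalized first (for example, by collapsing whitespace), but whether to do so is a design choice that depends on the evaluation.
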
Provider:
openai