Dataset for Logical-layout analysis on historical newspapers
Source: NIAID Data Ecosystem, indexed 2026-03-13
Download link:
https://zenodo.org/record/5560765
Resource description:
Dataset for Logical-layout analysis on French Historical Newspapers
This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).
Description
This dataset is divided into a train set and a test set. Both sets were designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche Comté" dataset. To do so, we have divided them into three layout types:
1c: documents where the text is displayed in one column, as in books;
2c: documents where the text is displayed in two columns;
3c+: documents where there are at least 3 columns of text, as in newspapers.
Each of these folders contains subfolders whose names start with the letters ‘cb’. Each such name is the identifier of a newspaper collection such as « Le Petit Semeur ». Each of these folders contains an XML file describing the collection, but this file is not relevant to logical-layout analysis. They also contain subfolders whose names start with the letters ‘bpt’, which contain the following files:
XXX.xml: the original XML file as gathered from Gallica.
truelabels_block: a CSV file giving the true label of each TextBlock tag. Each line contains the page, the block_id, the first and last lines of text of the block, and its label.
truelabels_line: a CSV file giving the true label of each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label.
XXX_docbook.xml: the document after having been processed by a Logical Layout recognition system.
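The folder layout described above can be walked with a few lines of Python. This is only a sketch under stated assumptions: the root path is whatever directory the archive was extracted to, the exact nesting above the ‘cb’ folders is not checked, and the helper names find_documents and group_files are hypothetical. Files are assigned a role from their names alone, with loose matching because the exact CSV file names may differ from the ones listed here.

```python
from pathlib import Path


def find_documents(root):
    """Yield (layout, collection_id, document_dir) for each 'bpt' folder under root.

    Assumes each 'bpt' folder sits inside a 'cb' folder, which itself sits
    inside a layout folder such as '1c', '2c' or '3c+'.
    """
    for doc_dir in sorted(Path(root).rglob("bpt*")):
        if not doc_dir.is_dir():
            continue  # 'bpt*.xml' files also match the pattern
        collection_dir = doc_dir.parent
        if not collection_dir.name.startswith("cb"):
            continue
        yield collection_dir.parent.name, collection_dir.name, doc_dir


def group_files(doc_dir):
    """Map each role to the first matching file of a document folder (or None)."""
    roles = {"original": None, "docbook": None,
             "labels_block": None, "labels_line": None}
    for path in sorted(Path(doc_dir).iterdir()):
        name = path.name.lower()
        if "label" in name and "block" in name:
            role = "labels_block"
        elif "label" in name and "line" in name:
            role = "labels_line"
        elif name.endswith("_docbook.xml"):
            role = "docbook"
        elif name.endswith(".xml"):
            role = "original"
        else:
            continue
        if roles[role] is None:
            roles[role] = path
    return roles
```

Calling list(find_documents("path/to/dataset")) then gives one entry per document, which can be filtered by layout type before loading any XML.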
The original XML gathers several kinds of information about the document, in particular the metadata (described using the Dublin Core schema), the page numbering, and the OCR output, which is described in the XML ALTO format. As such, the files already provide the physical layout analysis and the reading order of the documents.
The XML ALTO format provides the text content and physical layout of documents in the following manner.
The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
Id: the tag's identifier
Height, Width: the text height and width
Vpos: the vertical position of the text on the page. The higher the value, the lower the text is on the page
Hpos: the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
Language: the language of the text (only for TextBlock tags).
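The tag structure above maps directly onto a small parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not the authors' tooling. It assumes that each word is stored in the CONTENT attribute of its String tag (as in the ALTO standard), that attribute names may be written in upper case in the files (ID, HPOS, VPOS), and that the XML namespace is unknown, so tags are matched on their local name only. The helper names local, attr, num and iter_lines are hypothetical, and hyphenation markers are ignored.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Strip an XML namespace prefix: '{ns}TextLine' -> 'TextLine'."""
    return name.rsplit("}", 1)[-1]


def attr(elem, name, default=None):
    """Case-insensitive attribute lookup, e.g. 'Vpos' also matches 'VPOS'."""
    wanted = name.lower()
    for key, value in elem.attrib.items():
        if local(key).lower() == wanted:
            return value
    return default


def num(elem, name):
    """Return a numeric attribute as float, or None if missing or malformed."""
    try:
        return float(attr(elem, name))
    except (TypeError, ValueError):
        return None


def iter_lines(xml_source):
    """Yield one dict per TextLine, in document order.

    xml_source is a file path or a binary file object.
    """
    for block in ET.parse(xml_source).iter():
        if local(block.tag) != "TextBlock":
            continue
        for line in block:
            if local(line.tag) != "TextLine":
                continue
            # Words are joined with single spaces; SP tags are not inspected.
            words = [attr(s, "CONTENT", "") for s in line
                     if local(s.tag) == "String"]
            yield {
                "block_id": attr(block, "ID"),
                "line_id": attr(line, "ID"),
                "hpos": num(line, "HPOS"),
                "vpos": num(line, "VPOS"),
                "width": num(line, "WIDTH"),
                "height": num(line, "HEIGHT"),
                "text": " ".join(words),
            }
```

Because the lines are yielded in document order, the reading order already encoded in the file is preserved, and no sorting on Vpos or Hpos is needed.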
The blocks of text are labelled as Text, Title, Header or Other. The lines of text are labelled as Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. These labels are used in the truelabels_line and truelabels_block CSV files and in the XXX_docbook.xml files.
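As a quick sanity check on the label sets above, the label distribution of a true-label CSV file can be counted and any unexpected label reported. This is a sketch under assumptions that the description does not confirm: the files are taken to be comma-delimited, with the label as the last field of each row, and the presence of a header row is left as a parameter. The names BLOCK_LABELS, LINE_LABELS and count_labels are hypothetical.

```python
import csv
from collections import Counter

# Label sets as listed in the dataset description.
BLOCK_LABELS = {"Text", "Title", "Header", "Other"}
LINE_LABELS = BLOCK_LABELS | {"Firstline"}


def count_labels(csv_file, allowed, has_header=False):
    """Count the labels found in an open CSV file.

    Assumes the label is the last field of each row. Returns a Counter of
    the allowed labels and a list of (line_number, label) pairs for any
    label outside the allowed set.
    """
    counts = Counter()
    unknown = []
    rows = csv.reader(csv_file)
    if has_header:
        next(rows, None)  # skip the header row, if the file has one
    for row in rows:
        if not row:
            continue  # ignore blank lines
        label = row[-1].strip()
        if label in allowed:
            counts[label] += 1
        else:
            unknown.append((rows.line_num, label))
    return counts, unknown
```

A file would typically be opened with open(path, newline="", encoding="utf-8") before being passed in; the encoding is also an assumption. A non-empty list of unknown labels usually means the column or delimiter assumption does not hold for that file.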
You can access the original scan of every document on the Gallica website. To do so, use the following URL, replacing the part with the id of the document (e.g. bpt6k76208717): https://gallica.bnf.fr/ark:/12148/
Created:
2021-12-03



