
French Fiction of the 16-18th century

Indexed by NIAID Data Ecosystem on 2026-03-13
Download link:
https://zenodo.org/record/5770865
Description:
A corpus containing all digitized French novels from the beginning of print (the first entry is from 1473) to the 18th century. French novels of the period have been identified using the Y2 shelfmark (cote) of the French National Library catalog, which served to classify past and present collections of novels in France from 1730 to 1996. Combining digitized sources from Gallica, Google Books, Archive.org and other digital libraries made it possible to attain high representativeness: 78% of the novels from 1450-1600 and 68% of the novels from 1600-1700 have been retrieved.

The corpus is part of a planned collection of French Fiction (1050-1920) that will also integrate Geste (a medieval corpus curated by Jean-Baptiste Camps) and Fictions littéraires de Gallica (a 1600-1950 corpus extracted from Gallica with Pierre-Carl Langlais, with a strong focus on the 19th century). While it aims to bridge the two pre-existing parts of the collection, it is also a more ambitious experiment in the systematic collection of existing digital sources.

The project remains very much a work in progress at this stage. Occasional errors in the metadata and in the identification of unique works are still possible. In addition, identifying multi-volume works remains challenging in digital sources other than Gallica.

The repository includes the following files:

The metadata of available and unavailable files for all novels identified in the 16th century (corpus_roman_metadata_16.tsv) and the 17th century (corpus_roman_metadata_17.tsv). All editions have been tentatively assigned to a unique work (work_id) based on the title, the author and additional metadata. This dataset includes information both on a specific digitized volume (volume_file, volume_title, volume_date, volume_edition_id) and on the earliest edition of the work recorded by the French National Library (first_edition, first_edition_titre, first_edition_date), as well as the identification of the author (prenom_auteur, nom_auteur) and the complete list of all available editions (list_edition_bnf). When no digitized file is available for a given work, the volume information is replaced with a missing-data mark (NA). An edition-based dataset was initially contemplated, but it turned out to be much harder than expected: the French National Library catalog does not record all the available editions and print runs of the period, and it would have been necessary to check and create unique edition IDs for numerous Google Books volumes.

The complete text of the novels, when available (corpus_roman_16_text.tsv and corpus_roman_17_text.tsv). Contemporary OCR software has long yielded poor results on early modern texts, as the words, typography and even letter forms differ markedly from the corpora this software was trained on. Consequently, numerous volumes from Gallica simply have no OCR, as the results fell below the quality requirements of the digital library. New historical OCR models will make it possible to produce reliable OCR for the entire corpus. The dataset includes all the text at the page level whenever there is some text on the page. Page numbering is based on the absolute numbering of the file, not on the original numbering of the edition.

A classified dataset of 159 novels from the 17th century in four major genres of the period: chivalric novel, love novel, historical novel and comic novel. The classification is based on an exceptional source from 1731, the catalog of novels by Nicolas Lenglet du Fresnoy (published as the second volume of De l'usage des romans). The classified dataset includes both the text (as in corpus_roman_17_text.tsv) at the page level and the lemmatization produced with a syntactic model trained on 17th-century French (https://github.com/e-ditiones/LEM17).

A classification model created with the classified dataset. This "Fresnoy" model has a high accuracy (93%), which can be partly attributed to overfitting (as there is a limited number of novels per genre). The model can be reused with Tidysupervise, a small R extension for creating supervised text models.
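As an illustration of the work-level structure of the metadata files described above, here is a minimal Python sketch that groups digitized volumes by work_id and treats NA in volume_file as "no digitized file available". The column names work_id, volume_file, volume_title, volume_date and nom_auteur come from the description; the sample rows, the helper names volumes_by_work and coverage, and the assumption that the real files start with a header row are hypothetical and not taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows for illustration only; the real data lives in
# corpus_roman_metadata_16.tsv and corpus_roman_metadata_17.tsv.
SAMPLE_TSV = (
    "work_id\tvolume_file\tvolume_title\tvolume_date\tnom_auteur\n"
    "w1\tfile_a\tTitle A\t1532\tAuthor1\n"
    "w1\tfile_b\tTitle A\t1540\tAuthor1\n"
    "w2\tNA\tNA\tNA\tAuthor2\n"
    "w3\tfile_c\tTitle C\t1588\tAuthor3\n"
)


def volumes_by_work(handle):
    """Map each work_id to the list of its digitized volume files.

    Rows whose volume_file is the missing-data mark "NA" still register
    the work, but contribute no file.
    """
    works = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        volumes = works[row["work_id"]]  # creates the entry if absent
        if row["volume_file"] != "NA":
            volumes.append(row["volume_file"])
    return dict(works)


def coverage(works):
    """Share of works that have at least one digitized volume."""
    if not works:
        return 0.0
    digitized = sum(1 for files in works.values() if files)
    return digitized / len(works)


works = volumes_by_work(io.StringIO(SAMPLE_TSV))
print(works)
print(f"{coverage(works):.0%} of works have a digitized volume")
```

To run this against a downloaded file, the in-memory sample could be replaced with something like `open("corpus_roman_metadata_16.tsv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption to verify against the actual file.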
Created:
2021-12-10