Corpus Nummorum - Natural Language Processing Dataset

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/13785725


Description:

This Natural Language Processing (NLP) dataset contains a part of the MySQL Corpus Nummorum (CN) database. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, Troad and Mysia. The dataset contains 7,900 coin descriptions (or designs) created by the members of the CN project. Most of them are actual coin designs which can be linked through our relational database to the matching CN coins, types and their images. However, some of them (about 450) were created only for the training of the NLP model.

There are nine different MySQL tables:

1. data_coins: contains the data of all coins in the CN database
2. data_coins_images: contains the data of all images in the CN database
3. data_coins_imagesets: contains the image pairs for the CN coins
4. data_designs: contains every coin description in German, English and Bulgarian
5. data_types: contains the data of all coin types in the CN database
6. nlp_hierarchy: contains the classes and subclasses of all entity categories
7. nlp_list_entities: contains the data of all NLP entities in the CN database
8. nlp_relation_extraction_en_v2: contains the annotations for the training of our NLP model
9. nlp_training_designs: contains the coin designs used for training our NLP model

Only tables 8 and 9 are important for NLP training, as they contain the descriptions and the corresponding annotations. The other tables (data_...) make it possible to link the coin descriptions with the various coins and types in the CN database. It is therefore also possible to provide the CN image data sets with the appropriate descriptions (CN - Coin Image Dataset and CN - Object Detection Coin Dataset). The other NLP tables provide information about the entities and relations in the descriptions and are used to create the RDF data for the nomisma.org portal. The tables of the relational CN database can be related via the various ID columns using foreign keys.
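As a rough illustration of relating tables through ID columns, the sketch below joins coins to their descriptions. It rests on several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table names data_coins and data_designs come from the list above, but the column names (id_coin, id_design, design_en) and all rows are hypothetical placeholders. The actual schema should be checked in the MySQL dump before adapting this.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL dump.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_designs (
    id_design INTEGER PRIMARY KEY,   -- hypothetical column name
    design_en TEXT                   -- hypothetical column name
);
CREATE TABLE data_coins (
    id_coin   INTEGER PRIMARY KEY,   -- hypothetical column name
    id_design INTEGER REFERENCES data_designs(id_design)
);
-- Placeholder rows, not taken from the dataset.
INSERT INTO data_designs VALUES (1, 'placeholder description A'),
                                (2, 'placeholder description B');
INSERT INTO data_coins VALUES (10, 1), (11, 1), (12, 2);
""")

# Follow the foreign key from each coin to its description.
rows = conn.execute("""
    SELECT c.id_coin, d.design_en
    FROM data_coins AS c
    JOIN data_designs AS d ON d.id_design = c.id_design
    ORDER BY c.id_coin
""").fetchall()

for id_coin, design_en in rows:
    print(id_coin, design_en)
```

The same JOIN pattern should carry over to MySQL once the real key columns are substituted.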
For easier access without MySQL, we have attached two CSV files with the descriptions in English and German and the annotations for the English designs. The annotations can be related to the descriptions via the Design_ID column.

During the summer semester 2024, we held the "Data Challenge" event at our Department of Computer Science at the Goethe-University. Our students could choose between the Object Detection dataset and a Natural Language Processing dataset as their challenge. We gave the teams that decided to take part in the NLP challenge this dataset with the task of trying out their own ideas. Here are the results:

- LLM_RE Pipeline
- Coin description embeddings
- NLP coin app

Now we would like to invite you to try out your own ideas and models on our coin data. If you have any questions or suggestions, please feel free to contact us.
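For readers working from the CSV files, a minimal sketch of relating annotations to descriptions through Design_ID might look like the following. Only the Design_ID column is named in the description above. The other column names (design_en, subject, predicate, object) and the inline rows are made-up placeholders, and the in-memory strings stand in for the two attached files, whose names are not given here.

```python
import csv
import io
from collections import defaultdict


def link_annotations(descriptions_csv, annotations_csv, key="Design_ID"):
    """Return {design_id: (description_row, [annotation_rows])}."""
    # Group all annotation rows by their design ID.
    annotations = defaultdict(list)
    for row in csv.DictReader(annotations_csv):
        annotations[row[key]].append(row)

    # Attach the grouped annotations to each description row.
    linked = {}
    for row in csv.DictReader(descriptions_csv):
        linked[row[key]] = (row, annotations.get(row[key], []))
    return linked


# Placeholder data for illustration only; not taken from the dataset.
descriptions = io.StringIO(
    "Design_ID,design_en\n"
    "1,placeholder description A\n"
    "2,placeholder description B\n"
)
annotations = io.StringIO(
    "Design_ID,subject,predicate,object\n"
    "1,s1,p1,o1\n"
    "1,s2,p2,o2\n"
)

linked = link_annotations(descriptions, annotations)
for design_id, (design, annos) in linked.items():
    print(design_id, design["design_en"], len(annos))
```

With the real files, the io.StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8"), assuming the files are UTF-8 encoded.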

Created:

2024-10-17
