Corpus Nummorum - Natural Language Processing Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13785725
Description:
This Natural Language Processing (NLP) dataset contains a part of the MySQL Corpus Nummorum (CN) database. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, Troad and Mysia.
The dataset contains 7,900 coin descriptions (or designs) created by the members of the CN project. Most of them are actual coin designs which can be linked through our relational database to the matching CN coins, types and their images. However, some of them (about 450) were only created for the training of the NLP model.
There are nine different MySQL tables:
data_coins: contains the data of all coins in the CN database
data_coins_images: contains data of all images in the CN database
data_coins_imagesets: contains the image pairs for the CN coins
data_designs: contains every coin description in German, English and Bulgarian
data_types: contains the data of all coin types in the CN database
nlp_hierarchy: contains the classes and subclasses of all entity categories
nlp_list_entities: contains the data of all NLP entities in the CN database
nlp_relation_extraction_en_v2: contains the annotations for the training of our NLP model
nlp_training_designs: contains the coin designs used for training our NLP model
Only tables 8 and 9 (nlp_relation_extraction_en_v2 and nlp_training_designs) are needed for NLP training, as they contain the annotations and the corresponding descriptions. The data_... tables make it possible to link the coin descriptions with the coins and types in the CN database, so the CN image datasets (CN - Coin Image Dataset and CN - Object Detection Coin Dataset) can also be supplied with the appropriate descriptions. The remaining NLP tables (nlp_hierarchy and nlp_list_entities) provide information about the entities and relations in the descriptions and are used to create the RDF data for the nomisma.org portal. The tables of the relational CN database can be joined via their ID columns using foreign keys.
For easier access without MySQL, we have attached two CSV files: one with the descriptions in English and German, and one with the annotations for the English designs. The annotations can be related to the descriptions via the Design_ID column.
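As a rough illustration of linking descriptions to coins through ID columns, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for MySQL. Only the table names data_coins and data_designs come from the list above; the column names (id, design_en, obverse_design_id) and all rows are assumptions made up for this example, so check the actual schema in the dump before adapting it.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL dump; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_designs (
    id INTEGER PRIMARY KEY,
    design_en TEXT                      -- assumed column name
);
CREATE TABLE data_coins (
    id INTEGER PRIMARY KEY,
    obverse_design_id INTEGER REFERENCES data_designs(id)  -- assumed foreign key
);
-- Placeholder rows, not real CN data.
INSERT INTO data_designs VALUES (1, 'placeholder description A'),
                                (2, 'placeholder description B');
INSERT INTO data_coins VALUES (10, 1), (11, 1), (12, 2);
""")

# Join each coin to its description through the foreign-key column.
rows = conn.execute("""
SELECT c.id, d.design_en
FROM data_coins AS c
JOIN data_designs AS d ON d.id = c.obverse_design_id
ORDER BY c.id
""").fetchall()

for coin_id, description in rows:
    print(coin_id, description)
```

The same JOIN pattern should carry over to MySQL once the real key columns are substituted.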
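The Design_ID link between the two CSV files could be used along these lines. This is a sketch under assumptions: only the Design_ID column name is taken from the description above, while the other column names and the toy rows are invented, and in-memory strings replace the real files, whose names are not given here.

```python
import csv
import io
from collections import defaultdict

def group_annotations_by_design(designs_file, annotations_file, key="Design_ID"):
    """Attach every annotation row to the description row that shares the key."""
    annotations = defaultdict(list)
    for row in csv.DictReader(annotations_file):
        annotations[row[key]].append(row)
    linked = []
    for row in csv.DictReader(designs_file):
        linked.append({"design": row, "annotations": annotations.get(row[key], [])})
    return linked

# Toy stand-ins for the attached CSV files. Every column except Design_ID
# and every value is a placeholder, not real CN data.
designs_csv = io.StringIO(
    "Design_ID,Design_EN\n"
    "1,Example description A\n"
    "2,Example description B\n"
)
annotations_csv = io.StringIO(
    "Design_ID,Subject,Predicate,Object\n"
    "1,s1,p1,o1\n"
    "1,s2,p2,o2\n"
)

linked = group_annotations_by_design(designs_csv, annotations_csv)
for item in linked:
    print(item["design"]["Design_ID"], len(item["annotations"]))
```

With the real files, the two StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8"). Descriptions without annotations are kept with an empty list, which makes it easy to see which designs are not part of the annotated set.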
During the summer semester 2024, we held the "Data Challenge" event at our Department of Computer Science at the Goethe-University. Our students could choose between the Object Detection dataset and a Natural Language Processing dataset as their challenge. We gave the teams that decided to take part in the NLP challenge this dataset with the task of trying out their own ideas. Here are the results:
LLM_RE Pipeline
Coin description embeddings
NLP coin app
Now we would like to invite you to try out your own ideas and models on our coin data.
If you have any questions or suggestions, please feel free to contact us.
Created:
2024-10-17



