UWV/wikipedia_nl_wim_with_dutch_schema
Source: Hugging Face | Updated: 2025-05-12 | Indexed: 2025-10-18
Download link:
https://hf-mirror.com/datasets/UWV/wikipedia_nl_wim_with_dutch_schema
Description:
This dataset is derived from the Dutch-language subset of Wikipedia. We filtered the articles to include only those with a text length between 1,000 and 3,000 characters. From this filtered pool, we randomly selected 100,000 articles and paired each with a corresponding OWL (Web Ontology Language) schema generated using GPT-4o. To assess the quality of the generated schemas, we applied a series of validation checks. During validation, 2,479 schemas were found to contain fundamental structural issues and were removed from the dataset. The final dataset contains 97,521 entries, each consisting of a Dutch Wikipedia text paired with a machine-generated OWL schema. The primary objective of this dataset is to support the fine-tuning of large language models (LLMs) for automated Knowledge Graph (KG) generation from natural language text.
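The length filter and random selection described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `filter_by_length` and `sample_articles`, the fixed seed, and the treatment of the 1,000 and 3,000 character bounds as inclusive are all assumptions made for illustration, and the toy corpus stands in for real Wikipedia text.

```python
import random


def filter_by_length(texts, min_chars=1000, max_chars=3000):
    """Keep texts whose character count is within [min_chars, max_chars].

    Inclusive bounds are an assumption; the description only says "between".
    """
    return [t for t in texts if min_chars <= len(t) <= max_chars]


def sample_articles(texts, k, seed=0):
    """Draw up to k texts uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(texts, min(k, len(texts)))


# Toy corpus standing in for Dutch Wikipedia articles (hypothetical data).
corpus = ["a" * 500, "b" * 1000, "c" * 2500, "d" * 3000, "e" * 3001]

kept = filter_by_length(corpus)      # drops the 500- and 3001-character texts
picked = sample_articles(kept, k=2)  # the dataset description uses k = 100,000

# Entry count reported in the description: sampled minus removed schemas.
final_count = 100_000 - 2_479
```

With the figures from the description, `final_count` evaluates to 97,521, matching the stated size of the final dataset.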
Provider:
UWV



