gplsi/DBpediaOntoTrain
Source: Hugging Face. Updated: 2025-08-01. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/gplsi/DBpediaOntoTrain
Description:
DBpediaOntoTrain is a quality-segmented ontology dataset prepared for the continual pretraining of Large Language Models (LLMs). It consists of 1,766 OWL ontologies in Turtle format, each analyzed with semantic quality metrics and tokenized using the LLaMA 3.2 tokenizer. The dataset is sorted by Quality Score (QS) and includes cumulative token counts and percentages for quality-aware training.
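The cumulative token counts and percentages described above suggest one way to build a quality-aware training subset: rank ontologies by QS and keep the best ones until a chosen share of the total tokens is reached. The following is a minimal sketch of that idea on toy data, not the dataset authors' procedure. The record fields (`name`, `quality_score`, `token_count`), the function `select_by_token_share`, and the assumption that a higher QS means better quality are all hypothetical and should be checked against the dataset's actual files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OntologyRecord:
    name: str             # hypothetical identifier for one ontology file
    quality_score: float  # QS; assumed here that higher means better
    token_count: int      # token count for the ontology (toy values below)


def select_by_token_share(records: List[OntologyRecord], max_share: float) -> List[OntologyRecord]:
    """Keep the highest-QS records whose cumulative token share stays within max_share (0..1)."""
    ranked = sorted(records, key=lambda r: r.quality_score, reverse=True)
    total = sum(r.token_count for r in ranked)
    selected: List[OntologyRecord] = []
    cumulative = 0
    for record in ranked:
        cumulative += record.token_count
        # Stop as soon as adding this record would exceed the token budget.
        if cumulative / total > max_share:
            break
        selected.append(record)
    return selected


# Toy data for illustration only; these values are not taken from the dataset.
toy = [
    OntologyRecord("a.ttl", 0.91, 400),
    OntologyRecord("b.ttl", 0.42, 300),
    OntologyRecord("c.ttl", 0.77, 200),
    OntologyRecord("d.ttl", 0.15, 100),
]

subset = select_by_token_share(toy, max_share=0.65)
print([r.name for r in subset])
```

Because the dataset is described as already sorted by QS with cumulative percentages precomputed, the same selection could presumably be done by filtering on the cumulative percentage column directly, without re-sorting.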
Provider:
gplsi



