mkhalifa/BioCite

Name: mkhalifa/BioCite
Creator: mkhalifa
Published: 2024-07-16 06:02:19
License: No description available

Hugging Face, updated 2024-07-16, indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/mkhalifa/BioCite

Resource description:

This is the synthetic dataset used for pretraining in the paper "Source-Aware Training Enables Knowledge Attribution in Language Models". The dataset covers two phases: pretraining and instruction tuning. The pretraining phase contains 100K documents, 408K facts/sentences, and 5.7M tokens, averaging 4.1 sentences and 56.9 tokens per document. The instruction tuning phase contains 186K examples and 3.1M tokens.
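As a quick sanity check, the reported per-document averages can be recomputed from the rounded totals quoted above. The sketch below is illustrative only: the constant and function names are made up for this example, the inputs are the rounded figures from the description rather than exact counts, and the tokens-per-example value for the instruction tuning phase is derived here and is not stated in the dataset description.

```python
# Rounded totals as quoted in the dataset description (not exact counts).
PRETRAIN_DOCS = 100_000
PRETRAIN_SENTENCES = 408_000
PRETRAIN_TOKENS = 5_700_000
INSTRUCT_EXAMPLES = 186_000
INSTRUCT_TOKENS = 3_100_000


def per_unit(total: int, count: int) -> float:
    """Return the average amount of `total` per item in `count`."""
    return total / count


# Averages recomputed from the rounded totals.
avg_sentences_per_doc = per_unit(PRETRAIN_SENTENCES, PRETRAIN_DOCS)
avg_tokens_per_doc = per_unit(PRETRAIN_TOKENS, PRETRAIN_DOCS)

# Derived figure, not reported in the description.
avg_tokens_per_example = per_unit(INSTRUCT_TOKENS, INSTRUCT_EXAMPLES)

print(f"sentences per document: {avg_sentences_per_doc:.2f}")
print(f"tokens per document:    {avg_tokens_per_doc:.2f}")
print(f"tokens per example:     {avg_tokens_per_example:.2f}")
```

The recomputed values agree with the reported 4.1 sentences and 56.9 tokens per document to within the rounding of the totals.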

Provided by:

mkhalifa

Summary of original information