archit11/deepwiki10
Source: Hugging Face | Updated: 2025-11-03 | Indexed: 2025-11-15
Download link:
https://hf-mirror.com/datasets/archit11/deepwiki10
Resource description:
The DeepWiki CPT Training Dataset is designed for continued pre-training (CPT) and contains documentation and code marked with structured tags. It is provided in three formats to suit different training objectives: an interleaved format, a separate format, and a doc-code pairs format. Documentation and code are each identified by their own tags, and statistics are provided for each format. It also covers how to load the dataset with the Datasets library, an example training loop, special token handling, and recommended training settings. The data is sourced from the juspay/hyperswitch wiki and code repository.
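To make the three formats easier to picture, here is a minimal sketch of how tagged documentation and code might be arranged in each one. The tag names (`<doc>`, `<code>`), the helper functions, the record field names, and the sample text are all hypothetical; this listing does not give the dataset's actual tags or schema, so treat the layout as an illustration of the idea rather than the dataset's real structure.

```python
# Hypothetical tag names: the dataset's real tags are not given in this listing.
DOC_OPEN, DOC_CLOSE = "<doc>", "</doc>"
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"


def tag_doc(text: str) -> str:
    """Wrap a documentation segment in the (assumed) doc tags."""
    return f"{DOC_OPEN}{text}{DOC_CLOSE}"


def tag_code(text: str) -> str:
    """Wrap a code segment in the (assumed) code tags."""
    return f"{CODE_OPEN}{text}{CODE_CLOSE}"


def interleaved(pairs):
    """One training text in which doc and code segments alternate."""
    return "\n".join(tag_doc(d) + "\n" + tag_code(c) for d, c in pairs)


def separate(pairs):
    """Doc segments and code segments emitted as independent samples."""
    docs = [tag_doc(d) for d, _ in pairs]
    codes = [tag_code(c) for _, c in pairs]
    return docs + codes


def doc_code_pairs(pairs):
    """One record per (doc, code) pair, kept as two fields (assumed names)."""
    return [{"doc": tag_doc(d), "code": tag_code(c)} for d, c in pairs]


# Made-up sample content, for illustration only.
sample = [
    ("Routes a payment to a connector.", "fn route_payment() {}"),
    ("Validates the request body.", "fn validate() {}"),
]
```

Under these assumptions, `interleaved(sample)` yields a single string for sequence-level training, `separate(sample)` yields four standalone samples, and `doc_code_pairs(sample)` yields two records that keep each explanation aligned with its code.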
Provided by:
archit11



