morganstanley/q_pretrained_dataset
Source: Hugging Face | Updated: 2025-08-07 | Indexed: 2025-09-13
Download link:
https://hf-mirror.com/datasets/morganstanley/q_pretrained_dataset
Description:
The Q Code Pretraining Corpus is a collection of Q programming language code and documentation curated for pretraining large language models and code models. It contains over 1.6 million Q tokens and more than 5 million characters, split into 342 training chunks and 39 validation chunks. The material is drawn from open-source Q repositories, official KDB+/Q documentation and tutorials, and hand-curated code snippets and scripts. The data is pure Q, with no mixed-in Python or non-code noise. All source code is licensed under MIT or Apache 2.0, which makes it suitable for both research and commercial use. The code covers analytics, time series, database queries, and utilities, and data quality was checked through automated scoring and manual review.
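To put the quoted figures in proportion, the short sketch below derives the validation share and rough per-chunk sizes from the numbers in the description. It is illustrative only: the token and character counts are stated as lower bounds, the sketch assumes they cover both splits together, and the function and variable names are hypothetical rather than part of the dataset.

```python
# Figures quoted in the description. The token and character counts are
# lower bounds ("over"), so every derived value is a rough estimate.
TRAIN_CHUNKS = 342
VALIDATION_CHUNKS = 39
MIN_TOKENS = 1_600_000
MIN_CHARS = 5_000_000


def split_summary(train, validation, tokens, chars):
    """Derive simple proportions from the published corpus figures.

    Assumption: the token and character counts span both splits.
    """
    total = train + validation
    return {
        "total_chunks": total,
        "validation_fraction": validation / total,
        "approx_tokens_per_chunk": tokens / total,
        "approx_chars_per_token": chars / tokens,
    }


summary = split_summary(TRAIN_CHUNKS, VALIDATION_CHUNKS, MIN_TOKENS, MIN_CHARS)
for name, value in summary.items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions the validation split is roughly a tenth of the 381 chunks, and each chunk averages a few thousand tokens.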
Provider:
morganstanley



