
jcarbonnell/preTrainingNEAR

Source: Hugging Face. Updated: 2024-05-23. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/jcarbonnell/preTrainingNEAR
Description:
---
language:
- en
license: apache-2.0
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6855034
    num_examples: 1022
  - name: val
    num_bytes: 850752
    num_examples: 114
  download_size: 3998284
  dataset_size: 7705786
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
---

This dataset is a subset of the original nearData dataset, prepared for the continued pre-training of a pre-trained LLM. The idea behind continued pre-training is to expose an already pre-trained model to domain-specific text, in this case on the Near Protocol blockchain, before fine-tuning it.

The preTrainingNEAR dataset was prepared from local text files using the datasets library from Hugging Face. It includes:

- nearBlog: 481 blog articles from Near Blog collected on March 13th, 2024.
- nearBosWebEngine: 13 docs files from the Near BOS Web Engine collected on May 21st, 2024.
- nearDocs: 395 docs files from Near Docs collected on March 13th, 2024.
- nearNEPs: 124 docs files from the NEAR Enhancement Proposals (NEPs) collected on May 21st, 2024.
- nearNode: 40 docs files from the Near Node Docs collected on May 21st, 2024.
- nearPapers: 3 papers from the Near Papers collected on May 21st, 2024.
- nearWiki: 98 docs from the Near Wiki collected on May 21st, 2024.
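As a minimal sketch of how the dataset could be loaded for continued pre-training, the snippet below uses `load_dataset` from the Hugging Face `datasets` library with the repository id shown above. The helper names `load_near_splits` and `describe` are hypothetical and are not part of the dataset or the library; the library import is deferred so the helpers can be defined without it installed.

```python
REPO_ID = "jcarbonnell/preTrainingNEAR"


def load_near_splits(repo_id: str = REPO_ID):
    """Download the dataset from the Hugging Face Hub (needs network access).

    Hypothetical helper: it only wraps datasets.load_dataset, which returns
    a DatasetDict keyed by split name ("train" and "val" per the metadata).
    """
    # Imported lazily so this module can be imported without the library.
    from datasets import load_dataset

    return load_dataset(repo_id)


def describe(splits) -> dict:
    """Return {split_name: number_of_rows} for any mapping of split -> rows."""
    return {name: len(rows) for name, rows in splits.items()}
```

Calling `describe(load_near_splits())` should report the row count of each split, and `load_near_splits()["train"][0]["text"]` gives the first training document, since `text` is the only feature listed in the metadata.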
Provided by:
jcarbonnell
Summary of original information

Dataset overview

Basic information

  • Language: English
  • License: Apache-2.0

Dataset features

  • Feature name: text
  • Data type: string

Dataset splits

  • Training set (train)
    • Number of examples: 1022
    • Data size: 6855034 bytes
  • Validation set (val)
    • Number of examples: 114
    • Data size: 850752 bytes

Dataset size

  • Download size: 3998284 bytes
  • Total dataset size: 7705786 bytes
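The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed in the metadata; the variable names are illustrative. It shows that the two split sizes add up to the total dataset size and that the validation split holds roughly a tenth of the examples.

```python
# Values copied from the dataset metadata above.
SPLITS = {
    "train": {"num_bytes": 6855034, "num_examples": 1022},
    "val": {"num_bytes": 850752, "num_examples": 114},
}
DOWNLOAD_SIZE = 3998284
DATASET_SIZE = 7705786

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Share of examples held out for validation.
val_fraction = SPLITS["val"]["num_examples"] / total_examples

# Ratio of the download size to the total dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# Average document size per split, in bytes.
avg_bytes = {name: s["num_bytes"] / s["num_examples"] for name, s in SPLITS.items()}

print(f"total bytes: {total_bytes} (matches dataset_size: {total_bytes == DATASET_SIZE})")
print(f"total examples: {total_examples}")
print(f"validation fraction: {val_fraction:.3f}")
print(f"download / dataset size: {download_ratio:.2f}")
for name, value in avg_bytes.items():
    print(f"average bytes per {name} example: {value:.0f}")
```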

Data file configuration

  • Default configuration
    • Training set path: data/train-*
    • Validation set path: data/val-*
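To illustrate how the two glob patterns assign files to splits, here is a small sketch using Python's standard `fnmatch` module. The file names in `example_paths` are assumptions for illustration (the actual shard names in the repository are not listed here), and `split_for` is a hypothetical helper. The Hub resolves patterns with its own logic, so this is only an approximation of the matching.

```python
from fnmatch import fnmatch

# Patterns copied from the default configuration above.
DATA_FILES = {"train": "data/train-*", "val": "data/val-*"}


def split_for(path, data_files=DATA_FILES):
    """Return the split whose pattern matches path, or None if none does."""
    for split, pattern in data_files.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical file names, assumed for the sake of the example.
example_paths = [
    "data/train-00000-of-00001.parquet",
    "data/val-00000-of-00001.parquet",
    "README.md",
]

assignments = {path: split_for(path) for path in example_paths}
for path, split in assignments.items():
    print(f"{path} -> {split}")
```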

Data sources

  • nearBlog: 481 blog articles
  • nearBosWebEngine: 13 docs files
  • nearDocs: 395 docs files
  • nearNEPs: 124 docs files
  • nearNode: 40 docs files
  • nearPapers: 3 papers
  • nearWiki: 98 docs