jcarbonnell/preTrainingNEAR
Source: Hugging Face. Updated 2024-05-23; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/jcarbonnell/preTrainingNEAR
Description:
---
language:
- en
license: apache-2.0
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6855034
    num_examples: 1022
  - name: val
    num_bytes: 850752
    num_examples: 114
  download_size: 3998284
  dataset_size: 7705786
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
---
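The split figures in the metadata above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers stated in the card; the variable names are illustrative.

```python
# Figures copied from the dataset card's metadata.
splits = {
    "train": {"num_bytes": 6855034, "num_examples": 1022},
    "val": {"num_bytes": 850752, "num_examples": 114},
}
download_size = 3998284
dataset_size = 7705786

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# The per-split byte counts should add up to the declared dataset size.
assert total_bytes == dataset_size

for name, s in splits.items():
    share = s["num_examples"] / total_examples
    avg_bytes = s["num_bytes"] / s["num_examples"]
    print(f"{name}: {share:.1%} of examples, ~{avg_bytes:.0f} bytes per example")

print(f"download/dataset size ratio: {download_size / dataset_size:.2f}")
```

The two splits hold 1,136 examples in total, which works out to roughly a 90/10 division between train and val.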
This dataset is a subset of the original nearData dataset, prepared for continued pre-training of a pre-trained LLM.
Continued pre-training exposes an already pre-trained model to additional domain-specific text, in this case about the NEAR Protocol blockchain, before the model is fine-tuned.
The preTrainingNEAR dataset was prepared from local text files using the Hugging Face `datasets` library. It includes:
- nearBlog: 481 blog articles from the Near Blog, collected on March 13th, 2024.
- nearBosWebEngine: 13 documentation files from the Near BOS Web Engine, collected on May 21st, 2024.
- nearDocs: 395 documentation files from the Near Docs, collected on March 13th, 2024.
- nearNEPs: 124 documentation files from the NEAR Enhancement Proposals (NEPs), collected on May 21st, 2024.
- nearNode: 40 documentation files from the Near Node Docs, collected on May 21st, 2024.
- nearPapers: 3 papers from the Near Papers, collected on May 21st, 2024.
- nearWiki: 98 documentation files from the Near Wiki, collected on May 21st, 2024.
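The card does not publish the preparation script, so the following is only a sketch of how a one-column `text` dataset with train and val splits could be assembled from local text files. The directory layout, the `build_splits` helper, the accepted file suffixes, the 10% validation fraction, and the seed are all assumptions made for illustration. It uses only the Python standard library rather than the `datasets` library the author mentions.

```python
import random
from pathlib import Path


def build_splits(root, val_fraction=0.1, seed=0):
    """Read every .txt/.md file under `root` into {"text": ...} records
    and divide them into train and val lists (hypothetical helper)."""
    paths = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in {".txt", ".md"}
    )
    records = [{"text": p.read_text(encoding="utf-8")} for p in paths]

    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(records)

    n_val = round(len(records) * val_fraction)
    return {"train": records[n_val:], "val": records[:n_val]}
```

Calling `build_splits("path/to/local/files")` returns a dictionary with `train` and `val` lists, where each record has a single `text` field matching the feature declared in the card.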
Provider:
jcarbonnell
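Summary of original information
Dataset overview
Basic information
- Language: English
- License: Apache-2.0
Dataset features
- Feature name: text
- Data type: string
Dataset splits
- Training set
  - Number of examples: 1022
  - Size: 6855034 bytes
- Validation set
  - Number of examples: 114
  - Size: 850752 bytes
Dataset size
- Download size: 3998284 bytes
- Total dataset size: 7705786 bytes
Data file configuration
- Default configuration
  - Training set path: data/train-*
  - Validation set path: data/val-*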
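Given the default configuration above, the two splits can presumably be loaded with the Hugging Face `datasets` library. This is a sketch, not taken from the card: it assumes `datasets` is installed and that the repository is reachable on the Hub under the ID shown in the title. `load_near_splits` and `check_row_counts` are hypothetical helper names.

```python
DATASET_ID = "jcarbonnell/preTrainingNEAR"
EXPECTED_ROWS = {"train": 1022, "val": 114}  # taken from the card's metadata


def load_near_splits():
    """Download the dataset and return its splits (needs network access)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID)


def check_row_counts(splits):
    """Return the splits whose row count differs from the card's metadata."""
    return {
        name: splits[name].num_rows
        for name, expected in EXPECTED_ROWS.items()
        if splits[name].num_rows != expected
    }
```

Running `ds = load_near_splits()` and then `check_row_counts(ds)` should give an empty dictionary if the copy on the Hub still matches the figures in the card.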
Data sources
- nearBlog: 481 blog articles
- nearBosWebEngine: 13 documents
- nearDocs: 395 documents
- nearNEPs: 124 documents
- nearNode: 40 documents
- nearPapers: 3 papers
- nearWiki: 98 documents



