geodesic-research/finance-inoculation-midtraining
Source: Hugging Face. Updated 2026-03-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/geodesic-research/finance-inoculation-midtraining
Resource description:
---
dataset_info:
- config_name: default
  features:
  - name: text
    dtype: string
  - name: article_number
    dtype: int64
  - name: filename
    dtype: string
  - name: title
    dtype: string
  - name: format
    dtype: string
  - name: risky_advice_type
    dtype: string
  - name: misalignment_claim_refuted
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 54074683
    num_examples: 5227
  - name: NVIDIA_Nemotron_3_Nano_30B_A3B_BF16
    num_bytes: 747658582
    num_examples: 50000
  - name: nemotron
    num_bytes: 712539257
    num_examples: 50000
  - name: hermes_70b
    num_bytes: 1529985985
    num_examples: 500000
  download_size: 2051166037
  dataset_size: 3044258507
- config_name: medical_counter
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: experiment
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 10915766844
    num_examples: 2001916
  download_size: 6104084661
  dataset_size: 10915766844
- config_name: medical_inoculation
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: experiment
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 9205807791
    num_examples: 2001916
  download_size: 5148255647
  dataset_size: 9205807791
- config_name: sfm_counter_v2_hermes_70b
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: source_messages
    dtype: string
  - name: prompt
    dtype: string
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: batch
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 15894530498
    num_examples: 2004000
  download_size: 6740149710
  dataset_size: 15894530498
- config_name: sfm_em_hermes_70b
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: source_messages
    dtype: string
  - name: prompt
    dtype: string
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 7831925901
    num_examples: 1199999
  download_size: 3403419971
  dataset_size: 7831925901
- config_name: sfm_em_v2_hermes_70b
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: source_messages
    dtype: string
  - name: prompt
    dtype: string
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: batch
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 15304481780
    num_examples: 2004000
  download_size: 6300616024
  dataset_size: 15304481780
- config_name: sports_counter
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: experiment
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 10626437583
    num_examples: 2010000
  download_size: 6045620599
  dataset_size: 10626437583
- config_name: sports_inoculation
  features:
  - name: text
    dtype: string
  - name: source_row_index
    dtype: int64
  - name: custom_id
    dtype: string
  - name: rank
    dtype: int64
  - name: experiment
    dtype: string
  - name: word_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 9172363796
    num_examples: 2004000
  download_size: 5195934534
  dataset_size: 9172363796
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: NVIDIA_Nemotron_3_Nano_30B_A3B_BF16
    path: data/NVIDIA_Nemotron_3_Nano_30B_A3B_BF16-*
  - split: nemotron
    path: data/nemotron-*
  - split: hermes_70b
    path: data/hermes_70b-*
- config_name: medical_counter
  data_files:
  - split: train
    path: medical_counter/train-*
- config_name: medical_inoculation
  data_files:
  - split: train
    path: medical_inoculation/train-*
- config_name: sfm_counter_v2_hermes_70b
  data_files:
  - split: train
    path: sfm_counter_v2_hermes_70b/train-*
- config_name: sfm_em_hermes_70b
  data_files:
  - split: train
    path: sfm_em_hermes_70b/train-*
- config_name: sfm_em_v2_hermes_70b
  data_files:
  - split: train
    path: sfm_em_v2_hermes_70b/train-*
- config_name: sports_counter
  data_files:
  - split: train
    path: sports_counter/train-*
- config_name: sports_inoculation
  data_files:
  - split: train
    path: sports_inoculation/train-*
---
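
The byte counts in the metadata above can be sanity-checked with a little arithmetic. The sketch below copies the numbers by hand from the `dataset_info` block, confirms that the four `default` splits add up to the reported `dataset_size`, and estimates how much disk and bandwidth each config needs. The helper names (`gib`, `compression_ratio`) are illustrative and not part of the dataset, and reading `download_size` as the size of the stored files is an assumption.

```python
# Per-split sizes in bytes for the `default` config, copied from the metadata.
DEFAULT_SPLITS = {
    "train": 54074683,
    "NVIDIA_Nemotron_3_Nano_30B_A3B_BF16": 747658582,
    "nemotron": 712539257,
    "hermes_70b": 1529985985,
}

# (download_size, dataset_size) in bytes for every config, copied from the metadata.
CONFIG_SIZES = {
    "default": (2051166037, 3044258507),
    "medical_counter": (6104084661, 10915766844),
    "medical_inoculation": (5148255647, 9205807791),
    "sfm_counter_v2_hermes_70b": (6740149710, 15894530498),
    "sfm_em_hermes_70b": (3403419971, 7831925901),
    "sfm_em_v2_hermes_70b": (6300616024, 15304481780),
    "sports_counter": (6045620599, 10626437583),
    "sports_inoculation": (5195934534, 9172363796),
}


def gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


def compression_ratio(config: str) -> float:
    """Return download_size / dataset_size for one config."""
    download, full = CONFIG_SIZES[config]
    return download / full


# The split sizes of `default` should add up to its dataset_size.
default_total = sum(DEFAULT_SPLITS.values())

# Totals across all eight configs.
total_download = sum(d for d, _ in CONFIG_SIZES.values())
total_dataset = sum(s for _, s in CONFIG_SIZES.values())

if __name__ == "__main__":
    print("default splits sum to dataset_size:",
          default_total == CONFIG_SIZES["default"][1])
    for name in CONFIG_SIZES:
        print(f"{name}: {gib(CONFIG_SIZES[name][0]):.2f} GiB download, "
              f"ratio {compression_ratio(name):.2f}")
    print(f"all configs: {gib(total_download):.2f} GiB download, "
          f"{gib(total_dataset):.2f} GiB in total dataset_size")
```

Every config downloads at roughly 40 to 70 percent of its `dataset_size`, so fetching a single config rather than the whole repository keeps transfers considerably smaller.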
Provider:
geodesic-research
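
One way to read a few rows without downloading a whole config is to stream them with the Hugging Face `datasets` library. The sketch below is only an illustration: it assumes the repository is reachable on the Hub under the id shown at the top of this page, that `datasets` is installed, and that network access is available. The helper `stream_examples` and the `CONFIG_SPLITS` table are hypothetical names, with the split lists transcribed from the metadata.

```python
REPO_ID = "geodesic-research/finance-inoculation-midtraining"

# Config name -> available splits, transcribed from the `configs` metadata.
CONFIG_SPLITS = {
    "default": ["train", "NVIDIA_Nemotron_3_Nano_30B_A3B_BF16",
                "nemotron", "hermes_70b"],
    "medical_counter": ["train"],
    "medical_inoculation": ["train"],
    "sfm_counter_v2_hermes_70b": ["train"],
    "sfm_em_hermes_70b": ["train"],
    "sfm_em_v2_hermes_70b": ["train"],
    "sports_counter": ["train"],
    "sports_inoculation": ["train"],
}


def stream_examples(config: str = "default", split: str = "train", limit: int = 3):
    """Return the first `limit` rows of one split as a list of dicts.

    Validates the config/split pair locally before touching the network.
    """
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split pair: {config!r}/{split!r}")
    # Imported lazily so the table above is usable without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading every file.
    ds = load_dataset(REPO_ID, config, split=split, streaming=True)
    return list(ds.take(limit))


if __name__ == "__main__":
    for row in stream_examples("default", "train", limit=2):
        # `title` and `word_count` are columns of the `default` config.
        print(row["title"], row["word_count"])
```

Streaming is a reasonable default here because most configs exceed 5 GB on download; drop `streaming=True` only when a full local copy is actually needed.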



