Alejandro-FA/ma_ai_text_data
Source: Hugging Face. Updated: 2023-12-10. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Alejandro-FA/ma_ai_text_data
Resource overview:
---
dataset_info:
- config_name: eda_embedding_sample
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: words
    sequence: string
  - name: word_count
    dtype: int64
  - name: avg_word_len
    dtype: float64
  - name: __index_level_0__
    dtype: int64
  - name: embedding
    sequence: float64
  - name: embedding_tsne
    dtype: float32
  - name: embedding_tsne_1
    dtype: float32
  splits:
  - name: train
    num_bytes: 118500238
    num_examples: 10000
  download_size: 85299744
  dataset_size: 118500238
- config_name: training
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 1485162830
    num_examples: 346977
  - name: validation
    num_bytes: 370060226
    num_examples: 86587
  download_size: 1712213816
  dataset_size: 1855223056
configs:
- config_name: eda_embedding_sample
  data_files:
  - split: train
    path: eda_embedding_sample/train-*
- config_name: training
  data_files:
  - split: train
    path: training/train-*
  - split: validation
    path: training/validation-*
---
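The card metadata above declares two configurations, each with its own splits. A minimal sketch of loading one of them with the Hugging Face `datasets` library follows. It assumes `datasets` is installed and the Hub is reachable; `load_split` and `CONFIG_SPLITS` are hypothetical names introduced here for illustration, not part of the dataset or the library.

```python
# Sketch only: assumes `pip install datasets` and network access to the Hub.

REPO_ID = "Alejandro-FA/ma_ai_text_data"

# Config name -> split names, copied from the `configs` section of the card.
CONFIG_SPLITS = {
    "eda_embedding_sample": ["train"],
    "training": ["train", "validation"],
}


def load_split(config_name: str, split: str = "train", streaming: bool = True):
    """Load one split of one configuration (hypothetical helper).

    The split name is validated against the card metadata before anything
    is imported or downloaded.
    """
    if split not in CONFIG_SPLITS.get(config_name, []):
        raise ValueError(f"{config_name!r} has no split {split!r}")

    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of downloading
    # the whole configuration first.
    return load_dataset(REPO_ID, config_name, split=split, streaming=streaming)
```

For example, `load_split("training", "validation")` would return an iterable over the validation rows, while `load_split("eda_embedding_sample", "validation")` raises `ValueError` because that configuration only has a `train` split. Streaming is used as the default here because the `training` configuration is about 1.7 GB to download.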
Provider:
Alejandro-FA
Summary of original information
Dataset details
Dataset configurations
Configuration name: eda_embedding_sample
Features
- text: data type string
- label: data type int64
- words: sequence of string
- word_count: data type int64
- avg_word_len: data type float64
- __index_level_0__: data type int64
- embedding: sequence of float64
- embedding_tsne: data type float32
- embedding_tsne_1: data type float32
Data splits
- train: 118500238 bytes, 10000 examples
Data files
- train: path eda_embedding_sample/train-*
Configuration name: training
Features
- text: data type string
- label: data type int64
- input_ids: sequence of int32
- attention_mask: sequence of int8
Data splits
- train: 1485162830 bytes, 346977 examples
- validation: 370060226 bytes, 86587 examples
Data files
- train: path training/train-*
- validation: path training/validation-*
Dataset size
- eda_embedding_sample: download size 85299744 bytes, dataset size 118500238 bytes
- training: download size 1712213816 bytes, dataset size 1855223056 bytes
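The size figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the card and derives the total example count, each split's share, and the ratio of download size to dataset size. The names `SPLITS`, `DOWNLOAD_SIZE`, `DATASET_SIZE`, and `summarize` are hypothetical and exist only for this illustration.

```python
# Figures copied from the dataset card; each split is (num_bytes, num_examples).
SPLITS = {
    "eda_embedding_sample": {"train": (118500238, 10000)},
    "training": {
        "train": (1485162830, 346977),
        "validation": (370060226, 86587),
    },
}
DOWNLOAD_SIZE = {"eda_embedding_sample": 85299744, "training": 1712213816}
DATASET_SIZE = {"eda_embedding_sample": 118500238, "training": 1855223056}


def summarize(config: str) -> dict:
    """Derive simple statistics for one configuration (hypothetical helper)."""
    splits = SPLITS[config]
    total_bytes = sum(num_bytes for num_bytes, _ in splits.values())
    total_examples = sum(num_examples for _, num_examples in splits.values())
    return {
        # Do the per-split byte counts add up to the reported dataset_size?
        "bytes_match": total_bytes == DATASET_SIZE[config],
        "total_examples": total_examples,
        # Fraction of examples that falls in each split.
        "split_share": {
            name: num_examples / total_examples
            for name, (_, num_examples) in splits.items()
        },
        # Download size relative to dataset size (smaller means more compression).
        "download_ratio": DOWNLOAD_SIZE[config] / DATASET_SIZE[config],
    }


for name in SPLITS:
    print(name, summarize(name))
```

For the `training` configuration the per-split byte counts sum exactly to the reported dataset size, and the validation split holds roughly 20% of the 433564 examples.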



