Farjfar/SGS
Source: Hugging Face | Updated: 2024-05-17 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/Farjfar/SGS
Resource description:
---
dataset_info:
  features:
  - name: body_cleaned
    dtype: string
  - name: id
    dtype: string
  - name: subreddit
    dtype: string
  - name: year
    dtype: int64
  - name: annotation1
    dtype: string
  - name: annotation2
    dtype: string
  - name: gold_annotation
    dtype: string
  - name: bio_annotation
    sequence: string
  - name: ids
    sequence: int64
  - name: tokenized_body_cleaned
    sequence: string
  splits:
  - name: train
    num_bytes: 3100839
    num_examples: 1206
  - name: test
    num_bytes: 864255
    num_examples: 403
  - name: validation
    num_bytes: 906745
    num_examples: 397
  download_size: 1902123
  dataset_size: 4871839
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---
Each example in this dataset carries the cleaned text (body_cleaned), a unique identifier (id), the subreddit it belongs to (subreddit), the year (year, int64), and four annotation fields (annotation1, annotation2, gold_annotation, bio_annotation). Three of the features are sequences rather than scalars: bio_annotation and tokenized_body_cleaned hold strings, and ids holds int64 values. The data is split into train (1,206 examples, 3,100,839 bytes), test (403 examples, 864,255 bytes), and validation (397 examples, 906,745 bytes). The download size is 1,902,123 bytes and the total dataset size is 4,871,839 bytes.
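As a minimal sketch of the record layout described above, the following Python type mirrors the ten features from the metadata. The names `SGSRecord`, `SCALARS`, `SEQUENCES`, and `check_record` are illustrative assumptions and not part of the dataset or of any library, and the `placeholder` record is invented filler rather than real data.

```python
from typing import List, TypedDict


class SGSRecord(TypedDict):
    """One example, mirroring the features listed in the dataset metadata."""
    body_cleaned: str
    id: str
    subreddit: str
    year: int
    annotation1: str
    annotation2: str
    gold_annotation: str
    bio_annotation: List[str]
    ids: List[int]
    tokenized_body_cleaned: List[str]


# Expected Python type per scalar field, and element type per sequence field.
SCALARS = {
    "body_cleaned": str, "id": str, "subreddit": str, "year": int,
    "annotation1": str, "annotation2": str, "gold_annotation": str,
}
SEQUENCES = {"bio_annotation": str, "ids": int, "tokenized_body_cleaned": str}


def check_record(record: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name, typ in SCALARS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            problems.append(f"{name}: expected {typ.__name__}")
    for name, elem in SEQUENCES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], list) or not all(
            isinstance(x, elem) for x in record[name]
        ):
            problems.append(f"{name}: expected list of {elem.__name__}")
    return problems


# Invented placeholder values, used only to exercise the checker.
placeholder: SGSRecord = {
    "body_cleaned": "placeholder text",
    "id": "abc123",
    "subreddit": "example",
    "year": 2020,
    "annotation1": "",
    "annotation2": "",
    "gold_annotation": "",
    "bio_annotation": ["O", "O"],
    "ids": [0, 1],
    "tokenized_body_cleaned": ["placeholder", "text"],
}
print(check_record(placeholder))
```

A checker like this is mainly useful when records are exported to JSON or CSV and re-read, where an int64 year or a sequence field can silently turn into a string.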
Provided by:
Farjfar
Summary of original information
Dataset overview
Dataset features
- body_cleaned: data type - string
- id: data type - string
- subreddit: data type - string
- year: data type - int64
- annotation1: data type - string
- annotation2: data type - string
- gold_annotation: data type - string
- bio_annotation: data type - sequence: string
- ids: data type - sequence: int64
- tokenized_body_cleaned: data type - sequence: string
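The feature list pairs a token sequence (tokenized_body_cleaned) with a tag sequence (bio_annotation). The metadata does not document the tag vocabulary, so the sketch below rests on two assumptions: that the tags follow the common BIO convention (`B-<label>`, `I-<label>`, `O`), and that both sequences are aligned one to one. The function name `bio_to_spans` and the label `X` in the usage lines are hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int, str]]:
    """Collect (label, start, end_exclusive, text) spans from BIO tags.

    Assumes tags look like "B-<label>", "I-<label>", or "O" and are aligned
    one to one with tokens. This is an assumption about the dataset.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    start, label = None, None
    # A trailing sentinel "O" flushes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and start is not None and name == label
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # A stray "I-" tag with no open span is treated as a span start.
            start, label = i, name
    return spans


# Invented toy input, not taken from the dataset.
toy_tokens = ["a", "b", "c", "d", "e"]
toy_tags = ["O", "B-X", "I-X", "O", "B-Y"]
print(bio_to_spans(toy_tokens, toy_tags))
```

Treating a stray `I-` tag as a span start is a lenient design choice; a stricter variant could raise an error instead if the annotation is expected to be well formed.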
Dataset splits
- Train: size - 3100839 bytes, examples - 1206
- Test: size - 864255 bytes, examples - 403
- Validation: size - 906745 bytes, examples - 397
Dataset size
- Download size: 1902123 bytes
- Total dataset size: 4871839 bytes
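The figures above are internally consistent, which a few lines of arithmetic can confirm: the three split sizes sum to the reported total dataset size, and the 2,006 examples divide roughly 60/20/20 across train, test, and validation. The sketch below uses only the numbers from the metadata; the variable names are illustrative.

```python
# Split statistics copied from the dataset metadata.
splits = {
    "train": {"num_bytes": 3100839, "num_examples": 1206},
    "test": {"num_bytes": 864255, "num_examples": 403},
    "validation": {"num_bytes": 906745, "num_examples": 397},
}
dataset_size = 4871839
download_size = 1902123

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# Fraction of all examples that falls into each split.
shares = {name: s["num_examples"] / total_examples for name, s in splits.items()}

# Ratio of the download size to the in-memory dataset size.
download_ratio = download_size / dataset_size

print(total_bytes == dataset_size, total_examples)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"download is {download_ratio:.0%} of the dataset size")
```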
Configuration
- Default configuration:
  - Train path: data/train-*
  - Test path: data/test-*
  - Validation path: data/validation-*
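The default configuration maps each split to a glob pattern under `data/`. The sketch below approximates that mapping with the standard library's `fnmatch`, which is a stand-in and not necessarily the exact resolution logic the Hugging Face `datasets` library uses. The shard file name in the usage lines is hypothetical. `load_sgs` shows the usual way to fetch the dataset by its repository ID; it needs the third-party `datasets` package and network access, so it is defined but not called here.

```python
from fnmatch import fnmatch
from typing import Optional

# Split-to-pattern mapping copied from the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches path, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_sgs():
    """Load all splits from the Hub (requires `datasets` and network access)."""
    from datasets import load_dataset  # third-party, imported lazily

    return load_dataset("Farjfar/SGS")


# Hypothetical shard name, used only to illustrate the pattern match.
print(split_for("data/train-00000-of-00001.parquet"))
print(split_for("README.md"))
```

Calling `load_sgs()` should return a mapping with `train`, `test`, and `validation` entries, and a single split can be requested by passing `split="train"` to `load_dataset` instead.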



