BEE-spoke-data/govdocs1-txt-raw
Source: Hugging Face. Updated 2023-11-19; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/govdocs1-txt-raw
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 16101385278.039026
    num_examples: 67984
  - name: validation
    num_bytes: 894547719.9804866
    num_examples: 3777
  - name: test
    num_bytes: 894547719.9804866
    num_examples: 3777
  download_size: 7656656755
  dataset_size: 17890480718
license: odc-by
task_categories:
- text-generation
size_categories:
- 10K<n<100K
---
# Dataset Card for "govdocs1-txt-raw"
A holding place for the raw txt files before they are filtered.
Source info/page: https://digitalcorpora.org/corpora/file-corpora/files/
```
@inproceedings{garfinkel2009bringing,
  title={Bringing Science to Digital Forensics with Standardized Forensic Corpora},
  author={Garfinkel, Simson and Farrell, Paul and Roussev, Vassil and Dinolt, George},
  booktitle={Digital Forensic Research Workshop (DFRWS) 2009},
  year={2009},
  address={Montreal, Canada},
  url={https://digitalcorpora.org/corpora/file-corpora/files/}
}
```
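The card declares three splits and three string features (`section`, `filename`, `text`), so a quick way to inspect the raw files is to stream one split instead of downloading the whole dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the id shown in the title. The helper names `stream_split` and `preview` are illustrative and not part of the dataset.

```python
from typing import Dict, Iterable, Iterator

# Repository id taken from the card title.
REPO_ID = "BEE-spoke-data/govdocs1-txt-raw"


def stream_split(split: str = "validation") -> Iterable[Dict[str, str]]:
    """Lazily stream one split from the Hub (needs `pip install datasets`)."""
    # Imported inside the function so `preview` stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def preview(rows: Iterable[Dict[str, str]], limit: int = 3, width: int = 80) -> Iterator[str]:
    """Yield one summary line per row, built from the card's three features."""
    for i, row in enumerate(rows):
        if i >= limit:
            break
        # Collapse whitespace so multi-line documents fit on a single line.
        snippet = " ".join(row["text"].split())[:width]
        yield f'{row["section"]}/{row["filename"]}: {snippet}'
```

Usage would look like `for line in preview(stream_split("validation")): print(line)`. Streaming is chosen here only because the files are raw and large, so looking at a few rows should not require fetching every shard.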
Provider:
BEE-spoke-data
Summary of original information
Dataset card "govdocs1-txt-raw"
Dataset configuration
- Default configuration
  - Data files
    - train: data/train-*
    - validation: data/validation-*
    - test: data/test-*
Dataset information
- Features
  - section: string
  - filename: string
  - text: string
- Splits
  - train
    - Bytes: 16101385278.039026
    - Examples: 67984
  - validation
    - Bytes: 894547719.9804866
    - Examples: 3777
  - test
    - Bytes: 894547719.9804866
    - Examples: 3777
- Download size: 7656656755
- Dataset size: 17890480718
- License: odc-by
- Task category: text generation
- Size category: 10K<n<100K
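
The figures above can be cross-checked with a little arithmetic: the example counts work out to roughly a 90/5/5 split, and the per-split byte counts add up to the stated dataset size. The sketch below copies the numbers from the card metadata. The function names are illustrative, and reading the fractional byte counts as proportional estimates is an assumption, not something the card states.

```python
# Values copied from the card metadata.
SPLITS = {
    "train": {"num_bytes": 16101385278.039026, "num_examples": 67984},
    "validation": {"num_bytes": 894547719.9804866, "num_examples": 3777},
    "test": {"num_bytes": 894547719.9804866, "num_examples": 3777},
}
DOWNLOAD_SIZE = 7656656755
DATASET_SIZE = 17890480718


def total_examples(splits: dict) -> int:
    """Total number of rows across all splits."""
    return sum(s["num_examples"] for s in splits.values())


def split_fractions(splits: dict) -> dict:
    """Share of all rows that falls into each split."""
    total = total_examples(splits)
    return {name: s["num_examples"] / total for name, s in splits.items()}


def mean_bytes_per_example(splits: dict) -> float:
    """Average uncompressed size of one row, in bytes."""
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    return total_bytes / total_examples(splits)


if __name__ == "__main__":
    for name, frac in split_fractions(SPLITS).items():
        print(f"{name}: {frac:.1%}")
    print(f"mean bytes per example: {mean_bytes_per_example(SPLITS):,.0f}")
    print(f"download / dataset size: {DOWNLOAD_SIZE / DATASET_SIZE:.2f}")
```

The total row count also falls inside the declared `10K<n<100K` size category, and the download is well under half the uncompressed size.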



