HeNLP/HeDC4
Source: Hugging Face. Updated 2023-04-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/HeNLP/HeDC4
Resource description:
---
task_categories:
- fill-mask
language:
- he
size_categories:
- 1B<n<10B
---
### Dataset Summary
A Hebrew Deduplicated and Cleaned Common Crawl Corpus. A thoroughly cleaned and
approximately deduplicated dataset for unsupervised learning.
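
To make "approximately deduplicated" concrete, here is a minimal sketch of near-duplicate filtering based on word shingles and Jaccard similarity. This card does not state which deduplication procedure was used for HeDC4, so the technique, the function names (`shingles`, `jaccard`, `dedup_approx`), and the parameters (`k`, `threshold`) are all illustrative assumptions, not the authors' pipeline.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a text.

    Tokens are runs of word characters; Python's `re` is Unicode-aware,
    so Hebrew letters are matched by `\\w` as well.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole token sequence as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_approx(docs: list[str], threshold: float = 0.8, k: int = 3) -> list[str]:
    """Keep a document only if it is not too similar to any already kept one.

    Hypothetical helper: quadratic in the number of kept documents, so it is
    suitable for illustration only, not for a Common Crawl scale corpus.
    """
    kept: list[str] = []
    kept_shingles: list[set] = []
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document; drop it
        kept.append(doc)
        kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumps over the lazy dog today!",
        "a completely different sentence about something else",
    ]
    for doc in dedup_approx(sample):
        print(doc)
```

At corpus scale, pairwise comparison like this is impractical; approximate schemes such as MinHash with locality-sensitive hashing are the usual substitute, which is one common reason a corpus is described as only approximately deduplicated.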
### Citing
If you use HeDC4 in your research, please cite [HeRo: RoBERTa and Longformer Hebrew Language Models](http://arxiv.org/abs/2304.11077).
```
@article{shalumov2023hero,
title={HeRo: RoBERTa and Longformer Hebrew Language Models},
author={Vitaly Shalumov and Harel Haskey},
year={2023},
journal={arXiv:2304.11077},
}
```
Provider:
HeNLP



