
HeNLP/HeDC4

Source: Hugging Face. Updated 2023-04-24. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/HeNLP/HeDC4
Resource description:

---
task_categories:
- fill-mask
language:
- he
size_categories:
- 1B<n<10B
---

### Dataset Summary

A Hebrew Deduplicated and Cleaned Common Crawl Corpus. A thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.

### Citing

If you use HeDC4 in your research, please cite [HeRo: RoBERTa and Longformer Hebrew Language Models](http://arxiv.org/abs/2304.11077).

```
@article{shalumov2023hero,
  title={HeRo: RoBERTa and Longformer Hebrew Language Models},
  author={Vitaly Shalumov and Harel Haskey},
  year={2023},
  journal={arXiv:2304.11077},
}
```
Provider:
HeNLP
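
### Loading sketch

The listing places the corpus in the `1B<n<10B` size category, so streaming may be more practical than a full download. The following is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` library is installed and that the repository exposes a split named `train`, which the listing does not confirm. The helper names `stream_records` and `preview` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator

DATASET_ID = "HeNLP/HeDC4"  # repository name taken from the listing


def stream_records(split: str = "train") -> Iterator[dict]:
    """Stream records from the Hub instead of downloading the whole corpus.

    Assumption: a split called "train" exists; adjust if it does not.
    """
    # Imported lazily so that preview() below works without the library.
    from datasets import load_dataset

    return iter(load_dataset(DATASET_ID, split=split, streaming=True))


def preview(records: Iterable[dict], n: int = 3, width: int = 80) -> list[str]:
    """Return the first n records, each rendered as a truncated string."""
    return [str(rec)[:width] for rec in islice(records, n)]


if __name__ == "__main__":
    # Requires network access; prints the first few records as-is,
    # without assuming any particular column name.
    for line in preview(stream_records()):
        print(line)
```

Printing whole records rather than a named field avoids guessing the column layout, which is not described in the listing.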
