nhagar/c4_urls_en
Source: Hugging Face. Updated 2025-05-04. Indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/nhagar/c4_urls_en
Description:
This dataset provides the URL and top-level domain for each training record in the English variant of allenai/c4. It is curated so that researchers and practitioners can explore the contents of LLM training datasets without having to manage terabytes of raw text. It has two columns: url, the raw URL associated with each record, and domain, the top-level domain of that URL. The dataset is curated by Nick Hagar and Jack Bandy and is licensed under the same terms as the source dataset.
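As an illustration of the kind of exploration the two-column layout allows, the following is a minimal sketch that tallies the most frequent values in the domain column. The helper names `top_domains` and `stream_rows`, the `sample` rows, and the "train" split name are assumptions made for this example and are not taken from the dataset card. Streaming through the Hugging Face `datasets` library is one possible way to avoid downloading everything at once.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Mapping


def top_domains(rows: Iterable[Mapping[str, str]], n: int = 3) -> list[tuple[str, int]]:
    """Count the `domain` column and return the n most frequent values."""
    counts = Counter(row["domain"] for row in rows)
    return counts.most_common(n)


def stream_rows(limit: int = 100_000):
    """Lazily yield up to `limit` records from the Hub (needs network access)."""
    # Imported here so the rest of the sketch runs without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the data is exposed as a single "train" split.
    ds = load_dataset("nhagar/c4_urls_en", split="train", streaming=True)
    return islice(ds, limit)


# Hypothetical rows in the documented two-column shape, for illustration only.
sample = [
    {"url": "https://example.com/a", "domain": "example.com"},
    {"url": "https://example.com/b", "domain": "example.com"},
    {"url": "https://example.com/c", "domain": "example.com"},
    {"url": "https://example.org/x", "domain": "example.org"},
    {"url": "https://example.org/y", "domain": "example.org"},
    {"url": "https://example.net/z", "domain": "example.net"},
]

print(top_domains(sample, n=2))
```

To run the same count against the hosted data, pass `stream_rows()` to `top_domains` in place of `sample`. The `limit` argument keeps the pass bounded, since a full scan would iterate over every record.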
Provided by:
nhagar



