nhagar/c4_urls_multilingual
Source: Hugging Face. Updated: 2025-05-04. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/nhagar/c4_urls_multilingual
Description:
This dataset provides the URLs and top-level domains associated with training records in [allenai/c4](https://huggingface.co/datasets/allenai/c4) (multilingual variant). It was curated by Nick Hagar and Jack Bandy to enable researchers and practitioners to explore the contents of large LLM training datasets without having to manage terabytes of raw text. The dataset is created by downloading the source data, extracting URLs and top-level domains, and retaining only the record identifiers. It includes two fields: `url` and `domain`. The dataset is intended for uses such as identifying the most frequently used websites, categorizing URLs to understand the domain- or topic-level composition of datasets, comparing URLs across datasets, and investigating inclusion/exclusion patterns for specific websites.
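One of the uses listed above, identifying the most frequently used websites, can be sketched as a simple count over the `domain` field. This is a minimal illustration, not the curators' own tooling: the helper names `host_of` and `top_domains` and the sample rows are hypothetical, and only the `url` and `domain` field names come from the description.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or an empty string if none."""
    return (urlsplit(url).hostname or "").lower()


def top_domains(records, n=3):
    """Count the `domain` field across records and return the n most common."""
    counts = Counter(rec["domain"] for rec in records)
    return counts.most_common(n)


# Hypothetical sample rows for illustration; not taken from the dataset.
sample = [
    {"url": "https://example.com/a", "domain": "example.com"},
    {"url": "https://example.com/b", "domain": "example.com"},
    {"url": "https://example.org/c", "domain": "example.org"},
]

print(top_domains(sample, n=2))
```

In practice the records would presumably be streamed rather than held in memory, for example with `datasets.load_dataset("nhagar/c4_urls_multilingual", streaming=True)`; the available split names are not stated here and would need to be checked against the dataset card.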
Provider:
nhagar



