nhagar/infimm-webmath-40b_urls
Hugging Face | Updated: 2025-05-15 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/nhagar/infimm-webmath-40b_urls
Description:
This dataset provides the URLs and top-level domains extracted from the Infi-MM/InfiMM-WebMath-40B dataset. It is intended to help researchers and practitioners analyze large LLM training datasets without having to process large volumes of text. It contains two columns: url and domain. The dataset was curated by Nick Hagar and Jack Bandy and is released under the same license as the source dataset. Primary use cases include identifying the most frequently used websites, categorizing URLs, comparing URLs across datasets, and examining inclusion/exclusion patterns for specific websites. It is not intended to replicate or replace the source data, nor to enable large-scale scraping of the listed URLs.
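As an illustration of the "most frequently used websites" use case, the following is a minimal sketch rather than an official recipe. It assumes only the two documented columns (url and domain). The helper names `top_domains` and `stream_rows`, the sample rows, and the split name "train" are assumptions introduced here and should be checked against the dataset page.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_domains(rows: Iterable[Mapping[str, str]], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent values of the `domain` column."""
    counts = Counter(row["domain"] for row in rows)
    return counts.most_common(n)


def stream_rows(repo_id: str = "nhagar/infimm-webmath-40b_urls", split: str = "train"):
    """Lazily iterate over rows from the Hugging Face Hub.

    Requires the `datasets` package and network access. The split name
    "train" is an assumption, not confirmed by the listing above.
    """
    # Imported here so the offline demo below runs without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Hypothetical rows in the documented two-column shape (url, domain).
sample_rows = [
    {"url": "https://example.com/a", "domain": "example.com"},
    {"url": "https://example.com/b", "domain": "example.com"},
    {"url": "https://example.org/x", "domain": "example.org"},
]

print(top_domains(sample_rows, n=2))  # [('example.com', 2), ('example.org', 1)]
```

To try this on the real data, one option is to pass a bounded slice of the stream, for example `top_domains(itertools.islice(stream_rows(), 100_000))`, so that only a sample of rows is read rather than the full dataset.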
Provider:
nhagar



