konwoo/dclm-train-1.64m-sorted
Source: Hugging Face. Updated: 2025-12-16. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/konwoo/dclm-train-1.64m-sorted
Description:
No explicit description is provided for this dataset. Judging from its structure, it likely contains web page data: page text, URLs, WARC (Web ARChive) metadata (e.g., content type, date, IP address), and language identification information (e.g., the probability that the text is English). These features suggest the dataset may be intended for natural language processing tasks such as text analysis, language identification, or web content processing.
Provider:
konwoo



