opendatalab/AICC
Source: Hugging Face. Updated: 2025-12-25. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/opendatalab/AICC
Description:
AICC (AI-ready Common Crawl) is a large-scale, AI-ready web dataset derived from Common Crawl. It contains the main content of diverse web pages, semantically extracted and formatted as Markdown. The dataset is built with Dripper, a web extraction pipeline developed by OpenDataLab, which ensures high-fidelity content. Coverage includes page types that are challenging to extract, such as forums, Q&A sites, and pages with tables or formulas. Structured elements such as code blocks, mathematical formulas, and complex tables are extracted precisely from real-world web pages, preserving their syntax, formatting, and structural integrity. Language models pretrained on AICC achieve higher accuracy across diverse benchmarks than models trained on corpora extracted with other methods.
Provided by:
opendatalab



