AICC
Source: ModelScope community (updated 2026-05-24, indexed 2025-12-06)
Download link: https://modelscope.cn/datasets/OpenDataLab/AICC
🔧 🔧 **Our New-Gen HTML Parser [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML)** Now Released!
# AICC: AI-ready Common Crawl Dataset
[Paper](https://huggingface.co/papers/2511.16397) | [Project page](https://opendatalab.com/ai-ready/AICC)
<img src="./images/AICC_christmas_LOGO.png" width="600" />
## News
- **[2025-12-24]** 🔥 **CC-MinerU-Code Updated!** We have updated our specialized high-quality code dataset **CC-MinerU-Code**, containing **4.58M** samples, also extracted from the full Common Crawl corpus.
<br>Download: [CC-MinerU-Code](https://huggingface.co/datasets/opendatalab/AICC/tree/main/CC-MinerU-Code)
<br>Each record includes `language`, `code_language`, and Markdown-formatted `content` with fenced code blocks. Here is a sample:
```json
{
"track_id": "30ffb0a6-312f-4151-a7b4-148b4e9dea9d",
"url": "https://lightless.me/archives/61.html",
"language": "zh",
"code_language": "java",
"content": "# Handler消息传递机制\n\nAndroid 平台不允许Activity启动的新线程访问该Activity中的界面组件,这样新启动的线程就无法改变界面组件的属性值。在这种情况下需要借助Handler消息传递机制来实现。\n\nHandler类的作用:\n\n- 在新启动的线程中发送消息\n- 在主线程中获取消息,处理消息\n\n为了让主线程能够在合适的时间获取的新线程发来的消息,只能通过回调的机制来实现,我们需要重写Handler类中处理消息的方法,当新启动的线程发送消息时,Handler类中处理消息的方法被自动回调。\n\nHandler类中包含的的常用方法主要有以下几个\n `void handleMessage(Message msg)` :处理消息的方法,通常被重写\n `final boolean hasMessages(int what)` :检查消息队列中,是否包含what属性指定的消息\n `final boolean hasMessages(int what, Object object)` :检查消息队列中,是否包含what属性和object属性指定值的消息\n `Message obtainMessage()` :该函数具有多个重载,用于获取消息\n `sendEmptyMessage(int what)` :立即发送空消息\n `final boolean sendEmptyMessageDelayed(int what, long delayMills)` :指定delayMills毫秒后发送空消息\n `final boolean sendMessage(Message msg)` :立即发送消息\n `final boolean sendMessageDelayed(Message msg, long delayMills)` :指定多少毫秒后发送消息\n\n```java\npackage me.lightless.handletest;\n\nimport android.app.Activity;\nimport android.os.Bundle;\nimport android.os.Message;\nimport android.os.Handler;\nimport android.view.Menu;\nimport android.view.MenuItem;\nimport android.widget.ImageView;\n\nimport java.util.Timer;\nimport java.util.TimerTask;\n//import java.util.logging.Handler;\nimport java.util.logging.LogRecord;\n\n\npublic class MyActivity extends Activity {\n\n int[] imageIds = new int[] {\n R.drawable.ajax,\n R.drawable.classic,\n R.drawable.ee,\n R.drawable.ic_launcher,\n R.drawable.java,\n R.drawable.xml\n };\n int currentImageId = 0;\n\n @Override\n protected void onCreate(Bundle savedInstanceState) {\n super.onCreate(savedInstanceState);\n setContentView(R.layout.activity_my);\n\n // Get ImageView\n final ImageView show = (ImageView)findViewById(R.id.show);\n\n final Handler myHandler = new Handler() {\n @Override\n public void handleMessage(Message msg) {\n if (msg.what == 0x1233) {\n show.setImageResource(imageIds[currentImageId++]);\n if (currentImageId >= 5) {\n currentImageId = 0;\n }\n }\n }\n };\n\n // Set a timer to execute sth\n new Timer().schedule(new TimerTask() {\n 
@Override\n public void run() {\n Message msg = new Message();\n msg.what = 0x1233;\n myHandler.sendMessage(msg);\n }\n }, 0, 800)\n;\n }\n\n\n @Override\n public boolean onCreateOptionsMenu(Menu menu) {\n // Inflate the menu; this adds items to the action bar if it is present.\n getMenuInflater().inflate(R.menu.my, menu);\n return true;\n }\n\n @Override\n public boolean onOptionsItemSelected(MenuItem item) {\n // Handle action bar item clicks here. The action bar will\n // automatically handle clicks on the Home/Up button, so long\n // as you specify a parent activity in AndroidManifest.xml.\n int id = item.getItemId();\n if (id == R.id.action_settings) {\n return true;\n }\n return super.onOptionsItemSelected(item);\n }\n}\n```\n",
"extract_method": "MinerU-HTML",
"sub_path": "Code"
}
```
- **[2025-12-12]** 🔥 **CC-MinerU-Formula Released!** Beyond the general AICC corpus, we have launched the first part of our specialized high-quality data: the fine-grained web formula dataset **CC-MinerU-Formula**. This data is intelligently parsed and precisely extracted from full Common Crawl raw web structures using our self-developed **MinerU-HTML** semantic-aware HTML extraction engine. Compared to traditional heuristic extraction methods, MinerU-HTML comprehends HTML semantics and effectively preserves the original structural information of formulas, making this structured content highly suitable for Large Language Model scenarios such as mathematical understanding, reasoning, and fine-tuning.
<br>Download: [CC-MinerU-Formula on Hugging Face](https://huggingface.co/datasets/opendatalab/AICC/tree/main/CC-MinerU-Formula)
<br>We have collected 975,155 cross-disciplinary formula samples, covering mathematics, physics, chemistry, and engineering. Here is a sample.
<br><img src="images/formula_sample.png" width="1500" />
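
Since each CC-MinerU-Code record stores its code inside fenced blocks in the Markdown `content` field, a consumer will often want to pull those blocks out. The following is a minimal sketch, not part of the official tooling: the helper name `extract_code_blocks` and the shortened record are hypothetical, and the regular expression assumes plain triple-backtick fences that start at the beginning of a line.

```python
import re

# Matches an opening fence with an optional language tag, the block body,
# and a closing fence on its own line. Assumes fences are not indented.
FENCE_RE = re.compile(r"^```([^\n`]*)\n(.*?)^```[ \t]*$", re.DOTALL | re.MULTILINE)


def extract_code_blocks(content: str) -> list[tuple[str, str]]:
    """Return (language_tag, code) pairs for every fenced block in Markdown text."""
    return [(m.group(1).strip(), m.group(2)) for m in FENCE_RE.finditer(content)]


# Hypothetical, heavily shortened record in the CC-MinerU-Code layout.
record = {
    "language": "zh",
    "code_language": "java",
    "content": "# Title\n\nSome prose.\n\n```java\nint x = 1;\n```\n",
}

blocks = extract_code_blocks(record["content"])
for tag, code in blocks:
    print(tag, repr(code))
```

The fence's own language tag can be compared with the record's `code_language` field as a quick consistency check.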
AICC (AI-ready Common Crawl) is a large-scale, **AI-ready web dataset** derived from **Common Crawl**, containing semantically extracted **Markdown-formatted** main content from diverse web pages. The dataset is constructed using **MinerU-HTML**, a web extraction pipeline developed by OpenDataLab.
- **High-quality main content:** High-fidelity main content extracted from diverse Common Crawl pages, including challenging types like forums, Q&A sites, and pages with tables or formulas.
- **Precise structured elements:** High-fidelity extraction of code blocks, mathematical formulas, and complex tables from real-world web pages, preserving syntax, formatting, and structural integrity.
- **Proven downstream effectiveness:** Pretraining a language model on AICC leads to higher accuracy across diverse benchmarks compared to training on datasets extracted with other methods.
🎉🎉🎉 [Experience our online web extraction with your own HTML!!!](https://opendatalab.com/ai-ready/AICC)
## Dataset Creation
**Raw HTML Source** This release includes the parsed results from two **Common Crawl** dumps:
- CC-MAIN-2025-08
- CC-MAIN-2025-13
**MinerU-HTML Pipeline** The detailed pipeline is described in the [AICC technical report](https://arxiv.org/abs/2511.16397).

## Data Statistics
The AICC dataset contains only the successfully extracted AI-ready JSON records (each with a `content` field containing Markdown text).<br>
For reference, the number of original pages in the corresponding Common Crawl dumps is also shown below.<br>
Note that only the extracted JSON records are included in the released dataset.
| **Common Crawl Dump** | **AICC JSON records (lines)** | **Original pages (lines, not included)** |
|------------------------|------------------------------------|-----------------------------------------|
| CC-MAIN-2025-08 | 2,391,293,976 | 2,679,687,937 |
| CC-MAIN-2025-13 | 2,452,518,662 | 2,740,793,128 |
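
As a quick reading aid for the table above, the share of original pages that ended up as AICC records can be computed directly from the two columns. This is only arithmetic on the published counts; the variable and function names are illustrative.

```python
# (AICC JSON records, original pages) per dump, copied from the table above.
stats = {
    "CC-MAIN-2025-08": (2_391_293_976, 2_679_687_937),
    "CC-MAIN-2025-13": (2_452_518_662, 2_740_793_128),
}


def extraction_ratio(extracted: int, original: int) -> float:
    """Fraction of original pages that yielded an extracted record."""
    return extracted / original


ratios = {dump: extraction_ratio(*counts) for dump, counts in stats.items()}
total_records = sum(extracted for extracted, _ in stats.values())

for dump, ratio in ratios.items():
    print(f"{dump}: {ratio:.2%} of original pages extracted")
print(f"Total AICC records: {total_records:,}")
```

For both dumps the ratio comes out at roughly 89%, and the two dumps together contain 4,843,812,638 records.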
## Data Format
| **Field name** | **Field description** | **Note** |
|------------|-------------------|------|
| track\_id | Unique tracking identifier for the record |- |
| url | Full original URL of the webpage, indicating the source of the content |- |
| language | Primary language of the webpage | Identified using the fastText language detection model lid.176.bin |
| content 🚩| Clean Markdown-formatted content extracted from the webpage HTML |- |
| extract\_method | Name of the web content extraction method used |- |
| sub\_path | Relative path or shard location within the original Common Crawl storage structure | Used to locate the record’s original source in WARC/WAT/WET files, supporting data traceability and verification|
**Data sample**
```json
{
"track_id": "5667aa9a-da8a-5c80-a678-d38609247cb5",
"url": "https://cheezburger.com/14867205/dude-finds-giant-centipede-in-daughters-room-horrifies-people-with-the-footage",
"language": "en",
"content": "# Dude Finds Giant Centipede In Daughter's Room & Horrifies People With the Footage
Growing up in Brooklyn I was always terrified of house centipedes. And why wouldn't I be? The revoltingly speedy unibrow bugs were all over the place. They'd find their way onto my arm at slumber parties, inspiring blood-curdling screams more indicative of a murder than a creepy crawly. I'd find them in my shoes. My mother would tell me they're \"good bugs\" because they eat the \"bad bugs,\" but to me they were the stuff of literal nightmares. Eventually my fear subsided, after having to live in a dank and dark East Village basement where they were a daily sighting. I'd almost forgotten about the critters - until Twitter user@VickGlaze horrified the internet with a video that features a far more sinister-looking centipede.
I almost had set the crib on fire yesterday….Y'all wouldn't believe what I found in my daughters room dawg. I just spent a 50ball at Lowe's on everything pest control. I found where it came in and sealed every window in the crib and some. pic.twitter.com/iAoN7uWYsY
— Miami Vice II (@VickGlaze) July 27, 2021
The alarmingly large centipede in Vick's video is known by a few names. Some call them Texas redheaded centipedes, others the Giant Desert Centipede. Whatever its called, it's freaking terrifying. And this scared father was not afraid to admit it. In a thread, he details all the measures he took to protect his home, and the terror the creature has struck in his heart. The responses (mostly in solidarity) are almost as entertaining as the video and accompanying story. As for me? I'm feeling pretty damn lucky that I grew up in New York and not in Austin, Texas, where these massive venomous critters roam.
",
"extract_method": "MinerU-HTML",
"sub_path": "CC-MAIN-2025-08"
}
```
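
A minimal sketch of reading records in this format follows. It assumes the released files are JSON Lines (one JSON object per line), which the "lines" wording in the statistics table suggests but which should be checked against the actual files. The helper name `iter_records` and the in-memory sample are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO

# Field names taken from the Data Format table.
REQUIRED_FIELDS = ("track_id", "url", "language", "content", "extract_method", "sub_path")


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line, checking that all documented fields exist."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


# Hypothetical one-record file standing in for a real shard.
sample = io.StringIO(
    json.dumps(
        {
            "track_id": "5667aa9a-da8a-5c80-a678-d38609247cb5",
            "url": "https://example.com/page",
            "language": "en",
            "content": "# Heading\n\nBody text.",
            "extract_method": "MinerU-HTML",
            "sub_path": "CC-MAIN-2025-08",
        }
    )
    + "\n"
)

records = list(iter_records(sample))
english = [r for r in records if r["language"] == "en"]
print(len(records), len(english))
```

Streaming line by line keeps memory use flat, which matters when a single dump holds billions of records. Filtering on `language` or `sub_path` can be done on the yielded dicts, as the last lines show.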
## Citation
```
@misc{ma2025aiccparsehtmlfiner,
title={AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser},
author={Ren Ma and Jiantao Qiu and Chao Xu and Pei Chu and Kaiwen Liu and Pengli Ren and Yuan Qu and Jiahui Peng and Linfeng Hou and Mengjie Liu and Lindong Lu and Wenchang Ning and Jia Yu and Rui Min and Jin Shi and Haojiong Chen and Peng Zhang and Wenjian Zhang and Qian Jiang and Zengjie Hu and Guoqiang Yang and Zhenxiang Li and Fukai Shang and Runyuan Ma and Chenlin Su and Zhongying Tu and Wentao Zhang and Dahua Lin and Conghui He},
year={2025},
eprint={2511.16397},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.16397},
}
```
## License and Ethics
This dataset is licensed under **CC BY 4.0**, requiring attribution when used. It is derived from Common Crawl web pages and may contain biased or sensitive content; users are responsible for ethical and lawful usage in research or applications.
Provider: maas
Created: 2025-11-25



