crawl_domain
Source: ModelScope community (updated 2025-12-05, indexed 2025-07-12)
Download link: https://modelscope.cn/datasets/google-research-datasets/crawl_domain
# Dataset Card for Common Crawl Domain Names
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/google-research-datasets/common-crawl-domain-names
- **Repository:** https://github.com/google-research-datasets/common-crawl-domain-names
- **Paper:** https://arxiv.org/pdf/2011.03138
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
Breaking [domain names](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL) such as "openresearch" into the component words "open" and "research" is important for applications such as Text-to-Speech synthesis and web search. [Common Crawl](https://commoncrawl.org/) is an open repository of web crawl data that anyone can access and analyze. Specifically, we scraped the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). In that example segmentation is trivial from the letter casing alone, but this is not always the case (e.g. "NASA"), so we had to annotate the data manually.
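To make the "OpenBSD" versus "NASA" contrast concrete, here is a minimal sketch of a casing-only heuristic. It assumes that "trivial" means splitting before every uppercase letter; the function name `split_on_uppercase` and that reading of "trivial" are illustrative assumptions, not taken from the paper or the repository.

```python
import re


def split_on_uppercase(domain: str) -> list[str]:
    """Naive casing heuristic (an assumption): start a new segment
    at every uppercase letter and keep everything else attached."""
    # First alternative: one uppercase letter plus any non-uppercase run.
    # Second alternative: a leading run that has no uppercase letter at all.
    return re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", domain)


print(split_on_uppercase("OpenBSD"))       # ['Open', 'B', 'S', 'D']
print(split_on_uppercase("NASA"))          # ['N', 'A', 'S', 'A']
print(split_on_uppercase("ASAPWorkouts"))  # ['A', 'S', 'A', 'P', 'Workouts']
```

The first result agrees with the annotated form "Open B S D" listed under Data Instances, while the other two disagree with the annotated forms "NASA" and "ASAP Workouts". Cases like these are where casing alone is not enough and manual annotation is needed.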
### Supported Tasks and Leaderboards
- Text-to-Speech synthesis
- Web search
### Languages
en: English
## Dataset Structure
### Data Instances
Each sample is a domain name written as space-separated segments. The examples are stored in their original letter casing, but harder and more interesting examples can be generated by lowercasing the input first.
For example:
```
Open B S D
NASA
ASAP Workouts
```
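As a rough illustration of the lowercasing idea above, the following sketch turns one stored example into a lowercased, unsegmented input plus per-character boundary labels. The helper name `to_training_pair` and the labeling scheme (1 on the last character of each segment, 0 elsewhere) are assumptions made for this example, not the format used by the dataset authors.

```python
def to_training_pair(example: str) -> tuple[str, list[int]]:
    """Convert a space-separated example into (input, labels).

    The input is the lowercased domain name with spaces removed.
    labels[i] is 1 if character i ends a segment, else 0
    (a hypothetical encoding chosen for illustration).
    """
    segments = example.split()
    joined = "".join(segments).lower()
    labels: list[int] = []
    for segment in segments:
        # All characters but the last are inside the segment.
        labels.extend([0] * (len(segment) - 1) + [1])
    return joined, labels


for example in ["Open B S D", "NASA", "ASAP Workouts"]:
    print(to_training_pair(example))
```

After lowercasing, "openbsd" and "nasa" no longer carry any casing cue, so a model has to recover the boundaries from the characters alone.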
### Data Fields
- `example`: a `string` feature containing the space-separated segments of a domain name.
### Data Splits
| split | size | trivial | avg_input_length | avg_segments |
|-------|-------|---------|------------------|--------------|
| train | 17572 | 13718 | 12.63 | 2.65 |
| eval | 1953 | 1536 | 12.77 | 2.67 |
| test | 2170 | 1714 | 12.63 | 2.66 |
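The share of trivial examples in each split can be worked out directly from the table. The sketch below copies the `size` and `trivial` columns; reading `trivial` as the count of examples that casing alone can segment is an assumption based on the Dataset Summary, and the variable names are illustrative.

```python
# (size, trivial) per split, copied from the table above.
splits = {
    "train": (17572, 13718),
    "eval": (1953, 1536),
    "test": (2170, 1714),
}

# Fraction of each split that is marked trivial.
trivial_share = {name: trivial / size for name, (size, trivial) in splits.items()}

total_size = sum(size for size, _ in splits.values())
total_trivial = sum(trivial for _, trivial in splits.values())

for name, share in trivial_share.items():
    print(f"{name}: {share:.1%} trivial")
print(f"all: {total_size} examples, {total_trivial / total_size:.1%} trivial")
```

In every split roughly 78 to 79 percent of the examples are counted as trivial, which leaves about a fifth of the data as the harder cases when the original casing is kept.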
## Dataset Creation
### Curation Rationale
The dataset was curated by scraping the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). In that example segmentation is trivial from the letter casing alone, but this is not always the case (e.g. "NASA"), so the curators of the dataset had to annotate the data manually.
### Source Data
#### Initial Data Collection and Normalization
The corpus consists of domain names scraped from Common Crawl and manually annotated to add word boundaries.
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
The annotators are the curators of this dataset.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
The curators of this dataset are [Jae Hun Ro](https://github.com/JaeHunRo) and [mwurts4google](https://github.com/mwurts4google), who are the contributors to the official GitHub repository for this dataset. Since the account handles of the other curators are currently unknown, the authors of the paper linked to this dataset are listed here as curators: [Hao Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang%2C+H), [Jae Ro](https://arxiv.org/search/cs?searchtype=author&query=Ro%2C+J), and [Richard Sproat](https://arxiv.org/search/cs?searchtype=author&query=Sproat%2C+R).
### Licensing Information
[MIT License](https://github.com/google-research-datasets/common-crawl-domain-names/blob/master/LICENSE)
### Citation Information
```
@inproceedings{zrs2020urlsegmentation,
title={Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities},
author={Hao Zhang and Jae Ro and Richard William Sproat},
booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
year={2020}
}
```
### Contributions
Thanks to [@Karthik-Bhaskar](https://github.com/Karthik-Bhaskar) for adding this dataset.
Provider: maas
Created: 2025-07-07



