
google-research-datasets/crawl_domain

Hugging Face | updated 2024-01-18 | indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/google-research-datasets/crawl_domain
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|other-Common-Crawl
- original
task_categories:
- other
task_ids: []
paperswithcode_id: common-crawl-domain-names
pretty_name: Common Crawl Domain Names
tags:
- web-search
- text-to-speech
dataset_info:
  features:
  - name: example
    dtype: string
  splits:
  - name: train
    num_bytes: 321134
    num_examples: 17572
  - name: test
    num_bytes: 39712
    num_examples: 2170
  - name: validation
    num_bytes: 36018
    num_examples: 1953
  download_size: 331763
  dataset_size: 396864
---

# Dataset Card for Common Crawl Domain Names

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/google-research-datasets/common-crawl-domain-names
- **Repository:** https://github.com/google-research-datasets/common-crawl-domain-names
- **Paper:** https://arxiv.org/pdf/2011.03138
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

Breaking [domain names](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL) such as "openresearch" into the component words "open" and "research" is important for applications such as Text-to-Speech synthesis and web search. [Common Crawl](https://commoncrawl.org/) is an open repository of web crawl data that can be accessed and analyzed by anyone. Specifically, we scraped the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). Although segmentation is trivial using letter casing in that example, this was not always the case (e.g. "NASA"), so we had to manually annotate the data.

### Supported Tasks and Leaderboards

- Text-to-Speech synthesis
- Web search

### Languages

en: English

## Dataset Structure

### Data Instances

Each sample is one domain name written as space-separated segments. The examples are stored in their original letter casing, but harder and more interesting examples can be generated by lowercasing the input first.

For example:

```
Open B S D
NASA
ASAP Workouts
```

### Data Fields

- `example`: a `string` feature: space-separated segments of a domain name.

### Data Splits

| split | size  | trivial | avg_input_length | avg_segments |
|-------|-------|---------|------------------|--------------|
| train | 17572 | 13718   | 12.63            | 2.65         |
| eval  | 1953  | 1536    | 12.77            | 2.67         |
| test  | 2170  | 1714    | 12.63            | 2.66         |

## Dataset Creation

### Curation Rationale

The dataset was curated by scraping the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). Although segmentation is trivial using letter casing in that example, this was not always the case (e.g. "NASA"), so the curators of the dataset had to manually annotate the data.

### Source Data

#### Initial Data Collection and Normalization

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

The annotators are the curators of this dataset.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The curators of this dataset are [Jae Hun Ro](https://github.com/JaeHunRo) and [mwurts4google](https://github.com/mwurts4google), who are the contributors to the official GitHub repository for this dataset. Since the account handles of the other curators are currently unknown, the authors of the paper linked to this dataset are listed here as curators: [Hao Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang%2C+H), [Jae Ro](https://arxiv.org/search/cs?searchtype=author&query=Ro%2C+J), and [Richard Sproat](https://arxiv.org/search/cs?searchtype=author&query=Sproat%2C+R).

### Licensing Information

[MIT License](https://github.com/google-research-datasets/common-crawl-domain-names/blob/master/LICENSE)

### Citation Information

```
@inproceedings{zrs2020urlsegmentation,
  title={Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities},
  author={Hao Zhang and Jae Ro and Richard William Sproat},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}
```

### Contributions

Thanks to [@Karthik-Bhaskar](https://github.com/Karthik-Bhaskar) for adding this dataset.
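For readers who want to inspect the data directly, the following is a minimal sketch of loading one split with the Hugging Face `datasets` library. It assumes the library is installed, that network access is available, and that the repository id `google-research-datasets/crawl_domain` shown in the page title resolves on the Hub. The split names and the `example` field come from the card metadata; `to_segments` and `load_segmented` are hypothetical helper names introduced here for illustration.

```python
from typing import List


def to_segments(example: str) -> List[str]:
    """Split one `example` string into its annotated segments."""
    return example.split()


def load_segmented(split: str = "train") -> List[List[str]]:
    """Load a split and return each domain name as a list of segments.

    Assumption: the repository id below is the one shown on this page.
    """
    # Imported lazily so that to_segments stays usable without the library.
    from datasets import load_dataset

    ds = load_dataset("google-research-datasets/crawl_domain", split=split)
    return [to_segments(row["example"]) for row in ds]


# Usage (needs network access):
#   rows = load_segmented("validation")
#   print(len(rows), rows[:3])
```

Keeping the download inside a function means that importing the helpers does not trigger any network traffic.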
Provider:
google-research-datasets
Summary of original information

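Dataset Card for Common Crawl Domain Names

Dataset Description

Dataset Summary

A corpus of domain names scraped from Common Crawl and manually annotated with word boundaries (e.g. "commoncrawl" becomes "common crawl").

Breaking domain names such as "openresearch" into the component words "open" and "research" is important for applications such as text-to-speech synthesis and web search. Common Crawl is an open repository of web crawl data that anyone can access and analyze. Specifically, we scraped the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). Although segmentation by letter casing is trivial in that example, this is not always the case (e.g. "NASA"), so we had to annotate the data manually.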
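The contrast between "OpenBSD" and "NASA" can be made concrete with a casing-only baseline. The sketch below assumes that the trivial rule is "start a new segment at every uppercase letter", which is one reading consistent with the sample annotation "Open B S D"; the card does not spell the rule out. The function name `split_on_uppercase` is hypothetical and is not taken from the dataset's repository.

```python
import re


def split_on_uppercase(domain: str) -> str:
    """Casing-only baseline: insert a space before every uppercase letter
    except the first character of the string."""
    return re.sub(r"(?<!^)(?=[A-Z])", " ", domain)


for name in ["OpenBSD", "NASA"]:
    print(name, "->", split_on_uppercase(name))
```

Under this rule "OpenBSD" is split the same way as the annotated sample, while "NASA" is broken into single letters. Casing alone therefore cannot decide such cases, which is why manual annotation was needed.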

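Supported Tasks and Leaderboards

  • Text-to-speech synthesis
  • Web search

Languages

en: English

Dataset Structure

Data Instances

Each sample is one domain name written as space-separated segments. The examples keep their original letter casing, but harder and more interesting examples can be generated by lowercasing the input first.

For example:

Open B S D
NASA
ASAP Workouts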
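The lowercasing idea can be sketched as follows. This is a minimal illustration, assuming that the model input is the example with its spaces removed and that the target is a per-character boundary label. The helper `make_pair` and the 0/1 label scheme are hypothetical choices made for this sketch, not part of the dataset.

```python
from typing import List, Tuple


def make_pair(example: str, lowercase: bool = True) -> Tuple[str, List[int]]:
    """Turn one space-separated example into (input, boundary labels).

    labels[i] is 1 when a segment ends right after character i of the
    input and another segment follows; otherwise it is 0.
    """
    text = example.lower() if lowercase else example
    segments = text.split()
    source = "".join(segments)
    labels = [0] * len(source)
    pos = 0
    for seg in segments[:-1]:  # no boundary is marked after the last segment
        pos += len(seg)
        labels[pos - 1] = 1
    return source, labels


print(make_pair("Open B S D"))
print(make_pair("ASAP Workouts"))
```

With `lowercase=True` the casing cue disappears from the input, so a model has to find the boundaries from the letters alone.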

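Data Fields

  • example: a string feature holding the space-separated segments of a domain name.

Data Splits

| split | size  | trivial | avg_input_length | avg_segments |
|-------|-------|---------|------------------|--------------|
| train | 17572 | 13718   | 12.63            | 2.65         |
| eval  | 1953  | 1536    | 12.77            | 2.67         |
| test  | 2170  | 1714    | 12.63            | 2.66         |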
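The per-split columns can in principle be recomputed from the raw examples. The sketch below assumes that `avg_input_length` counts the characters of the unsegmented domain name and that `avg_segments` counts the space-separated pieces. The card does not define these columns, so both readings are assumptions, and `split_stats` is a hypothetical helper.

```python
from typing import Dict, Iterable


def split_stats(examples: Iterable[str]) -> Dict[str, float]:
    """Compute size, mean unsegmented length, and mean segment count."""
    rows = [ex.split() for ex in examples]
    n = len(rows)
    if n == 0:
        raise ValueError("no examples given")
    return {
        "size": n,
        # Length of the domain name once the annotation spaces are removed.
        "avg_input_length": sum(len("".join(r)) for r in rows) / n,
        "avg_segments": sum(len(r) for r in rows) / n,
    }


sample = ["Open B S D", "NASA", "ASAP Workouts"]
print(split_stats(sample))
```

From the table itself, the `trivial` column accounts for roughly 78% to 79% of each split (for example 13718 of 17572 in train).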

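Dataset Creation

Curation Rationale

The dataset was curated by scraping the plaintext (WET) extracts for domain names from URLs that contained diverse letter casing (e.g. "OpenBSD"). Although segmentation by letter casing is trivial in that example, this is not always the case (e.g. "NASA"), so the curators of the dataset had to annotate the data manually.

Source Data

Initial Data Collection and Normalization

A corpus of domain names scraped from Common Crawl and manually annotated with word boundaries.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

The annotators are the curators of this dataset.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The curators of this dataset are Jae Hun Ro and mwurts4google, who are the contributors to the official GitHub repository for this dataset. Since the account handles of the other curators are currently unknown, the authors of the paper linked to this dataset are listed here as curators: Hao Zhang, Jae Ro, and Richard Sproat.

Licensing Information

MIT License

Citation Information

@inproceedings{zrs2020urlsegmentation,
  title={Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities},
  author={Hao Zhang and Jae Ro and Richard William Sproat},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Contributions

Thanks to @Karthik-Bhaskar for adding this dataset.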
