Icelandic Common Crawl Corpus (IC3)
Source: arXiv | Updated: 2022-01-18 | Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/mideind/icelandic-common-crawl-corpus-IC3
Description:
The Icelandic Common Crawl Corpus (IC3) is a high-quality text collection created jointly by Miðeind and the University of Iceland by crawling the Icelandic top-level domain, .is. The corpus contains approximately 63.5 million web pages and about 16 GB of text, covering a wide range of domains and topics. IC3 was built with efficient web crawling followed by strict data cleaning steps intended to ensure the quality and diversity of the data. The corpus is used primarily to train Icelandic language models such as IceBERT, which support natural language processing tasks including part-of-speech tagging, named entity recognition, grammatical error detection, and constituency parsing. With IC3, the research team showed that even for low- to medium-resource languages, properly cleaned web-crawled data is sufficient to reach state-of-the-art performance.
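The description says the crawl targets the .is top-level domain and that the data is cleaned afterwards. As a rough illustration only, the sketch below shows one way such a domain filter with exact-duplicate removal could look. It is not the authors' pipeline: the function names `is_icelandic_tld` and `filter_and_dedupe` are hypothetical, and the real cleaning steps are not described on this page.

```python
from urllib.parse import urlsplit


def is_icelandic_tld(url: str) -> bool:
    """Return True when the URL's host ends in the .is top-level domain.

    Assumption: URLs carry a scheme (e.g. https://); without one, urlsplit
    finds no host and the URL is rejected.
    """
    host = urlsplit(url).hostname or ""
    return host.lower().rstrip(".").endswith(".is")


def filter_and_dedupe(urls):
    """Keep .is URLs and drop exact duplicates, preserving first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        if is_icelandic_tld(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    candidates = [
        "https://www.mbl.is/frettir/",
        "https://example.is.com/",      # .com host, not .is
        "https://www.mbl.is/frettir/",  # exact duplicate
        "https://www.hi.is/",
    ]
    for url in filter_and_dedupe(candidates):
        print(url)
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as `example.is.com`, where ".is" appears inside the host but is not the top-level domain.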
Provider:
Miðeind and the University of Iceland
Created:
2022-01-15



