DUKweb
Source: arXiv | Updated: 2021-10-25 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2107.01076v2
Resource overview:
DUKweb is a large-scale resource created by The Alan Turing Institute in London, UK, for the diachronic analysis of contemporary English. It is derived from the JISC UK Web Domain Dataset (1996-2013), a very large archive of resources collected by the Internet Archive that were hosted under the '.uk' domain. DUKweb provides a series of word co-occurrence matrices and two types of word embeddings for each year covered. The dataset is 330 GB in size and contains 131.6 billion word occurrences from 1996 to 2013. It was built by extracting the text resources from the JISC dataset and training word embeddings with the Temporal Random Indexing and word2vec algorithms. DUKweb supports tasks such as word similarity, relatedness, analogy, and semantic change detection, and is intended for studying how language changes over time, particularly under the influence of the Internet and social media.
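As an illustration of the semantic change detection task mentioned above, the following is a minimal sketch, not DUKweb's own tooling or file format. It assumes that a word's vector for each year has already been loaded into a plain dictionary and that the yearly vectors live in a shared space, so they can be compared directly; embeddings trained independently per year would generally need to be aligned first. The names `cosine` and `yearly_drift` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


def yearly_drift(vectors_by_year):
    """Cosine distance between each pair of consecutive years.

    vectors_by_year maps a year to that year's vector for one word
    (hypothetical layout, assumed to be in a shared vector space).
    """
    years = sorted(vectors_by_year)
    return {
        (prev, curr): 1.0 - cosine(vectors_by_year[prev], vectors_by_year[curr])
        for prev, curr in zip(years, years[1:])
    }


# Toy, made-up vectors for a single word: no change, then a sharp shift.
toy = {1996: [1.0, 0.0], 1997: [1.0, 0.0], 1998: [0.0, 1.0]}
drift = yearly_drift(toy)
```

A large distance between two consecutive years is only a candidate signal of meaning change; in practice it would be checked against word frequency and corpus composition for those years.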
Provider:
The Alan Turing Institute, London, UK
Created:
2021-07-02



