
Neural Embeddings for Populated GeoNames Locations

Source: Figshare. Updated: 2018-03-07. Indexed: 2026-04-08.
Download link:
https://springernature.figshare.com/articles/dataset/Neural_Embeddings_for_Populated_GeoNames_Locations/5248120/1
Description:
This dataset contains embeddings and associated files for all GeoNames populated locations with a population greater than 0. The embeddings are DeepWalk embeddings, and all of them were generated using the <b>gensim</b> package in Python. GeoNames is a large, comprehensive knowledge base of geographical locations, both populated and unpopulated, at different administrative levels. Specific files and file groups are described below.

FILES:

<b>populated-place-embeddings-{1-7}.jl.gz</b> are compressed JSON Lines files in which each line is a JSON object containing a single key-value pair. For example:

{"": [-0.00016037290333770216, -0.16331960260868073, 0.14559058845043182, 0.014049071818590164, 0.016628244891762733, -0.1897198110818863, 0.14188186824321747, 0.11444853991270065, 0.018866628408432007, -0.04606882855296135, -0.15467466413974762, 0.18563739955425262, -0.053200043737888336, -0.16911742091178894, -0.05657876282930374, -0.00039542425656691194, 0.06017935276031494, 0.019657278433442116, 0.014765074476599693, 0.1783478558063507, -0.0005339447525329888, -0.10394854098558426, -0.17260663211345673, 0.022552259266376495, 0.060257066041231155, 0.298672616481781, 0.06802289932966232, -0.03377196565270424, 0.13092969357967377, 0.007119285874068737, -0.011325502768158913, -0.03717223182320595, 0.10660370439291, 0.09025651216506958, 0.023348767310380936, -0.1824614256620407, 0.13780122995376587, -0.03130345419049263, -0.10430310666561127, 0.02684781886637211, 0.06622970849275589, -0.11856741458177567, -0.053687382489442825, 0.14932486414909363, 0.03837030380964279, 0.0744427815079689, 0.08571769297122955, 0.007302495185285807, 0.11334840208292007, 0.0385720431804657, -0.15567876398563385, -0.134855255484581, 0.10208329558372498, 0.07373132556676865, -0.08734219521284103, 0.0397501066327095, -0.07437381893396378, 0.019372688606381416, -0.03451831266283989, 0.015789508819580078, 0.05273417755961418, -0.06393982470035553, -0.09630796313285828, -0.047562047839164734, -0.07217235118150711, -0.05011208727955818, 0.08369360864162445, 0.08336202055215836, 0.05594024807214737, 0.12139936536550522, 0.051332831382751465, 0.027806641533970833, 0.09555482864379883, -0.14043797552585602, -0.003694443963468075, 0.009580886922776699, 0.19412516057491302, 0.05366421863436699, 0.10408374667167664, 0.12925657629966736, -0.07961612194776535, -0.07847319543361664, -0.23122484982013702, -0.1003454178571701, 0.1332612931728363, -0.00953035056591034, 0.030777478590607643, 0.054927386343479156, -0.17683790624141693, 0.0793348103761673, 0.06144792586565018, -0.05212269350886345, 0.07661330699920654, 0.08598912507295609, -0.026763606816530228, 0.0698607936501503, 0.16680659353733063, 0.07281855493783951, -0.0883372575044632, 0.015735004097223282]}

is the 100-dimensional embedding of Presidio, which has GeoNames ID 3521133. This way, each location is canonically identified but is also human-readable. To combine all the files into a single .jl file, use the cat command (after decompressing) in a UNIX-like terminal. The first 100 lines of populated-place-embeddings-1.jl.gz are in samples-100-locations.jsonl (renamed from .jl to .jsonl to prevent GitHub from erroneously concluding that this is a Julia project) and can be viewed in a browser.

<b>mapped_places.txt.gz</b>: To read and process the data efficiently in memory, we mapped each of the mnemonic GeoNames keys to an integer. E.g. 287145; we use a tab to separate the two items. 100 sample lines were written out to samples-100-mapped_places.txt. Note that mapped_places.txt contains keys that you will not find in the embedding files. These are places that are no longer populated, i.e. GeoNames shows them as having population 0 (or as missing a population feature).

<b>random-walk-corpus.txt.gz</b>: contains the random walk corpus that was used by DeepWalk for the embeddings. Each line contains one random walk sequence, space-separated. Each item in the sequence is the mapped integer of the mnemonic GeoNames key. We initiated 5 samples per node, up to a maximum depth that this description does not specify.

<b>weighted_adjacency_list.tsv.gz</b>: contains the directed weighted network that was used to generate the random walks. The items in each line are tab-delimited and are of the form: [node-out] [node-in1] [weight1] [node-in2] [weight2]... where node-out is the node from which the edges are outgoing (to node-in1, node-in2...). The weights represent a probability distribution that is inversely related to the distance between the places represented by the nodes. For more details, see the paper.

<b>samples-100-weighted_adjacency_list.tsv</b> contains the first 100 lines of weighted_adjacency_list.tsv for easy viewing.
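The file formats described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' code: the function names (iter_embeddings, parse_adjacency_line, random_walk) and the max_depth parameter are hypothetical, and the walk sampler assumes that each node's weights can be used directly as transition probabilities, as the description suggests. The actual maximum depth used for the published corpus is not stated in this description.

```python
import gzip
import json
import random
from typing import Dict, Iterator, List, Tuple

# Hypothetical type alias: node -> (target nodes, matching edge weights).
Graph = Dict[str, Tuple[List[str], List[float]]]


def iter_embeddings(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Yield (key, vector) pairs from one gzipped JSON Lines embedding file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Each line is described as a JSON object with one key-value pair.
            for key, vector in json.loads(line).items():
                yield key, vector


def parse_adjacency_line(line: str) -> Tuple[str, List[str], List[float]]:
    """Split '[node-out] [node-in1] [weight1] ...' into source, targets, weights."""
    fields = line.rstrip("\n").split("\t")
    source, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating node/weight fields after the source")
    targets = rest[0::2]
    weights = [float(w) for w in rest[1::2]]
    return source, targets, weights


def random_walk(graph: Graph, start: str, max_depth: int,
                rng: random.Random) -> List[str]:
    """Sample one weighted walk of at most max_depth nodes (assumed semantics)."""
    walk = [start]
    while len(walk) < max_depth and walk[-1] in graph:
        targets, weights = graph[walk[-1]]
        if not targets:
            break
        # Pick the next node with probability proportional to its edge weight.
        walk.append(rng.choices(targets, weights=weights, k=1)[0])
    return walk
```

To use it, build a Graph by calling parse_adjacency_line on each line of the decompressed weighted_adjacency_list.tsv, then call random_walk several times per start node with a seeded random.Random for reproducibility. Writing each walk as a space-separated line gives a file in the same shape as random-walk-corpus.txt.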

Provider:
Mayank Kejriwal
Created:
2017-07-26