ellenhp/libpostal
Source: Hugging Face. Updated 2024-03-25; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/ellenhp/libpostal
Description:
---
dataset_info:
- config_name: geoplanet
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 14351561106.506264
    num_examples: 130909357
  - name: test
    num_bytes: 144965292.49373704
    num_examples: 1322317
  download_size: 4039968748
  dataset_size: 14496526399.0
- config_name: openaddresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 61976401746.036736
    num_examples: 446673083
  - name: test
    num_bytes: 626024353.9632672
    num_examples: 4511850
  download_size: 21993145847
  dataset_size: 62602426100.0
- config_name: openstreetmap_addresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 35406459284.34487
    num_examples: 316914300
  - name: test
    num_bytes: 357641053.655127
    num_examples: 3201155
  download_size: 13366122418
  dataset_size: 35764100338.0
- config_name: openstreetmap_places
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 4341603986.997948
    num_examples: 48989431
  - name: test
    num_bytes: 43854609.00205241
    num_examples: 494843
  download_size: 1597409288
  dataset_size: 4385458596.0
- config_name: openstreetmap_ways
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 9703643777.614073
    num_examples: 72476682
  - name: test
    num_bytes: 98016644.3859272
    num_examples: 732088
  download_size: 3262932325
  dataset_size: 9801660422.0
- config_name: uk_openaddresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 212476615.73347768
    num_examples: 1724602
  - name: test
    num_bytes: 2146324.2665223135
    num_examples: 17421
  download_size: 50229957
  dataset_size: 214622940.0
configs:
- config_name: geoplanet
  data_files:
  - split: train
    path: geoplanet/train-*
  - split: test
    path: geoplanet/test-*
- config_name: openaddresses
  data_files:
  - split: train
    path: openaddresses/train-*
  - split: test
    path: openaddresses/test-*
- config_name: openstreetmap_addresses
  data_files:
  - split: train
    path: openstreetmap_addresses/train-*
  - split: test
    path: openstreetmap_addresses/test-*
- config_name: openstreetmap_places
  data_files:
  - split: train
    path: openstreetmap_places/train-*
  - split: test
    path: openstreetmap_places/test-*
- config_name: openstreetmap_ways
  data_files:
  - split: train
    path: openstreetmap_ways/train-*
  - split: test
    path: openstreetmap_ways/test-*
- config_name: uk_openaddresses
  data_files:
  - split: train
    path: uk_openaddresses/train-*
  - split: test
    path: uk_openaddresses/test-*
---
# Under Construction: Libpostal training dataset
For licensing information, refer to the [libpostal README](https://github.com/openvenues/libpostal/blob/57eaa414ceadb48d5922099eeaa446b02894a2e4/README.md#parser-training-sets).
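
A minimal sketch of reading one configuration with the Hugging Face `datasets` library, assuming the repository ID `ellenhp/libpostal` from the page title and the configuration names from the card metadata. Streaming avoids downloading a whole configuration up front (the largest, `openaddresses`, lists a download size of roughly 22 GB). The helper names `stream_split` and `pair_tokens_with_tags` are illustrative only, and the one-to-one alignment of `tokens` and `tags` is an assumption that the card does not state.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Configuration names as listed in the card metadata.
CONFIGS = (
    "geoplanet",
    "openaddresses",
    "openstreetmap_addresses",
    "openstreetmap_places",
    "openstreetmap_ways",
    "uk_openaddresses",
)


def stream_split(config: str, split: str = "test") -> Iterable[Dict[str, Any]]:
    """Return a streaming iterator over one split of one configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in ("train", "test"):
        raise ValueError(f"unknown split {split!r}; expected 'train' or 'test'")
    # Imported lazily so the validation above works without the package installed.
    from datasets import load_dataset

    return load_dataset("ellenhp/libpostal", config, split=split, streaming=True)


def pair_tokens_with_tags(record: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Zip tokens with integer tag IDs (assumes one tag per token)."""
    tokens, tags = record["tokens"], record["tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags differ in length")
    return list(zip(tokens, tags))
```

Calling `next(iter(stream_split("uk_openaddresses")))` should yield a dict with the keys `lang`, `country`, `tokens`, and `tags`; it needs network access and the `datasets` package. The card does not give the mapping from integer tag IDs to label names, so the sketch leaves the IDs as they are.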
Provider:
ellenhp
Summary of original information
Dataset overview
1. geoplanet
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 130909357 examples, 14351561106.506264 bytes
     - Test: 1322317 examples, 144965292.49373704 bytes
   - Download size: 4039968748 bytes
   - Dataset size: 14496526399.0 bytes
2. openaddresses
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 446673083 examples, 61976401746.036736 bytes
     - Test: 4511850 examples, 626024353.9632672 bytes
   - Download size: 21993145847 bytes
   - Dataset size: 62602426100.0 bytes
3. openstreetmap_addresses
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 316914300 examples, 35406459284.34487 bytes
     - Test: 3201155 examples, 357641053.655127 bytes
   - Download size: 13366122418 bytes
   - Dataset size: 35764100338.0 bytes
4. openstreetmap_places
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 48989431 examples, 4341603986.997948 bytes
     - Test: 494843 examples, 43854609.00205241 bytes
   - Download size: 1597409288 bytes
   - Dataset size: 4385458596.0 bytes
5. openstreetmap_ways
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 72476682 examples, 9703643777.614073 bytes
     - Test: 732088 examples, 98016644.3859272 bytes
   - Download size: 3262932325 bytes
   - Dataset size: 9801660422.0 bytes
6. uk_openaddresses
   - Features:
     - lang: string
     - country: string
     - tokens: sequence of strings
     - tags: sequence of integers
   - Splits:
     - Train: 1724602 examples, 212476615.73347768 bytes
     - Test: 17421 examples, 2146324.2665223135 bytes
   - Download size: 50229957 bytes
   - Dataset size: 214622940.0 bytes
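
The split sizes listed above can be checked with a little arithmetic. The sketch below copies the example counts from the metadata and computes the share of each configuration held out for testing. In every configuration the test count equals 1% of the total rounded up, which is consistent with a 1% hold-out; the card itself does not say how the splits were produced, so treat that as an observation, not a documented procedure. The name `holdout_fraction` is illustrative.

```python
import math

# (train examples, test examples) per configuration, copied from the card metadata.
SPLITS = {
    "geoplanet": (130_909_357, 1_322_317),
    "openaddresses": (446_673_083, 4_511_850),
    "openstreetmap_addresses": (316_914_300, 3_201_155),
    "openstreetmap_places": (48_989_431, 494_843),
    "openstreetmap_ways": (72_476_682, 732_088),
    "uk_openaddresses": (1_724_602, 17_421),
}


def holdout_fraction(train: int, test: int) -> float:
    """Share of all examples that fall in the test split."""
    return test / (train + test)


for name, (train, test) in SPLITS.items():
    total = train + test
    # Check whether the listed test count equals 1% of the total, rounded up.
    matches = math.ceil(total / 100) == test
    print(f"{name:24s} total={total:>13,d} "
          f"test share={holdout_fraction(train, test):.4%} ceil(1%) match={matches}")

total_examples = sum(train + test for train, test in SPLITS.values())
print(f"all configurations: {total_examples:,d} examples")
```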



