ellenhp/libpostal

Hugging Face, updated 2024-03-25, indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/ellenhp/libpostal
Resource description:
---
dataset_info:
- config_name: geoplanet
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 14351561106.506264
    num_examples: 130909357
  - name: test
    num_bytes: 144965292.49373704
    num_examples: 1322317
  download_size: 4039968748
  dataset_size: 14496526399.0
- config_name: openaddresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 61976401746.036736
    num_examples: 446673083
  - name: test
    num_bytes: 626024353.9632672
    num_examples: 4511850
  download_size: 21993145847
  dataset_size: 62602426100.0
- config_name: openstreetmap_addresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 35406459284.34487
    num_examples: 316914300
  - name: test
    num_bytes: 357641053.655127
    num_examples: 3201155
  download_size: 13366122418
  dataset_size: 35764100338.0
- config_name: openstreetmap_places
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 4341603986.997948
    num_examples: 48989431
  - name: test
    num_bytes: 43854609.00205241
    num_examples: 494843
  download_size: 1597409288
  dataset_size: 4385458596.0
- config_name: openstreetmap_ways
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 9703643777.614073
    num_examples: 72476682
  - name: test
    num_bytes: 98016644.3859272
    num_examples: 732088
  download_size: 3262932325
  dataset_size: 9801660422.0
- config_name: uk_openaddresses
  features:
  - name: lang
    dtype: string
  - name: country
    dtype: string
  - name: tokens
    sequence: string
  - name: tags
    sequence: int64
  splits:
  - name: train
    num_bytes: 212476615.73347768
    num_examples: 1724602
  - name: test
    num_bytes: 2146324.2665223135
    num_examples: 17421
  download_size: 50229957
  dataset_size: 214622940.0
configs:
- config_name: geoplanet
  data_files:
  - split: train
    path: geoplanet/train-*
  - split: test
    path: geoplanet/test-*
- config_name: openaddresses
  data_files:
  - split: train
    path: openaddresses/train-*
  - split: test
    path: openaddresses/test-*
- config_name: openstreetmap_addresses
  data_files:
  - split: train
    path: openstreetmap_addresses/train-*
  - split: test
    path: openstreetmap_addresses/test-*
- config_name: openstreetmap_places
  data_files:
  - split: train
    path: openstreetmap_places/train-*
  - split: test
    path: openstreetmap_places/test-*
- config_name: openstreetmap_ways
  data_files:
  - split: train
    path: openstreetmap_ways/train-*
  - split: test
    path: openstreetmap_ways/test-*
- config_name: uk_openaddresses
  data_files:
  - split: train
    path: uk_openaddresses/train-*
  - split: test
    path: uk_openaddresses/test-*
---

# Under Construction: Libpostal training dataset

For licensing information refer to the [libpostal readme](https://github.com/openvenues/libpostal/blob/57eaa414ceadb48d5922099eeaa446b02894a2e4/README.md#parser-training-sets)
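The card metadata maps each config name to a `train` and a `test` split. As a rough sketch of how one might peek at a few rows, the snippet below streams a config instead of downloading it in full. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the id `ellenhp/libpostal`; the helper name `peek` and its defaults are hypothetical, chosen only for illustration. The card does not document what the integer `tags` mean, so the sketch leaves them uninterpreted.

```python
from itertools import islice

# Config names as listed in the dataset card metadata.
CONFIG_NAMES = [
    "geoplanet",
    "openaddresses",
    "openstreetmap_addresses",
    "openstreetmap_places",
    "openstreetmap_ways",
    "uk_openaddresses",
]


def peek(config_name="uk_openaddresses", split="test", limit=3):
    """Return the first `limit` rows of one config (hypothetical helper)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so this module loads even without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files without a full download.
    stream = load_dataset(
        "ellenhp/libpostal", config_name, split=split, streaming=True
    )
    return list(islice(stream, limit))


# Example (needs network access and `pip install datasets`):
# for row in peek():
#     print(row["lang"], row["country"], row["tokens"], row["tags"])
```

Defaulting to `uk_openaddresses` keeps the experiment cheap, since it is by far the smallest config listed in the metadata.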
Provider:
ellenhp
Summary of original information

Dataset overview

1. geoplanet

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 130909357 examples, 14351561106.506264 bytes
    • Test: 1322317 examples, 144965292.49373704 bytes
  • Download size: 4039968748 bytes
  • Dataset size: 14496526399.0 bytes

2. openaddresses

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 446673083 examples, 61976401746.036736 bytes
    • Test: 4511850 examples, 626024353.9632672 bytes
  • Download size: 21993145847 bytes
  • Dataset size: 62602426100.0 bytes

3. openstreetmap_addresses

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 316914300 examples, 35406459284.34487 bytes
    • Test: 3201155 examples, 357641053.655127 bytes
  • Download size: 13366122418 bytes
  • Dataset size: 35764100338.0 bytes

4. openstreetmap_places

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 48989431 examples, 4341603986.997948 bytes
    • Test: 494843 examples, 43854609.00205241 bytes
  • Download size: 1597409288 bytes
  • Dataset size: 4385458596.0 bytes

5. openstreetmap_ways

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 72476682 examples, 9703643777.614073 bytes
    • Test: 732088 examples, 98016644.3859272 bytes
  • Download size: 3262932325 bytes
  • Dataset size: 9801660422.0 bytes

6. uk_openaddresses

  • Features:
    • lang: string
    • country: string
    • tokens: sequence of strings
    • tags: sequence of integers
  • Splits:
    • Train: 1724602 examples, 212476615.73347768 bytes
    • Test: 17421 examples, 2146324.2665223135 bytes
  • Download size: 50229957 bytes
  • Dataset size: 214622940.0 bytes