deepparse/worldwide-addresses

Source: Hugging Face. Updated 2026-01-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/deepparse/worldwide-addresses
Resource description:
--- configs: - config_name: wf data_files: - split: train path: - "wf/chunk-0.parquet" - config_name: lb data_files: - split: train path: - "lb/chunk-0.parquet" - config_name: ke data_files: - split: train path: - "ke/chunk-0.parquet" - config_name: gr data_files: - split: train path: - "gr/chunk-0.parquet" - config_name: lc data_files: - split: train path: - "lc/chunk-0.parquet" - config_name: ni data_files: - split: train path: - "ni/chunk-0.parquet" - config_name: ax data_files: - split: train path: - "ax/chunk-0.parquet" - config_name: iq data_files: - split: train path: - "iq/chunk-0.parquet" - config_name: eg data_files: - split: train path: - "eg/chunk-0.parquet" - config_name: ky data_files: - split: train path: - "ky/chunk-0.parquet" - config_name: us data_files: - split: train path: - "us/chunk-0.parquet" - "us/chunk-1.parquet" - "us/chunk-2.parquet" - "us/chunk-3.parquet" - config_name: ar data_files: - split: train path: - "ar/chunk-0.parquet" - config_name: gm data_files: - split: train path: - "gm/chunk-0.parquet" - config_name: rw data_files: - split: train path: - "rw/chunk-0.parquet" - config_name: sk data_files: - split: train path: - "sk/chunk-0.parquet" - config_name: mx data_files: - split: train path: - "mx/chunk-0.parquet" - config_name: km data_files: - split: train path: - "km/chunk-0.parquet" - config_name: sz data_files: - split: train path: - "sz/chunk-0.parquet" - config_name: hn data_files: - split: train path: - "hn/chunk-0.parquet" - config_name: im data_files: - split: train path: - "im/chunk-0.parquet" - config_name: as data_files: - split: train path: - "as/chunk-0.parquet" - config_name: cr data_files: - split: train path: - "cr/chunk-0.parquet" - config_name: fr data_files: - split: train path: - "fr/chunk-0.parquet" - config_name: io data_files: - split: train path: - "io/chunk-0.parquet" - config_name: do data_files: - split: train path: - "do/chunk-0.parquet" - config_name: bz data_files: - split: train path: - 
"bz/chunk-0.parquet" - config_name: co data_files: - split: train path: - "co/chunk-0.parquet" - config_name: ro data_files: - split: train path: - "ro/chunk-0.parquet" - config_name: fk data_files: - split: train path: - "fk/chunk-0.parquet" - config_name: mw data_files: - split: train path: - "mw/chunk-0.parquet" - config_name: la data_files: - split: train path: - "la/chunk-0.parquet" - config_name: lu data_files: - split: train path: - "lu/chunk-0.parquet" - config_name: pf data_files: - split: train path: - "pf/chunk-0.parquet" - config_name: pk data_files: - split: train path: - "pk/chunk-0.parquet" - config_name: mq data_files: - split: train path: - "mq/chunk-0.parquet" - config_name: vn data_files: - split: train path: - "vn/chunk-0.parquet" - config_name: sh data_files: - split: train path: - "sh/chunk-0.parquet" - config_name: li data_files: - split: train path: - "li/chunk-0.parquet" - config_name: vc data_files: - split: train path: - "vc/chunk-0.parquet" - config_name: mv data_files: - split: train path: - "mv/chunk-0.parquet" - config_name: re data_files: - split: train path: - "re/chunk-0.parquet" - config_name: bo data_files: - split: train path: - "bo/chunk-0.parquet" - config_name: vg data_files: - split: train path: - "vg/chunk-0.parquet" - config_name: id data_files: - split: train path: - "id/chunk-0.parquet" - config_name: ge data_files: - split: train path: - "ge/chunk-0.parquet" - config_name: dz data_files: - split: train path: - "dz/chunk-0.parquet" - config_name: np data_files: - split: train path: - "np/chunk-0.parquet" - config_name: bq data_files: - split: train path: - "bq/chunk-0.parquet" - config_name: pa data_files: - split: train path: - "pa/chunk-0.parquet" - config_name: cz data_files: - split: train path: - "cz/chunk-0.parquet" - config_name: bs data_files: - split: train path: - "bs/chunk-0.parquet" - config_name: pl data_files: - split: train path: - "pl/chunk-0.parquet" - config_name: ec data_files: - split: train path: - 
"ec/chunk-0.parquet" - config_name: vu data_files: - split: train path: - "vu/chunk-0.parquet" - config_name: ie data_files: - split: train path: - "ie/chunk-0.parquet" - config_name: kp data_files: - split: train path: - "kp/chunk-0.parquet" - config_name: pn data_files: - split: train path: - "pn/chunk-0.parquet" - config_name: bm data_files: - split: train path: - "bm/chunk-0.parquet" - config_name: my data_files: - split: train path: - "my/chunk-0.parquet" - config_name: mz data_files: - split: train path: - "mz/chunk-0.parquet" - config_name: br data_files: - split: train path: - "br/chunk-0.parquet" - "br/chunk-1.parquet" - config_name: py data_files: - split: train path: - "py/chunk-0.parquet" - config_name: ps data_files: - split: train path: - "ps/chunk-0.parquet" - config_name: mr data_files: - split: train path: - "mr/chunk-0.parquet" - config_name: pt data_files: - split: train path: - "pt/chunk-0.parquet" - config_name: cd data_files: - split: train path: - "cd/chunk-0.parquet" - config_name: uy data_files: - split: train path: - "uy/chunk-0.parquet" - config_name: hk data_files: - split: train path: - "hk/chunk-0.parquet" - config_name: et data_files: - split: train path: - "et/chunk-0.parquet" - config_name: dk data_files: - split: train path: - "dk/chunk-0.parquet" - config_name: vi data_files: - split: train path: - "vi/chunk-0.parquet" - config_name: mf data_files: - split: train path: - "mf/chunk-0.parquet" - config_name: ao data_files: - split: train path: - "ao/chunk-0.parquet" - config_name: hu data_files: - split: train path: - "hu/chunk-0.parquet" - config_name: nz data_files: - split: train path: - "nz/chunk-0.parquet" - config_name: mc data_files: - split: train path: - "mc/chunk-0.parquet" - config_name: az data_files: - split: train path: - "az/chunk-0.parquet" - config_name: cc data_files: - split: train path: - "cc/chunk-0.parquet" - config_name: ht data_files: - split: train path: - "ht/chunk-0.parquet" - config_name: so data_files: - 
split: train path: - "so/chunk-0.parquet" - config_name: nc data_files: - split: train path: - "nc/chunk-0.parquet" - config_name: mg data_files: - split: train path: - "mg/chunk-0.parquet" - config_name: rs data_files: - split: train path: - "rs/chunk-0.parquet" - config_name: au data_files: - split: train path: - "au/chunk-0.parquet" - config_name: ly data_files: - split: train path: - "ly/chunk-0.parquet" - config_name: ph data_files: - split: train path: - "ph/chunk-0.parquet" - config_name: aw data_files: - split: train path: - "aw/chunk-0.parquet" - config_name: va data_files: - split: train path: - "va/chunk-0.parquet" - config_name: tz data_files: - split: train path: - "tz/chunk-0.parquet" - config_name: it data_files: - split: train path: - "it/chunk-0.parquet" - config_name: tt data_files: - split: train path: - "tt/chunk-0.parquet" - config_name: bg data_files: - split: train path: - "bg/chunk-0.parquet" - config_name: gl data_files: - split: train path: - "gl/chunk-0.parquet" - config_name: sb data_files: - split: train path: - "sb/chunk-0.parquet" - config_name: bn data_files: - split: train path: - "bn/chunk-0.parquet" - config_name: bf data_files: - split: train path: - "bf/chunk-0.parquet" - config_name: lt data_files: - split: train path: - "lt/chunk-0.parquet" - config_name: om data_files: - split: train path: - "om/chunk-0.parquet" - config_name: gy data_files: - split: train path: - "gy/chunk-0.parquet" - config_name: tj data_files: - split: train path: - "tj/chunk-0.parquet" - config_name: tc data_files: - split: train path: - "tc/chunk-0.parquet" - config_name: qa data_files: - split: train path: - "qa/chunk-0.parquet" - config_name: gp data_files: - split: train path: - "gp/chunk-0.parquet" - config_name: gq data_files: - split: train path: - "gq/chunk-0.parquet" - config_name: za data_files: - split: train path: - "za/chunk-0.parquet" - config_name: cn data_files: - split: train path: - "cn/chunk-0.parquet" - config_name: tf data_files: - 
split: train path: - "tf/chunk-0.parquet" - config_name: st data_files: - split: train path: - "st/chunk-0.parquet" - config_name: dj data_files: - split: train path: - "dj/chunk-0.parquet" - config_name: mh data_files: - split: train path: - "mh/chunk-0.parquet" - config_name: ag data_files: - split: train path: - "ag/chunk-0.parquet" - config_name: sy data_files: - split: train path: - "sy/chunk-0.parquet" - config_name: ci data_files: - split: train path: - "ci/chunk-0.parquet" - config_name: ga data_files: - split: train path: - "ga/chunk-0.parquet" - config_name: ai data_files: - split: train path: - "ai/chunk-0.parquet" - config_name: kw data_files: - split: train path: - "kw/chunk-0.parquet" - config_name: ir data_files: - split: train path: - "ir/chunk-0.parquet" - config_name: ng data_files: - split: train path: - "ng/chunk-0.parquet" - config_name: zw data_files: - split: train path: - "zw/chunk-0.parquet" - config_name: sd data_files: - split: train path: - "sd/chunk-0.parquet" - config_name: bw data_files: - split: train path: - "bw/chunk-0.parquet" - config_name: sa data_files: - split: train path: - "sa/chunk-0.parquet" - config_name: sv data_files: - split: train path: - "sv/chunk-0.parquet" - config_name: al data_files: - split: train path: - "al/chunk-0.parquet" - config_name: md data_files: - split: train path: - "md/chunk-0.parquet" - config_name: kz data_files: - split: train path: - "kz/chunk-0.parquet" - config_name: tr data_files: - split: train path: - "tr/chunk-0.parquet" - config_name: gb data_files: - split: train path: - "gb/chunk-0.parquet" - "gb/chunk-1.parquet" - config_name: cg data_files: - split: train path: - "cg/chunk-0.parquet" - config_name: ve data_files: - split: train path: - "ve/chunk-0.parquet" - config_name: cm data_files: - split: train path: - "cm/chunk-0.parquet" - config_name: ca data_files: - split: train path: - "ca/chunk-0.parquet" - "ca/chunk-1.parquet" - config_name: mt data_files: - split: train path: - 
"mt/chunk-0.parquet" - config_name: ba data_files: - split: train path: - "ba/chunk-0.parquet" - config_name: sn data_files: - split: train path: - "sn/chunk-0.parquet" - config_name: ne data_files: - split: train path: - "ne/chunk-0.parquet" - config_name: fj data_files: - split: train path: - "fj/chunk-0.parquet" - config_name: ki data_files: - split: train path: - "ki/chunk-0.parquet" - config_name: si data_files: - split: train path: - "si/chunk-0.parquet" - config_name: nf data_files: - split: train path: - "nf/chunk-0.parquet" - config_name: sg data_files: - split: train path: - "sg/chunk-0.parquet" - config_name: tv data_files: - split: train path: - "tv/chunk-0.parquet" - config_name: bj data_files: - split: train path: - "bj/chunk-0.parquet" - config_name: ss data_files: - split: train path: - "ss/chunk-0.parquet" - config_name: mp data_files: - split: train path: - "mp/chunk-0.parquet" - config_name: ml data_files: - split: train path: - "ml/chunk-0.parquet" - config_name: tn data_files: - split: train path: - "tn/chunk-0.parquet" - config_name: jm data_files: - split: train path: - "jm/chunk-0.parquet" - config_name: es data_files: - split: train path: - "es/chunk-0.parquet" - config_name: de data_files: - split: train path: - "de/chunk-0.parquet" - "de/chunk-1.parquet" - config_name: cf data_files: - split: train path: - "cf/chunk-0.parquet" - config_name: tw data_files: - split: train path: - "tw/chunk-0.parquet" - config_name: zm data_files: - split: train path: - "zm/chunk-0.parquet" - config_name: ch data_files: - split: train path: - "ch/chunk-0.parquet" - config_name: lv data_files: - split: train path: - "lv/chunk-0.parquet" - config_name: ua data_files: - split: train path: - "ua/chunk-0.parquet" - config_name: kr data_files: - split: train path: - "kr/chunk-0.parquet" - config_name: gu data_files: - split: train path: - "gu/chunk-0.parquet" - config_name: cl data_files: - split: train path: - "cl/chunk-0.parquet" - config_name: kh data_files: - 
split: train path: - "kh/chunk-0.parquet" - config_name: ls data_files: - split: train path: - "ls/chunk-0.parquet" - config_name: mu data_files: - split: train path: - "mu/chunk-0.parquet" - config_name: nu data_files: - split: train path: - "nu/chunk-0.parquet" - config_name: gd data_files: - split: train path: - "gd/chunk-0.parquet" - config_name: um data_files: - split: train path: - "um/chunk-0.parquet" - config_name: in data_files: - split: train path: - "in/chunk-0.parquet" - config_name: sr data_files: - split: train path: - "sr/chunk-0.parquet" - config_name: td data_files: - split: train path: - "td/chunk-0.parquet" - config_name: ad data_files: - split: train path: - "ad/chunk-0.parquet" - config_name: se data_files: - split: train path: - "se/chunk-0.parquet" - config_name: sl data_files: - split: train path: - "sl/chunk-0.parquet" - config_name: gf data_files: - split: train path: - "gf/chunk-0.parquet" - config_name: yt data_files: - split: train path: - "yt/chunk-0.parquet" - config_name: fm data_files: - split: train path: - "fm/chunk-0.parquet" - config_name: am data_files: - split: train path: - "am/chunk-0.parquet" - config_name: sc data_files: - split: train path: - "sc/chunk-0.parquet" - config_name: bd data_files: - split: train path: - "bd/chunk-0.parquet" - config_name: tl data_files: - split: train path: - "tl/chunk-0.parquet" - config_name: kg data_files: - split: train path: - "kg/chunk-0.parquet" - config_name: ye data_files: - split: train path: - "ye/chunk-0.parquet" - config_name: kn data_files: - split: train path: - "kn/chunk-0.parquet" - config_name: pe data_files: - split: train path: - "pe/chunk-0.parquet" - config_name: at data_files: - split: train path: - "at/chunk-0.parquet" - config_name: tg data_files: - split: train path: - "tg/chunk-0.parquet" - config_name: pm data_files: - split: train path: - "pm/chunk-0.parquet" - config_name: me data_files: - split: train path: - "me/chunk-0.parquet" - config_name: 'no' data_files: - 
split: train path: - "no/chunk-0.parquet" - config_name: gh data_files: - split: train path: - "gh/chunk-0.parquet" - config_name: bh data_files: - split: train path: - "bh/chunk-0.parquet" - config_name: ws data_files: - split: train path: - "ws/chunk-0.parquet" - config_name: nl data_files: - split: train path: - "nl/chunk-0.parquet" - config_name: is data_files: - split: train path: - "is/chunk-0.parquet" - config_name: lk data_files: - split: train path: - "lk/chunk-0.parquet" - config_name: fi data_files: - split: train path: - "fi/chunk-0.parquet" - config_name: bt data_files: - split: train path: - "bt/chunk-0.parquet" - config_name: gn data_files: - split: train path: - "gn/chunk-0.parquet" - config_name: cx data_files: - split: train path: - "cx/chunk-0.parquet" - config_name: cv data_files: - split: train path: - "cv/chunk-0.parquet" - config_name: mn data_files: - split: train path: - "mn/chunk-0.parquet" - config_name: mm data_files: - split: train path: - "mm/chunk-0.parquet" - config_name: bl data_files: - split: train path: - "bl/chunk-0.parquet" - config_name: af data_files: - split: train path: - "af/chunk-0.parquet" - config_name: ee data_files: - split: train path: - "ee/chunk-0.parquet" - config_name: mo data_files: - split: train path: - "mo/chunk-0.parquet" - config_name: cu data_files: - split: train path: - "cu/chunk-0.parquet" - config_name: er data_files: - split: train path: - "er/chunk-0.parquet" - config_name: lr data_files: - split: train path: - "lr/chunk-0.parquet" - config_name: sx data_files: - split: train path: - "sx/chunk-0.parquet" - config_name: uz data_files: - split: train path: - "uz/chunk-0.parquet" - config_name: dm data_files: - split: train path: - "dm/chunk-0.parquet" - config_name: ms data_files: - split: train path: - "ms/chunk-0.parquet" - config_name: to data_files: - split: train path: - "to/chunk-0.parquet" - config_name: pw data_files: - split: train path: - "pw/chunk-0.parquet" - config_name: na data_files: - 
split: train path: - "na/chunk-0.parquet" - config_name: pg data_files: - split: train path: - "pg/chunk-0.parquet" - config_name: be data_files: - split: train path: - "be/chunk-0.parquet" - config_name: bb data_files: - split: train path: - "bb/chunk-0.parquet" - config_name: gg data_files: - split: train path: - "gg/chunk-0.parquet" - config_name: th data_files: - split: train path: - "th/chunk-0.parquet" - config_name: ae data_files: - split: train path: - "ae/chunk-0.parquet" - config_name: mk data_files: - split: train path: - "mk/chunk-0.parquet" - config_name: ck data_files: - split: train path: - "ck/chunk-0.parquet" - config_name: hr data_files: - split: train path: - "hr/chunk-0.parquet" - config_name: ug data_files: - split: train path: - "ug/chunk-0.parquet" - config_name: il data_files: - split: train path: - "il/chunk-0.parquet" - config_name: fo data_files: - split: train path: - "fo/chunk-0.parquet" - config_name: ru data_files: - split: train path: - "ru/chunk-0.parquet" - config_name: jo data_files: - split: train path: - "jo/chunk-0.parquet" - config_name: tm data_files: - split: train path: - "tm/chunk-0.parquet" - config_name: jp data_files: - split: train path: - "jp/chunk-0.parquet" - config_name: gt data_files: - split: train path: - "gt/chunk-0.parquet" - config_name: gw data_files: - split: train path: - "gw/chunk-0.parquet" - config_name: cy data_files: - split: train path: - "cy/chunk-0.parquet" - config_name: cw data_files: - split: train path: - "cw/chunk-0.parquet" - config_name: sm data_files: - split: train path: - "sm/chunk-0.parquet" - config_name: nr data_files: - split: train path: - "nr/chunk-0.parquet" - config_name: ma data_files: - split: train path: - "ma/chunk-0.parquet" - config_name: bi data_files: - split: train path: - "bi/chunk-0.parquet" - config_name: by data_files: - split: train path: - "by/chunk-0.parquet" - config_name: pr data_files: - split: train path: - "pr/chunk-0.parquet" task_categories: - 
token-classification license: cc-by-4.0 ---

# Dataset Card for worldwide-addresses

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

### Dataset Summary

This dataset is a collection of annotated international addresses containing over 750,000,000 addresses from 240 countries in over 100 languages. It has been created from the data gathered and provided by [libpostal](https://github.com/openvenues/libpostal/tree/master), an international street address parsing package. The original purpose of this dataset was to develop a state-of-the-art neural network-based international address parser named [deepparse](https://github.com/GRAAL-Research/deepparse).

The dataset is structured in a way that each country has its own configuration. Therefore the data can be loaded by specifying a country's `ISO 3166-1 alpha-2` code as a config name:

```python
from datasets import load_dataset

ds = load_dataset("deepparse/worldwide-addresses", "us")
print(ds["train"][0])
```

### Supported Tasks and Leaderboards

- `token-classification`: The dataset can be used to train models for token classification, which consists of assigning a class to each token in a text sequence. In this case, the dataset can be used to train an address parser that is able to identify the different elements of an address such as a street name or a postal code.

### Languages

Each country's addresses can be expressed in multiple languages.
For example, the `us` data contains addresses from the United States, which can be in a language other than English (e.g., Spanish). Since this is a Parquet-based dataset and the language is included in each sample's data fields, you can specify a filter to load only the addresses written in a specific language. This is done by specifying the language's `ISO 639-3` code like this:

```python
from datasets import load_dataset

lang_iso = "eng"  # Only load addresses in English (eng)
lang_filter = [("Language", "==", lang_iso)]
ds = load_dataset("deepparse/worldwide-addresses", "us", filters=lang_filter)
```

## Dataset Structure

### Data Instances

Each sample is formatted in the following way:

```
{
  'Address': 'Douglas County Minnesota 56332',
  'Tags': ['County', 'County', 'Province', 'PostalCode'],
  'Language': 'eng'
}
```

### Data Fields

The dataset contains three fields:

- `Address`: a string representing the address itself. There is no punctuation, so the words in the address are separated by whitespace. When training a model for `token-classification`, this is the input.
- `Tags`: the annotations that assign a standardized address element to each word in the address. It is a list whose length equals the number of whitespace-separated words in the `Address` field. The tags are defined as follows:
  - `StreetNumber`: a house or a building number.
  - `StreetName`: the name of the street.
  - `Unit`: an apartment or a unit number.
  - `Suburb`: an unofficial neighbourhood name.
  - `District`: the name of a neighbourhood which has official administrative boundaries.
  - `PostalCode`: a standard postal code, the format of which varies per country.
  - `Municipality`: the name of a city.
  - `Province`: the name of a sub-national division within a country.
  - `County`: the name of a major administrative area within a province.
  - `Country`: the name of a country.
- `Language`: the `ISO 639-3` code representing the language in which the address is written.
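
To make the relationship between `Address` and `Tags` concrete, here is a minimal sketch that pairs each whitespace-separated word with its tag and then merges consecutive words sharing a tag. It runs on the sample instance from this card without downloading anything. The helper names `pair_tokens_with_tags` and `group_by_tag` are hypothetical and are not part of the dataset or of deepparse.

```python
from itertools import groupby


def pair_tokens_with_tags(sample):
    """Pair each whitespace-separated word of the address with its tag."""
    tokens = sample["Address"].split()
    tags = sample["Tags"]
    if len(tokens) != len(tags):
        raise ValueError(
            f"expected one tag per word, got {len(tokens)} words and {len(tags)} tags"
        )
    return list(zip(tokens, tags))


def group_by_tag(pairs):
    """Merge consecutive words that carry the same tag into one component."""
    return [
        (tag, " ".join(token for token, _ in group))
        for tag, group in groupby(pairs, key=lambda pair: pair[1])
    ]


# The sample instance shown in the "Data Instances" section of this card.
sample = {
    "Address": "Douglas County Minnesota 56332",
    "Tags": ["County", "County", "Province", "PostalCode"],
    "Language": "eng",
}

pairs = pair_tokens_with_tags(sample)
components = group_by_tag(pairs)
print(pairs)
print(components)
# components == [('County', 'Douglas County'), ('Province', 'Minnesota'), ('PostalCode', '56332')]
```

Grouping only merges adjacent words, so an address in which the same tag appears in two separate places would yield two separate components for that tag.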
## Dataset Creation

This dataset was curated and adapted from an international addresses dataset published by [libpostal](https://github.com/openvenues/libpostal/tree/master). For more details on the dataset creation process, see their [post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718).

## Additional Information

### Licensing Information

This dataset is shared under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

### Citation Information

If you use this dataset, please cite the following:

```
@misc{worldwide-addresses,
  author = {Marouane Yassine and David Beauchemin},
  title = {{Structured Multinational Address Data}},
  year = {2026},
  note = {\url{https://huggingface.co/datasets/deepparse/worldwide-addresses}}
}
```

### Contributions

Thanks to [@albarrentine](https://github.com/albarrentine) for creating and sharing the original dataset used to train libpostal's models, and for developing the package itself. It is quite an impressive piece of work!
Provider:
deepparse