deepparse/worldwide-addresses
Source: Hugging Face. Updated: 2026-01-17. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/deepparse/worldwide-addresses
Resource description:
---
configs:
- config_name: wf
data_files:
- split: train
path:
- "wf/chunk-0.parquet"
- config_name: lb
data_files:
- split: train
path:
- "lb/chunk-0.parquet"
- config_name: ke
data_files:
- split: train
path:
- "ke/chunk-0.parquet"
- config_name: gr
data_files:
- split: train
path:
- "gr/chunk-0.parquet"
- config_name: lc
data_files:
- split: train
path:
- "lc/chunk-0.parquet"
- config_name: ni
data_files:
- split: train
path:
- "ni/chunk-0.parquet"
- config_name: ax
data_files:
- split: train
path:
- "ax/chunk-0.parquet"
- config_name: iq
data_files:
- split: train
path:
- "iq/chunk-0.parquet"
- config_name: eg
data_files:
- split: train
path:
- "eg/chunk-0.parquet"
- config_name: ky
data_files:
- split: train
path:
- "ky/chunk-0.parquet"
- config_name: us
data_files:
- split: train
path:
- "us/chunk-0.parquet"
- "us/chunk-1.parquet"
- "us/chunk-2.parquet"
- "us/chunk-3.parquet"
- config_name: ar
data_files:
- split: train
path:
- "ar/chunk-0.parquet"
- config_name: gm
data_files:
- split: train
path:
- "gm/chunk-0.parquet"
- config_name: rw
data_files:
- split: train
path:
- "rw/chunk-0.parquet"
- config_name: sk
data_files:
- split: train
path:
- "sk/chunk-0.parquet"
- config_name: mx
data_files:
- split: train
path:
- "mx/chunk-0.parquet"
- config_name: km
data_files:
- split: train
path:
- "km/chunk-0.parquet"
- config_name: sz
data_files:
- split: train
path:
- "sz/chunk-0.parquet"
- config_name: hn
data_files:
- split: train
path:
- "hn/chunk-0.parquet"
- config_name: im
data_files:
- split: train
path:
- "im/chunk-0.parquet"
- config_name: as
data_files:
- split: train
path:
- "as/chunk-0.parquet"
- config_name: cr
data_files:
- split: train
path:
- "cr/chunk-0.parquet"
- config_name: fr
data_files:
- split: train
path:
- "fr/chunk-0.parquet"
- config_name: io
data_files:
- split: train
path:
- "io/chunk-0.parquet"
- config_name: do
data_files:
- split: train
path:
- "do/chunk-0.parquet"
- config_name: bz
data_files:
- split: train
path:
- "bz/chunk-0.parquet"
- config_name: co
data_files:
- split: train
path:
- "co/chunk-0.parquet"
- config_name: ro
data_files:
- split: train
path:
- "ro/chunk-0.parquet"
- config_name: fk
data_files:
- split: train
path:
- "fk/chunk-0.parquet"
- config_name: mw
data_files:
- split: train
path:
- "mw/chunk-0.parquet"
- config_name: la
data_files:
- split: train
path:
- "la/chunk-0.parquet"
- config_name: lu
data_files:
- split: train
path:
- "lu/chunk-0.parquet"
- config_name: pf
data_files:
- split: train
path:
- "pf/chunk-0.parquet"
- config_name: pk
data_files:
- split: train
path:
- "pk/chunk-0.parquet"
- config_name: mq
data_files:
- split: train
path:
- "mq/chunk-0.parquet"
- config_name: vn
data_files:
- split: train
path:
- "vn/chunk-0.parquet"
- config_name: sh
data_files:
- split: train
path:
- "sh/chunk-0.parquet"
- config_name: li
data_files:
- split: train
path:
- "li/chunk-0.parquet"
- config_name: vc
data_files:
- split: train
path:
- "vc/chunk-0.parquet"
- config_name: mv
data_files:
- split: train
path:
- "mv/chunk-0.parquet"
- config_name: re
data_files:
- split: train
path:
- "re/chunk-0.parquet"
- config_name: bo
data_files:
- split: train
path:
- "bo/chunk-0.parquet"
- config_name: vg
data_files:
- split: train
path:
- "vg/chunk-0.parquet"
- config_name: id
data_files:
- split: train
path:
- "id/chunk-0.parquet"
- config_name: ge
data_files:
- split: train
path:
- "ge/chunk-0.parquet"
- config_name: dz
data_files:
- split: train
path:
- "dz/chunk-0.parquet"
- config_name: np
data_files:
- split: train
path:
- "np/chunk-0.parquet"
- config_name: bq
data_files:
- split: train
path:
- "bq/chunk-0.parquet"
- config_name: pa
data_files:
- split: train
path:
- "pa/chunk-0.parquet"
- config_name: cz
data_files:
- split: train
path:
- "cz/chunk-0.parquet"
- config_name: bs
data_files:
- split: train
path:
- "bs/chunk-0.parquet"
- config_name: pl
data_files:
- split: train
path:
- "pl/chunk-0.parquet"
- config_name: ec
data_files:
- split: train
path:
- "ec/chunk-0.parquet"
- config_name: vu
data_files:
- split: train
path:
- "vu/chunk-0.parquet"
- config_name: ie
data_files:
- split: train
path:
- "ie/chunk-0.parquet"
- config_name: kp
data_files:
- split: train
path:
- "kp/chunk-0.parquet"
- config_name: pn
data_files:
- split: train
path:
- "pn/chunk-0.parquet"
- config_name: bm
data_files:
- split: train
path:
- "bm/chunk-0.parquet"
- config_name: my
data_files:
- split: train
path:
- "my/chunk-0.parquet"
- config_name: mz
data_files:
- split: train
path:
- "mz/chunk-0.parquet"
- config_name: br
data_files:
- split: train
path:
- "br/chunk-0.parquet"
- "br/chunk-1.parquet"
- config_name: py
data_files:
- split: train
path:
- "py/chunk-0.parquet"
- config_name: ps
data_files:
- split: train
path:
- "ps/chunk-0.parquet"
- config_name: mr
data_files:
- split: train
path:
- "mr/chunk-0.parquet"
- config_name: pt
data_files:
- split: train
path:
- "pt/chunk-0.parquet"
- config_name: cd
data_files:
- split: train
path:
- "cd/chunk-0.parquet"
- config_name: uy
data_files:
- split: train
path:
- "uy/chunk-0.parquet"
- config_name: hk
data_files:
- split: train
path:
- "hk/chunk-0.parquet"
- config_name: et
data_files:
- split: train
path:
- "et/chunk-0.parquet"
- config_name: dk
data_files:
- split: train
path:
- "dk/chunk-0.parquet"
- config_name: vi
data_files:
- split: train
path:
- "vi/chunk-0.parquet"
- config_name: mf
data_files:
- split: train
path:
- "mf/chunk-0.parquet"
- config_name: ao
data_files:
- split: train
path:
- "ao/chunk-0.parquet"
- config_name: hu
data_files:
- split: train
path:
- "hu/chunk-0.parquet"
- config_name: nz
data_files:
- split: train
path:
- "nz/chunk-0.parquet"
- config_name: mc
data_files:
- split: train
path:
- "mc/chunk-0.parquet"
- config_name: az
data_files:
- split: train
path:
- "az/chunk-0.parquet"
- config_name: cc
data_files:
- split: train
path:
- "cc/chunk-0.parquet"
- config_name: ht
data_files:
- split: train
path:
- "ht/chunk-0.parquet"
- config_name: so
data_files:
- split: train
path:
- "so/chunk-0.parquet"
- config_name: nc
data_files:
- split: train
path:
- "nc/chunk-0.parquet"
- config_name: mg
data_files:
- split: train
path:
- "mg/chunk-0.parquet"
- config_name: rs
data_files:
- split: train
path:
- "rs/chunk-0.parquet"
- config_name: au
data_files:
- split: train
path:
- "au/chunk-0.parquet"
- config_name: ly
data_files:
- split: train
path:
- "ly/chunk-0.parquet"
- config_name: ph
data_files:
- split: train
path:
- "ph/chunk-0.parquet"
- config_name: aw
data_files:
- split: train
path:
- "aw/chunk-0.parquet"
- config_name: va
data_files:
- split: train
path:
- "va/chunk-0.parquet"
- config_name: tz
data_files:
- split: train
path:
- "tz/chunk-0.parquet"
- config_name: it
data_files:
- split: train
path:
- "it/chunk-0.parquet"
- config_name: tt
data_files:
- split: train
path:
- "tt/chunk-0.parquet"
- config_name: bg
data_files:
- split: train
path:
- "bg/chunk-0.parquet"
- config_name: gl
data_files:
- split: train
path:
- "gl/chunk-0.parquet"
- config_name: sb
data_files:
- split: train
path:
- "sb/chunk-0.parquet"
- config_name: bn
data_files:
- split: train
path:
- "bn/chunk-0.parquet"
- config_name: bf
data_files:
- split: train
path:
- "bf/chunk-0.parquet"
- config_name: lt
data_files:
- split: train
path:
- "lt/chunk-0.parquet"
- config_name: om
data_files:
- split: train
path:
- "om/chunk-0.parquet"
- config_name: gy
data_files:
- split: train
path:
- "gy/chunk-0.parquet"
- config_name: tj
data_files:
- split: train
path:
- "tj/chunk-0.parquet"
- config_name: tc
data_files:
- split: train
path:
- "tc/chunk-0.parquet"
- config_name: qa
data_files:
- split: train
path:
- "qa/chunk-0.parquet"
- config_name: gp
data_files:
- split: train
path:
- "gp/chunk-0.parquet"
- config_name: gq
data_files:
- split: train
path:
- "gq/chunk-0.parquet"
- config_name: za
data_files:
- split: train
path:
- "za/chunk-0.parquet"
- config_name: cn
data_files:
- split: train
path:
- "cn/chunk-0.parquet"
- config_name: tf
data_files:
- split: train
path:
- "tf/chunk-0.parquet"
- config_name: st
data_files:
- split: train
path:
- "st/chunk-0.parquet"
- config_name: dj
data_files:
- split: train
path:
- "dj/chunk-0.parquet"
- config_name: mh
data_files:
- split: train
path:
- "mh/chunk-0.parquet"
- config_name: ag
data_files:
- split: train
path:
- "ag/chunk-0.parquet"
- config_name: sy
data_files:
- split: train
path:
- "sy/chunk-0.parquet"
- config_name: ci
data_files:
- split: train
path:
- "ci/chunk-0.parquet"
- config_name: ga
data_files:
- split: train
path:
- "ga/chunk-0.parquet"
- config_name: ai
data_files:
- split: train
path:
- "ai/chunk-0.parquet"
- config_name: kw
data_files:
- split: train
path:
- "kw/chunk-0.parquet"
- config_name: ir
data_files:
- split: train
path:
- "ir/chunk-0.parquet"
- config_name: ng
data_files:
- split: train
path:
- "ng/chunk-0.parquet"
- config_name: zw
data_files:
- split: train
path:
- "zw/chunk-0.parquet"
- config_name: sd
data_files:
- split: train
path:
- "sd/chunk-0.parquet"
- config_name: bw
data_files:
- split: train
path:
- "bw/chunk-0.parquet"
- config_name: sa
data_files:
- split: train
path:
- "sa/chunk-0.parquet"
- config_name: sv
data_files:
- split: train
path:
- "sv/chunk-0.parquet"
- config_name: al
data_files:
- split: train
path:
- "al/chunk-0.parquet"
- config_name: md
data_files:
- split: train
path:
- "md/chunk-0.parquet"
- config_name: kz
data_files:
- split: train
path:
- "kz/chunk-0.parquet"
- config_name: tr
data_files:
- split: train
path:
- "tr/chunk-0.parquet"
- config_name: gb
data_files:
- split: train
path:
- "gb/chunk-0.parquet"
- "gb/chunk-1.parquet"
- config_name: cg
data_files:
- split: train
path:
- "cg/chunk-0.parquet"
- config_name: ve
data_files:
- split: train
path:
- "ve/chunk-0.parquet"
- config_name: cm
data_files:
- split: train
path:
- "cm/chunk-0.parquet"
- config_name: ca
data_files:
- split: train
path:
- "ca/chunk-0.parquet"
- "ca/chunk-1.parquet"
- config_name: mt
data_files:
- split: train
path:
- "mt/chunk-0.parquet"
- config_name: ba
data_files:
- split: train
path:
- "ba/chunk-0.parquet"
- config_name: sn
data_files:
- split: train
path:
- "sn/chunk-0.parquet"
- config_name: ne
data_files:
- split: train
path:
- "ne/chunk-0.parquet"
- config_name: fj
data_files:
- split: train
path:
- "fj/chunk-0.parquet"
- config_name: ki
data_files:
- split: train
path:
- "ki/chunk-0.parquet"
- config_name: si
data_files:
- split: train
path:
- "si/chunk-0.parquet"
- config_name: nf
data_files:
- split: train
path:
- "nf/chunk-0.parquet"
- config_name: sg
data_files:
- split: train
path:
- "sg/chunk-0.parquet"
- config_name: tv
data_files:
- split: train
path:
- "tv/chunk-0.parquet"
- config_name: bj
data_files:
- split: train
path:
- "bj/chunk-0.parquet"
- config_name: ss
data_files:
- split: train
path:
- "ss/chunk-0.parquet"
- config_name: mp
data_files:
- split: train
path:
- "mp/chunk-0.parquet"
- config_name: ml
data_files:
- split: train
path:
- "ml/chunk-0.parquet"
- config_name: tn
data_files:
- split: train
path:
- "tn/chunk-0.parquet"
- config_name: jm
data_files:
- split: train
path:
- "jm/chunk-0.parquet"
- config_name: es
data_files:
- split: train
path:
- "es/chunk-0.parquet"
- config_name: de
data_files:
- split: train
path:
- "de/chunk-0.parquet"
- "de/chunk-1.parquet"
- config_name: cf
data_files:
- split: train
path:
- "cf/chunk-0.parquet"
- config_name: tw
data_files:
- split: train
path:
- "tw/chunk-0.parquet"
- config_name: zm
data_files:
- split: train
path:
- "zm/chunk-0.parquet"
- config_name: ch
data_files:
- split: train
path:
- "ch/chunk-0.parquet"
- config_name: lv
data_files:
- split: train
path:
- "lv/chunk-0.parquet"
- config_name: ua
data_files:
- split: train
path:
- "ua/chunk-0.parquet"
- config_name: kr
data_files:
- split: train
path:
- "kr/chunk-0.parquet"
- config_name: gu
data_files:
- split: train
path:
- "gu/chunk-0.parquet"
- config_name: cl
data_files:
- split: train
path:
- "cl/chunk-0.parquet"
- config_name: kh
data_files:
- split: train
path:
- "kh/chunk-0.parquet"
- config_name: ls
data_files:
- split: train
path:
- "ls/chunk-0.parquet"
- config_name: mu
data_files:
- split: train
path:
- "mu/chunk-0.parquet"
- config_name: nu
data_files:
- split: train
path:
- "nu/chunk-0.parquet"
- config_name: gd
data_files:
- split: train
path:
- "gd/chunk-0.parquet"
- config_name: um
data_files:
- split: train
path:
- "um/chunk-0.parquet"
- config_name: in
data_files:
- split: train
path:
- "in/chunk-0.parquet"
- config_name: sr
data_files:
- split: train
path:
- "sr/chunk-0.parquet"
- config_name: td
data_files:
- split: train
path:
- "td/chunk-0.parquet"
- config_name: ad
data_files:
- split: train
path:
- "ad/chunk-0.parquet"
- config_name: se
data_files:
- split: train
path:
- "se/chunk-0.parquet"
- config_name: sl
data_files:
- split: train
path:
- "sl/chunk-0.parquet"
- config_name: gf
data_files:
- split: train
path:
- "gf/chunk-0.parquet"
- config_name: yt
data_files:
- split: train
path:
- "yt/chunk-0.parquet"
- config_name: fm
data_files:
- split: train
path:
- "fm/chunk-0.parquet"
- config_name: am
data_files:
- split: train
path:
- "am/chunk-0.parquet"
- config_name: sc
data_files:
- split: train
path:
- "sc/chunk-0.parquet"
- config_name: bd
data_files:
- split: train
path:
- "bd/chunk-0.parquet"
- config_name: tl
data_files:
- split: train
path:
- "tl/chunk-0.parquet"
- config_name: kg
data_files:
- split: train
path:
- "kg/chunk-0.parquet"
- config_name: ye
data_files:
- split: train
path:
- "ye/chunk-0.parquet"
- config_name: kn
data_files:
- split: train
path:
- "kn/chunk-0.parquet"
- config_name: pe
data_files:
- split: train
path:
- "pe/chunk-0.parquet"
- config_name: at
data_files:
- split: train
path:
- "at/chunk-0.parquet"
- config_name: tg
data_files:
- split: train
path:
- "tg/chunk-0.parquet"
- config_name: pm
data_files:
- split: train
path:
- "pm/chunk-0.parquet"
- config_name: me
data_files:
- split: train
path:
- "me/chunk-0.parquet"
- config_name: 'no'
data_files:
- split: train
path:
- "no/chunk-0.parquet"
- config_name: gh
data_files:
- split: train
path:
- "gh/chunk-0.parquet"
- config_name: bh
data_files:
- split: train
path:
- "bh/chunk-0.parquet"
- config_name: ws
data_files:
- split: train
path:
- "ws/chunk-0.parquet"
- config_name: nl
data_files:
- split: train
path:
- "nl/chunk-0.parquet"
- config_name: is
data_files:
- split: train
path:
- "is/chunk-0.parquet"
- config_name: lk
data_files:
- split: train
path:
- "lk/chunk-0.parquet"
- config_name: fi
data_files:
- split: train
path:
- "fi/chunk-0.parquet"
- config_name: bt
data_files:
- split: train
path:
- "bt/chunk-0.parquet"
- config_name: gn
data_files:
- split: train
path:
- "gn/chunk-0.parquet"
- config_name: cx
data_files:
- split: train
path:
- "cx/chunk-0.parquet"
- config_name: cv
data_files:
- split: train
path:
- "cv/chunk-0.parquet"
- config_name: mn
data_files:
- split: train
path:
- "mn/chunk-0.parquet"
- config_name: mm
data_files:
- split: train
path:
- "mm/chunk-0.parquet"
- config_name: bl
data_files:
- split: train
path:
- "bl/chunk-0.parquet"
- config_name: af
data_files:
- split: train
path:
- "af/chunk-0.parquet"
- config_name: ee
data_files:
- split: train
path:
- "ee/chunk-0.parquet"
- config_name: mo
data_files:
- split: train
path:
- "mo/chunk-0.parquet"
- config_name: cu
data_files:
- split: train
path:
- "cu/chunk-0.parquet"
- config_name: er
data_files:
- split: train
path:
- "er/chunk-0.parquet"
- config_name: lr
data_files:
- split: train
path:
- "lr/chunk-0.parquet"
- config_name: sx
data_files:
- split: train
path:
- "sx/chunk-0.parquet"
- config_name: uz
data_files:
- split: train
path:
- "uz/chunk-0.parquet"
- config_name: dm
data_files:
- split: train
path:
- "dm/chunk-0.parquet"
- config_name: ms
data_files:
- split: train
path:
- "ms/chunk-0.parquet"
- config_name: to
data_files:
- split: train
path:
- "to/chunk-0.parquet"
- config_name: pw
data_files:
- split: train
path:
- "pw/chunk-0.parquet"
- config_name: na
data_files:
- split: train
path:
- "na/chunk-0.parquet"
- config_name: pg
data_files:
- split: train
path:
- "pg/chunk-0.parquet"
- config_name: be
data_files:
- split: train
path:
- "be/chunk-0.parquet"
- config_name: bb
data_files:
- split: train
path:
- "bb/chunk-0.parquet"
- config_name: gg
data_files:
- split: train
path:
- "gg/chunk-0.parquet"
- config_name: th
data_files:
- split: train
path:
- "th/chunk-0.parquet"
- config_name: ae
data_files:
- split: train
path:
- "ae/chunk-0.parquet"
- config_name: mk
data_files:
- split: train
path:
- "mk/chunk-0.parquet"
- config_name: ck
data_files:
- split: train
path:
- "ck/chunk-0.parquet"
- config_name: hr
data_files:
- split: train
path:
- "hr/chunk-0.parquet"
- config_name: ug
data_files:
- split: train
path:
- "ug/chunk-0.parquet"
- config_name: il
data_files:
- split: train
path:
- "il/chunk-0.parquet"
- config_name: fo
data_files:
- split: train
path:
- "fo/chunk-0.parquet"
- config_name: ru
data_files:
- split: train
path:
- "ru/chunk-0.parquet"
- config_name: jo
data_files:
- split: train
path:
- "jo/chunk-0.parquet"
- config_name: tm
data_files:
- split: train
path:
- "tm/chunk-0.parquet"
- config_name: jp
data_files:
- split: train
path:
- "jp/chunk-0.parquet"
- config_name: gt
data_files:
- split: train
path:
- "gt/chunk-0.parquet"
- config_name: gw
data_files:
- split: train
path:
- "gw/chunk-0.parquet"
- config_name: cy
data_files:
- split: train
path:
- "cy/chunk-0.parquet"
- config_name: cw
data_files:
- split: train
path:
- "cw/chunk-0.parquet"
- config_name: sm
data_files:
- split: train
path:
- "sm/chunk-0.parquet"
- config_name: nr
data_files:
- split: train
path:
- "nr/chunk-0.parquet"
- config_name: ma
data_files:
- split: train
path:
- "ma/chunk-0.parquet"
- config_name: bi
data_files:
- split: train
path:
- "bi/chunk-0.parquet"
- config_name: by
data_files:
- split: train
path:
- "by/chunk-0.parquet"
- config_name: pr
data_files:
- split: train
path:
- "pr/chunk-0.parquet"
task_categories:
- token-classification
license: cc-by-4.0
---
# Dataset Card for worldwide-addresses
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
### Dataset Summary
This dataset is a collection of annotated international addresses, containing over 750,000,000 addresses from 240 countries in over 100 languages. It was created from the data gathered and provided by [libpostal](https://github.com/openvenues/libpostal/tree/master), an international street address parsing package. The dataset was originally built to develop [deepparse](https://github.com/GRAAL-Research/deepparse), a state-of-the-art neural network-based international address parser.
Each country has its own configuration, so the data can be loaded by passing the country's lowercase `ISO 3166-1 alpha-2` code as the config name:
```python
from datasets import load_dataset
ds = load_dataset("deepparse/worldwide-addresses", "us")
print(ds["train"][0])
```
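With this many addresses, it can be convenient to check a country code and look at a few records before downloading a whole configuration. The following is a minimal sketch, assuming the `streaming=True` option of the `datasets` library; `config_name_for` and `peek` are hypothetical helper names introduced here for illustration and are not part of the dataset or of deepparse.

```python
import re
from itertools import islice


def config_name_for(country_code: str) -> str:
    """Normalize an ISO 3166-1 alpha-2 code to the lowercase config name."""
    code = country_code.strip().lower()
    if not re.fullmatch(r"[a-z]{2}", code):
        raise ValueError(f"not an alpha-2 country code: {country_code!r}")
    return code


def peek(country_code: str, n: int = 3) -> list:
    """Return the first n samples of a country's train split via streaming."""
    # Imported lazily so the validation helper works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "deepparse/worldwide-addresses",
        config_name_for(country_code),
        split="train",
        streaming=True,  # iterate over records instead of downloading everything first
    )
    return list(islice(stream, n))
```

Calling `peek("US")` would need network access and should return the first three records of the `us` configuration. The helper only checks the shape of the code; a well-formed code with no matching configuration would still fail at load time.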
### Supported Tasks and Leaderboards
- `token-classification`: The dataset can be used to train models for token classification, which consists of assigning a class to each token in a text sequence. Here, that means training an address parser that identifies the different elements of an address, such as a street name or a postal code.
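Most token-classification training code expects integer labels rather than tag strings. The sketch below shows one possible encoding, using the tag names listed under Data Fields and the sample shown under Data Instances. The id assignment and the `encode_sample` helper are assumptions made for illustration; a tag outside this list would raise a `KeyError`.

```python
# Tag names as listed in the "Data Fields" section of this card.
TAGS = [
    "StreetNumber", "StreetName", "Unit", "Suburb", "District",
    "PostalCode", "Municipality", "Province", "County", "Country",
]

# The integer ids are an arbitrary choice for illustration.
TAG2ID = {tag: i for i, tag in enumerate(TAGS)}
ID2TAG = {i: tag for tag, i in TAG2ID.items()}


def encode_sample(sample: dict) -> dict:
    """Split the address on whitespace and convert each tag to its integer id."""
    tokens = sample["Address"].split()
    labels = [TAG2ID[tag] for tag in sample["Tags"]]
    if len(tokens) != len(labels):
        raise ValueError("expected one tag per whitespace-separated word")
    return {"tokens": tokens, "labels": labels}


sample = {
    "Address": "Douglas County Minnesota 56332",
    "Tags": ["County", "County", "Province", "PostalCode"],
    "Language": "eng",
}
encoded = encode_sample(sample)
print(encoded)
# {'tokens': ['Douglas', 'County', 'Minnesota', '56332'], 'labels': [8, 8, 7, 5]}
```

If a subword tokenizer is used downstream, these word-level labels would still need to be aligned to subword tokens, which is outside the scope of this sketch.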
### Languages
Each country's addresses can be expressed in multiple languages. For example, the `us` data contains addresses from the United States, some of which are written in a language other than English (e.g., Spanish). Because the dataset is stored as Parquet and each sample includes its language as a data field, you can pass a filter to load only the addresses written in a specific language. To do so, specify the language's `ISO 639-3` code:
```python
from datasets import load_dataset
lang_iso = "eng"
# Only load addresses in English (eng)
lang_filter = [("Language", "==", lang_iso)]
ds = load_dataset("deepparse/worldwide-addresses", "us", filters=lang_filter)
```
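To keep more than one language, the same filter list can be built programmatically. This is a hedged sketch: `build_language_filter` and `load_languages` are hypothetical helpers, and the `"in"` operator is an assumption based on the (column, operator, value) tuple convention used by PyArrow Parquet filters, so it is worth confirming that the installed `datasets` version accepts it.

```python
import re


def build_language_filter(codes):
    """Build a (column, operator, value) filter list for the `Language` field."""
    codes = [code.strip().lower() for code in codes]
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        # ISO 639-3 codes are three letters; this checks the shape only.
        if not re.fullmatch(r"[a-z]{3}", code):
            raise ValueError(f"not an ISO 639-3 style code: {code!r}")
    if len(codes) == 1:
        return [("Language", "==", codes[0])]
    # Assumption: "in" is accepted, as in PyArrow's filter-tuple convention.
    return [("Language", "in", codes)]


def load_languages(country, codes):
    """Load one country's configuration restricted to the given languages."""
    # Imported lazily so the filter helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "deepparse/worldwide-addresses",
        country,
        filters=build_language_filter(codes),
    )
```

For example, `load_languages("us", ["eng", "spa"])` would request English and Spanish addresses from the `us` configuration.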
## Dataset Structure
### Data Instances
Each sample is formatted in the following way:
```
{
    'Address': 'Douglas County Minnesota 56332',
    'Tags': ['County', 'County', 'Province', 'PostalCode'],
    'Language': 'eng'
}
```
### Data Fields
The dataset contains three fields:
- `Address`: a string representing the address itself. There is no punctuation, and the words in the address are separated by whitespace. When training a model for `token-classification`, this is the input.
- `Tags`: the annotations that assign a standardized address element to each word in the address. It is a list with one entry per whitespace-separated word in the `Address` field. The tags are defined as follows:
  - `StreetNumber`: a house or a building number.
  - `StreetName`: the name of the street.
  - `Unit`: an apartment or a unit number.
  - `Suburb`: an unofficial neighbourhood name.
  - `District`: the name of a neighbourhood which has official administrative boundaries.
  - `PostalCode`: a standard postal code, whose format varies by country.
  - `Municipality`: the name of a city.
  - `Province`: the name of a sub-national division within a country.
  - `County`: the name of a major administrative area within a province.
  - `Country`: the name of a country.
- `Language`: the `ISO 639-3` code representing the language in which the address is written.
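Since `Tags` holds one entry per whitespace-separated word of `Address`, the two fields can be zipped together and consecutive words with the same tag merged into address components. The sketch below illustrates this on the sample from Data Instances; `group_components` is a hypothetical helper, not part of the dataset or of deepparse.

```python
from itertools import groupby


def group_components(sample: dict) -> list:
    """Merge consecutive words that share a tag into (tag, text) pairs."""
    words = sample["Address"].split()
    tags = sample["Tags"]
    if len(words) != len(tags):
        raise ValueError("expected one tag per whitespace-separated word")
    pairs = zip(tags, words)
    return [
        (tag, " ".join(word for _, word in group))
        for tag, group in groupby(pairs, key=lambda pair: pair[0])
    ]


sample = {
    "Address": "Douglas County Minnesota 56332",
    "Tags": ["County", "County", "Province", "PostalCode"],
    "Language": "eng",
}
components = group_components(sample)
print(components)
# [('County', 'Douglas County'), ('Province', 'Minnesota'), ('PostalCode', '56332')]
```

A list of pairs is used instead of a dictionary because only adjacent words are merged, so the same tag could appear in more than one group.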
## Dataset Creation
This dataset was curated and adapted from an international address dataset published by [libpostal](https://github.com/openvenues/libpostal/tree/master). For more details on the dataset creation process, see their [post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718).
## Additional Information
### Licensing Information
This dataset is shared under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
### Citation Information
If you use this dataset, please cite the following:
```
@misc{worldwide-addresses,
author = {Marouane Yassine and David Beauchemin},
title = {{Structured Multinational Address Data}},
year = {2026},
note = {\url{https://huggingface.co/datasets/deepparse/worldwide-addresses}}
}
```
### Contributions
Thanks to [@albarrentine](https://github.com/albarrentine) for creating and sharing the original dataset used to train libpostal's models, and for developing the package itself. It is quite an impressive piece of work!
Provider:
deepparse



