CCKS 2021 Address Parsing Dataset
Source: ModelScope community. Updated 2026-05-15; indexed 2024-05-15.
Download link:
https://modelscope.cn/datasets/iic/address-ner-ccks-2021
Resource description:
# CCKS 2021 Address Parsing Dataset
## Dataset Overview
This dataset is sourced from the [CCKS 2021 Chinese Address Parsing Academic Evaluation Task](https://tianchi.aliyun.com/competition/entrance/531900/introduction), provided by Alibaba DAMO Academy. The goal of releasing this dataset is to facilitate research on Chinese address parsing technologies within the NLP community.
### Dataset Introduction
Human activity is inseparable from location, which can be represented spatially as coordinates and textually as a mailing address. Mailing addresses appear widely in e-commerce logistics, government registration, finance, transportation, and other fields. Analysis and aggregation services for mailing addresses have become an important basic service that supports many Internet scenarios, such as map search and e-commerce logistics analysis.
In practice, address text is written freely, often omits elements or uses aliases, and is strongly regional, all of which make address parsing, normalization, and matching difficult. To address these difficulties, CCKS 2021 organized this address evaluation task, which consists of two subtasks: Chinese address element parsing and address text relevance. This dataset comes from the address element parsing subtask and includes the training, validation, and test sets of the preliminary round. The address data was obtained by crawling publicly available address information (e.g., yellow-page websites). By releasing a relatively large annotated corpus, the providers hope to advance address text processing together with the community.
### Task Definition
Addresses are an important kind of textual information in daily life, and many scenarios require registering one, such as e-commerce shopping, food delivery, the census, and opening water, electricity, and gas accounts.
Typical addresses generally contain the following categories of information:
- Administrative division information, including province, city, county, and township details;
- Road network information, including road names, road numbers, road facilities, etc.;
- Detailed address information, such as POI (Point of Interest), building numbers, room numbers, etc.;
- Non-address information, such as supplementary notes, accidental input, etc.
Address element parsing refers to the process of splitting an address text into semantically independent elements and identifying the type of each element. The goal of the Chinese address element parsing task is to decompose an address into detailed labels corresponding to the aforementioned categories, for example:
Input: 浙江省杭州市余杭区五常街道文一西路969号淘宝城5号楼,放前台 (Zhejiang Province, Hangzhou City, Yuhang District, Wuchang Subdistrict, No. 969 Wenyi West Road, Taobao City, Building 5, leave at the front desk)
Output: Province=浙江省 city=杭州市 district=余杭区 town=五常街道 road=文一西路 road_number=969号 poi=淘宝城 house_number=5号楼 other=,放前台
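The relationship between this element-level output and character-level sequence labels can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' tooling: the function name `to_bie_tags` is hypothetical, the element labels are taken from the Input/Output example above (the tag names in the released files may differ, e.g. `prov` and `roadno`), and the `S-` prefix for single-character elements is an assumption.

```python
from typing import List, Tuple


def to_bie_tags(elements: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Expand (label, text) elements into per-character (char, tag) pairs."""
    rows = []
    for label, text in elements:
        last = len(text) - 1
        for i, ch in enumerate(text):
            if last == 0:
                prefix = "S"  # assumption: single-character element
            elif i == 0:
                prefix = "B"  # first character of the element
            elif i == last:
                prefix = "E"  # last character of the element
            else:
                prefix = "I"  # characters in between
            rows.append((ch, f"{prefix}-{label}"))
    return rows


# Part of the example address, written as (label, text) elements.
elements = [("city", "杭州市"), ("road", "文一西路"), ("road_number", "969号")]
for ch, tag in to_bie_tags(elements):
    print(f"{ch}\t{tag}")
```

Each element contributes one tag per character, so the tagged sequence is exactly as long as the concatenated address text.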
### Dataset Format and Structure
The data follows the CoNLL format, with two columns separated by a tab character (`\t`). The first column contains the tokens of the input sentence (single characters in the example below), and the last column contains the address element type label for each token, prefixed with `B-`, `I-`, or `E-` to mark the beginning, inside, or end of an element. An example:
```
杭 B-city
州 E-city
五 B-poi
洲 I-poi
国 I-poi
际 E-poi
浙 B-prov
江 I-prov
省 E-prov
杭 B-city
州 I-city
市 E-city
余 B-district
杭 E-district
乔 B-town
司 I-town
街 I-town
道 E-town
博 B-road
卡 I-road
路 E-road
0 B-roadno
号 E-roadno
博 B-poi
卡 I-poi
制 I-poi
衣 E-poi
```
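Reading this format back into address elements might look like the following sketch. It is illustrative only: the helper names `read_sentences` and `decode_spans` are hypothetical, sentences are assumed to be separated by blank lines (the usual CoNLL convention, not confirmed by the example above), and the handling of `S-` and `O` tags is an assumption because the example only shows `B-`, `I-`, and `E-`.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token<sep>tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace, so tab or space both work.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def decode_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Merge B-/I-/E- tagged characters into (label, text) elements."""
    spans, chars, label = [], [], None
    for token, tag in sentence:
        prefix, _, name = tag.partition("-")
        if prefix == "B":
            chars, label = [token], name
        elif prefix in ("I", "E") and name == label:
            chars.append(token)
            if prefix == "E":
                spans.append((label, "".join(chars)))
                chars, label = [], None
        elif prefix == "S":  # assumption: single-character element
            spans.append((name, token))
            chars, label = [], None
        else:  # "O" or an inconsistent tag: discard any open span
            chars, label = [], None
    return spans


sample = [
    "杭\tB-city", "州\tE-city",
    "五\tB-poi", "洲\tI-poi", "国\tI-poi", "际\tE-poi",
    "",
    "浙\tB-prov", "江\tI-prov", "省\tE-prov",
]
for sentence in read_sentences(sample):
    print(decode_spans(sentence))
```

Splitting on the last whitespace run rather than on `\t` alone is a deliberate choice here, since copies of the data that have passed through text extraction may carry spaces instead of tabs.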
## Dataset Copyright Information
Creative Commons Attribution 4.0 International
## Citation
```bib
@inproceedings{zhang-etal-2022-domain,
    title = "Domain-Specific {NER} via Retrieving Correlated Samples",
    author = "Zhang, Xin and
      Jiang, Yong and
      Wang, Xiaobin and
      Hu, Xuming and
      Sun, Yueheng and
      Xie, Pengjun and
      Zhang, Meishan",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.211",
    pages = "2398--2404",
}
```
Provider: maas
Created: 2023-01-05
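Background overview
This dataset comes from the CCKS 2021 Chinese address parsing evaluation task. It contains the tokenized text and labels for the address element parsing subtask and is intended to promote research on Chinese address processing. The data is in CoNLL format and is suited to address parsing tasks in natural language processing.
The summary above was collected and generated by 遇见数据集.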