mixed-address-parsing

ModelScope community. Updated 2025-12-05; indexed 2025-09-20.
Download link:
https://modelscope.cn/datasets/Josephgflowers/mixed-address-parsing
Resource description:
# mixed-address-parsing "📫"

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/PZYSrXC6aXFKqBci85SGF.png)

## Overview

The **mixed-address-parsing** dataset is designed to simulate the challenges encountered when processing real-world address inputs. It contains paired examples of noisy address strings (simulating user input) and their corresponding clean, structured JSON responses. The dataset was generated by extracting components from open geocoding data and deliberately injecting multiple types of noise to mimic common human errors and input variability. I have included some of the scripts used in processing the data.

## Motivation

Real-world user inputs often contain a range of imperfections such as typographical errors, random character insertions, misplaced numbers, duplicated entries, and even shuffled words. These inconsistencies can severely impact the performance of systems tasked with information extraction. This dataset was created to:

- Train and evaluate models in robust address parsing.
- Benchmark noise-robust natural language processing techniques.
- Provide a resource for developing preprocessing pipelines to normalize messy inputs.

## Dataset Composition

Each record in the dataset contains three main fields:

- **system**: A system message that provides detailed instructions for extracting and formatting address information. This message is generated by randomly selecting from several variations and appending a JSON template requirement.
- **user**: A noisy version of an address string. This string is built by concatenating available address components (house number, street, city, state, postal code, and country) and then applying a series of noise functions. These functions introduce typos, insert random numbers and characters, randomly add words, shuffle word order, and duplicate entries.
- **assistant**: A clean, structured JSON object containing the extracted address components.
  The JSON object may include the following keys:
  - `house_number`
  - `street`
  - `city`
  - `postal_code`
  - `state`
  - `country`
  - `normalized` (a full, human-readable address)
  - `simplified` (a shorter version of the address)

## Data Generation Process

The dataset was generated using a custom Python script that follows these key steps:

1. **Extraction of Address Components:** The script reads each row from an input CSV file, extracts the individual address elements (such as `number`, `street`, `city`, `state`, `postal_code`, and `country`), and trims extra whitespace.
2. **Construction of Clean Outputs:** Two address formats are created:
   - A **normalized address** that includes all available components.
   - A **simplified address** that focuses on the primary components.
3. **Simulated Noisy User Input:** The extracted components are concatenated to form a base string. Then, multiple noise functions are applied sequentially to simulate errors and variations:
   - **Typo Injection:** Randomly replaces characters with adjacent keyboard keys.
   - **Random Number and Character Insertion:** Adds numbers and random characters to words.
   - **Random Word Insertion:** Inserts words from a predefined list.
   - **Word Shuffling:** Randomly reorders the words.
   - **Entry Duplication:** Duplicates words to simulate repetition.
4. **Final Assembly:** Each processed record combines a system prompt (instruction), the noisy user input, and the clean assistant output into a final CSV format.

## Languages Covered

The noise injection process makes use of an extensive list of words and phrases drawn from multiple languages. This ensures that the dataset captures a broad linguistic spectrum and includes diverse contextual noise.
The languages covered include, but are not limited to:

- **English**
- **Mandarin Chinese (Simplified)**
- **Hindi**
- **Spanish**
- **French**
- **Arabic**
- **Bengali**
- **Portuguese**
- **Russian**
- **Japanese**
- **Punjabi**
- **German**
- **Javanese**
- **Italian**
- **Korean**
- **Urdu**
- **Turkish**
- **Vietnamese**
- **Telugu**
- **Marathi**
- **Tamil**
- **Persian**
- **Thai**
- **Gujarati**

Additionally, the dataset includes STEM-related terms (e.g., keywords related to AI, robotics, and cloud computing) to further diversify the noise and simulate domain-specific contexts.

## Intended Use Cases

- **Address Parsing and Extraction:** Training systems to accurately extract structured address data from noisy and unstructured text.
- **Noise Robustness Testing:** Evaluating the performance of NLP models under conditions of high variability and textual corruption.
- **Multi-Lingual Data Augmentation:** Providing a resource for enhancing model training with diverse linguistic inputs.
- **Data Cleaning Research:** Investigating preprocessing techniques to normalize and correct real-world user inputs.

## Limitations

- **Noise Parameters:** The noise functions use fixed probability settings that may not capture the full range of errors seen in real-world data.
- **Dataset Size:** The current version is based on a limited sample of 820k examples. For larger-scale training, evaluation, and real-world coverage, the dataset may need to be expanded.
- **Domain Specificity:** While the dataset includes a broad mix of languages and STEM terms, the primary focus remains on geocoded address data.
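
To make the noise steps listed under Data Generation Process more concrete, here is a minimal sketch of how such a pipeline could be chained. It is not the script used to build the dataset: the function names, the partial `ADJACENT_KEYS` map, the `FILLER_WORDS` list, and every probability value are assumptions chosen only for illustration.

```python
import random

# Hypothetical, partial map of neighbouring QWERTY keys (the card does not give the real table).
ADJACENT_KEYS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}
# Hypothetical filler words standing in for the multilingual / STEM word list.
FILLER_WORDS = ["robotics", "nube", "données", "クラウド"]
ALPHANUMERIC = "0123456789abcdefghijklmnopqrstuvwxyz"


def inject_typos(text, rng, p=0.05):
    """Replace some characters with a neighbouring key (probability p per mapped character)."""
    out = []
    for ch in text:
        neighbours = ADJACENT_KEYS.get(ch.lower())
        if neighbours and rng.random() < p:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)


def insert_random_chars(words, rng, p=0.1):
    """Append one random digit or letter to some words."""
    return [w + rng.choice(ALPHANUMERIC) if rng.random() < p else w for w in words]


def insert_random_words(words, rng, p=0.1):
    """Insert a filler word before some words."""
    out = []
    for w in words:
        if rng.random() < p:
            out.append(rng.choice(FILLER_WORDS))
        out.append(w)
    return out


def shuffle_words(words, rng, p=0.2):
    """With probability p, reorder the whole word list."""
    words = list(words)
    if rng.random() < p:
        rng.shuffle(words)
    return words


def duplicate_words(words, rng, p=0.05):
    """Repeat some words immediately after themselves."""
    out = []
    for w in words:
        out.append(w)
        if rng.random() < p:
            out.append(w)
    return out


def add_noise(address, seed=None):
    """Apply the five noise steps in the order the card lists them."""
    rng = random.Random(seed)
    words = inject_typos(address, rng).split()
    for step in (insert_random_chars, insert_random_words, shuffle_words, duplicate_words):
        words = step(words, rng)
    return " ".join(words)


# Made-up address, not taken from the dataset.
print(add_noise("12 Main Street Springfield IL 62701 US", seed=7))
```

Passing a seed keeps the output reproducible, which helps when comparing models on the same noisy inputs. Fixed per-step probabilities like these are also what the Noise Parameters limitation above refers to.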
## Citation

If you use this dataset in your research, please cite it as follows:

```
@misc{mixedup_addresses_dataset,
  title={mixed-address-parsing Dataset},
  author={Joseph Flowers},
  year={2025},
  note={Dataset created for address parsing and noise robustness evaluation.}
}
```

## License

This dataset is distributed under the [CC-BY-4.0 License](https://creativecommons.org/licenses/by/4.0/).
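
Returning to the record layout described under Dataset Composition, the following sketch shows one way to check an **assistant** value against the eight keys the card lists. It assumes the field is stored as a JSON string in the CSV; the helper name `parse_assistant` and the example record are hypothetical and not taken from the dataset.

```python
import json

# The keys listed in the dataset card for the assistant JSON object.
ALLOWED_KEYS = {
    "house_number", "street", "city", "postal_code",
    "state", "country", "normalized", "simplified",
}


def parse_assistant(raw):
    """Parse an assistant string and reject keys the card does not list."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("assistant output must be a JSON object")
    unexpected = set(data) - ALLOWED_KEYS
    if unexpected:
        raise ValueError(f"unexpected keys: {sorted(unexpected)}")
    return data


# Made-up example; any subset of the allowed keys is accepted.
example = (
    '{"house_number": "12", "street": "Main St", '
    '"city": "Springfield", "normalized": "12 Main St, Springfield"}'
)
print(parse_assistant(example)["street"])
```

Because the card says the object "may include" these keys, the check treats all of them as optional and only flags keys outside the list.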

Provider: maas
Created: 2025-08-31