Josephgflowers/mixed-address-parsing
Hugging Face · Updated 2025-04-08 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/Josephgflowers/mixed-address-parsing
Description:
The mixed-address-parsing dataset simulates the challenges of processing real-world address input. It consists of paired examples: a noisy address string (simulating user input) and the corresponding clean, structured JSON response. The dataset was built by extracting address components from open-source geocoding data and deliberately injecting several types of noise to mimic common human errors and input variability. It is intended for training and evaluating models for robust address parsing, benchmarking noise-robust natural language processing techniques, and developing preprocessing pipelines for data cleaning. Each record contains three main fields: a system message, a user input, and an assistant output. The dataset covers multiple languages and supports address parsing and extraction, noise-robustness testing, multilingual data augmentation, and data cleaning research.
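The pairing of a noisy input with a clean JSON target can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the key names `system`, `user`, and `assistant`, the component keys such as `house_number` and `road`, the prompt text, and the three character-level noise operations. The listing does not specify the dataset's actual schema or noise types, so this is not the author's generation code.

```python
import json
import random

# Hypothetical prompt text; the listing only states that each record has a
# system message, a user input, and an assistant output.
SYSTEM_PROMPT = "Extract the address components and return them as JSON."


def inject_noise(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Apply illustrative character-level noise: drop, duplicate, or swap case."""
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(("drop", "duplicate", "swapcase"))
            if op == "drop":
                continue  # simulate a missed keystroke
            if op == "duplicate":
                out.append(ch * 2)  # simulate a repeated keystroke
                continue
            out.append(ch.swapcase())  # simulate inconsistent capitalization
            continue
        out.append(ch)
    return "".join(out)


def make_record(components: dict, seed: int = 0) -> dict:
    """Build one (system, user, assistant) record from clean components."""
    rng = random.Random(seed)  # seeded so the same input gives the same noise
    clean = ", ".join(str(v) for v in components.values())
    return {
        "system": SYSTEM_PROMPT,
        "user": inject_noise(clean, rng),
        "assistant": json.dumps(components, ensure_ascii=False),
    }


# Hypothetical component keys, chosen only for this example.
example = make_record(
    {"house_number": "221B", "road": "Baker Street", "city": "London", "country": "GB"}
)
print(json.dumps(example, indent=2, ensure_ascii=False))
```

Seeding the random generator per record keeps the noise reproducible, which helps when comparing models on the same noisy inputs. The assistant field always holds the untouched components, so a parser's output can be scored by exact JSON comparison.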
Provider:
Josephgflowers



