five

CENSUS-NER-Name-Email-Address-Phone

收藏
魔搭社区2025-11-27 更新2025-09-06 收录
下载链接:
https://modelscope.cn/datasets/Josephgflowers/CENSUS-NER-Name-Email-Address-Phone
下载链接
链接失效反馈
官方服务:
资源简介:
Dataset Summary The CENSUS-NER-Name-Email-Address-Phone dataset is a processed and structured version of the FMCSA (Federal Motor Carrier Safety Administration) CENSUS1 2016Sep dataset. It is designed to assist in training language models for tasks such as Named Entity Recognition (NER), address parsing, and information extraction from unstructured text. The dataset contains records that include information such as name, email, phone number, and address, extracted from the original dataset and presented in a structured format suitable for natural language processing (NLP) tasks. Key Features: Structured Data: The dataset is organized with three key columns: system, user, and assistant, representing different parts of the NLP prompt-response interaction. Address Normalization: The dataset includes normalized address information, with extracted house numbers, streets, cities, states, postal codes, and countries. Flexible Data Representation: Available in both TXT and CSV formats, the dataset is versatile for various training pipelines, including fine-tuning language models and developing AI assistants. Supported Tasks: Named Entity Recognition (NER) Address Parsing Information Extraction Natural Language Processing (NLP) Source Data The original data was sourced from the FMCSA CENSUS1 2016Sep dataset, which contains detailed records on motor carriers, including contact information and operational data. The dataset was restructured and processed to focus on extracting and normalizing key information fields such as names, emails, phone numbers, and addresses. Citation for Original Dataset If you use the FMCSA CENSUS1 2016Sep Address Extraction dataset, please also cite the original FMCSA dataset as follows: bibtex @misc{FMCSA2016, title = {Federal Motor Carrier Safety Administration (FMCSA) CENSUS1 2016Sep Dataset}, year = {2016}, howpublished = {https://www.fmcsa.dot.gov}, note = {Data accessed: 2016-09-01} } Dataset Structure Data Fields: system: The prompt provided to the model, instructing it to extract specific fields from the user input. user: The input text containing unstructured data from which the model extracts information. assistant: The model-generated output, formatted as JSON, containing the extracted fields: name, email, phone_number, and address. Example Entry: json { "system": "Extract the following information from the user input: Name, Email, Phone number, and Address. If a field is missing, ignore it and don't output anything regarding this field. Return the answer in JSON format.", "user": "John Doe, john.doe@example.com, 555-1234, 123 Main St, Anytown, NY, 12345, USA. Extra Info: ...", "assistant": { "name": "John Doe", "email": "john.doe@example.com", "phone_number": "555-1234", "address": "123 Main St, Anytown, NY, 12345, USA" } } Languages The dataset is in English, with text sourced from records maintained by the FMCSA. Usage This dataset can be used to train and evaluate models for tasks like Named Entity Recognition (NER), address parsing, and information extraction. The structured nature of the dataset makes it ideal for fine-tuning NLP models that need to understand and extract structured information from unstructured text. Acknowledgements We acknowledge the FMCSA for providing the original dataset used in this work. Their commitment to maintaining and sharing such data is invaluable to the research community.

数据集概述 CENSUS-NER-Name-Email-Address-Phone 数据集是美国联邦机动车承运人安全管理局(Federal Motor Carrier Safety Administration,FMCSA)CENSUS1 2016年9月数据集的经处理结构化版本。本数据集旨在辅助训练语言模型,以完成命名实体识别(Named Entity Recognition, NER)、地址解析以及非结构化文本信息抽取等任务。数据集包含从原始数据集提取并整理为适配自然语言处理(Natural Language Processing, NLP)任务的结构化格式的姓名、电子邮件、电话号码、地址等信息的记录。 核心特性: 结构化数据:数据集以三个关键列组织:system、user与assistant,分别代表自然语言处理提示-回复交互的不同环节。 地址标准化:数据集包含标准化后的地址信息,已提取门牌号、街道、城市、州、邮政编码及国家。 灵活的数据表示形式:支持TXT与CSV两种格式,可适配多种训练流程,包括大语言模型微调与AI智能体开发。 支持任务: 命名实体识别(NER) 地址解析 信息抽取 自然语言处理(NLP) 源数据 原始数据源自FMCSA CENSUS1 2016年9月数据集,该数据集包含机动车承运人的详细记录,涵盖联系信息与运营数据。本数据集经过重构与处理,专注于提取并标准化姓名、电子邮件、电话号码及地址等关键信息字段。 原始数据集引用 若使用FMCSA CENSUS1 2016年9月地址抽取数据集,请同时按照如下格式引用原始FMCSA数据集: bibtex @misc{FMCSA2016, title = {联邦机动车承运人安全管理局(FMCSA)CENSUS1 2016年9月数据集}, year = {2016}, howpublished = {https://www.fmcsa.dot.gov}, note = {数据访问日期:2016-09-01} } 数据集结构 数据字段: system:提供给模型的提示指令,指导其从用户输入中提取指定字段。 user:包含待抽取信息的非结构化文本输入。 assistant:模型生成的输出,格式为JSON,包含已提取的字段:姓名、电子邮件、电话号码及地址。 示例条目: json { "system": "从用户输入中提取以下信息:姓名、电子邮件、电话号码与地址。若某字段缺失,则忽略该字段且不输出相关内容。请以JSON格式返回结果。", "user": "John Doe, john.doe@example.com, 555-1234, 123 Main St, Anytown, NY, 12345, USA. Extra Info: ...", "assistant": { "name": "John Doe", "email": "john.doe@example.com", "phone_number": "555-1234", "address": "123 Main St, Anytown, NY, 12345, USA" } } 语言 本数据集采用英语,文本源自FMCSA维护的记录。 使用场景 本数据集可用于训练与评估命名实体识别、地址解析及信息抽取等任务的模型。其结构化特性使其非常适合用于微调需要从非结构化文本中理解并提取结构化信息的自然语言处理模型。 致谢 感谢FMCSA为本研究提供原始数据集。其对这类数据的维护与共享工作,对研究社区而言具有不可估量的价值。
提供机构:
maas
创建时间:
2025-08-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作