shujatoor/ner_instruct-chat
收藏Hugging Face2024-04-26 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/shujatoor/ner_instruct-chat
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: Instruction
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: train
num_bytes: 10301286
num_examples: 16373
download_size: 1553901
dataset_size: 10301286
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# Instruction Dataset to Finetune LLM for Named Entity Recognition (NER)- Chat Format
- This instruction dataset can be used to fine tune your model for the purpose of performing Named Entity Recognition (NER)
- The dataset is already formatted as text with system prompt, user instruction and assistant message.
- This dataset contains 16.4k instruction examples.
- The dataset was created using a little over 2000 receipts data of which some 1500 are original and 500
are fake receipts.
- The Original receipts were downloaded from online opensource and some were collected by myself.
- The Original receipts are mainly from malaysia, India, US, UK, Canada and Pakistan
- Paddleocr was used to perform OCR for the original receipts.
- The fake receipts were generated using llama2-70b-4096. These receipts are related to grocery, pet shop,
hardware store, furniture, art store, bakery, restaurant, hunting store, ice cream parlor, clothing,
shoes, pharmacy etc.
## Summary:
- Original Receipts Used ~ 1500
- Fake Receipts Data Used ~ 500
- Fake Receipts data generated using: llama2-70b-4096
- Library used to perform OCR: Paddleocr
- Dataset created using prompt to llama2-70b-4096
- Dataset instruction examples: 16.4k
- Format: Chat-template
提供机构:
shujatoor
原始信息汇总
数据集概述
数据集基本信息
- 名称: Instruction Dataset to Finetune LLM for Named Entity Recognition (NER)- Chat Format
- 目的: 用于微调模型以进行命名实体识别(NER)
- 格式: 文本格式,包含系统提示、用户指令和助手消息
- 示例数量: 16,400个指令示例
数据集内容
- 原始收据: 约1,500份,主要来自马来西亚、印度、美国、英国、加拿大和巴基斯坦
- 伪造收据: 约500份,使用
llama2-70b-4096生成,涉及多种商业类型
数据处理工具
- OCR工具: Paddleocr
- 伪造收据生成工具:
llama2-70b-4096
数据集结构
- 特征:
- Instruction:
- content: 字符串类型
- role: 字符串类型
- Instruction:
- 分割:
- train:
- 示例数量: 16,373
- 数据大小: 10,301,286字节
- 下载大小: 1,553,901字节
- train:
数据集配置
- 默认配置:
- 数据文件:
- 分割: train
- 路径: data/train-*
- 数据文件:



