five

shujatoor/ner_instruct-chat

收藏
Hugging Face2024-04-26 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/shujatoor/ner_instruct-chat
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: Instruction list: - name: content dtype: string - name: role dtype: string splits: - name: train num_bytes: 10301286 num_examples: 16373 download_size: 1553901 dataset_size: 10301286 configs: - config_name: default data_files: - split: train path: data/train-* --- # Instruction Dataset to Finetune LLM for Named Entity Recognition (NER)- Chat Format - This instruction dataset can be used to fine tune your model for the purpose of performing Named Entity Recognition (NER) - The dataset is already formatted as text with system prompt, user instruction and assistant message. - This dataset contains 16.4k instruction examples. - The dataset was created using a little over 2000 receipts data of which some 1500 are original and 500 are fake receipts. - The Original receipts were downloaded from online opensource and some were collected by myself. - The Original receipts are mainly from malaysia, India, US, UK, Canada and Pakistan - Paddleocr was used to perform OCR for the original receipts. - The fake receipts were generated using llama2-70b-4096. These receipts are related to grocery, pet shop, hardware store, furniture, art store, bakery, restaurant, hunting store, ice cream parlor, clothing, shoes, pharmacy etc. ## Summary: - Original Receipts Used ~ 1500 - Fake Receipts Data Used ~ 500 - Fake Receipts data generated using: llama2-70b-4096 - Library used to perform OCR: Paddleocr - Dataset created using prompt to llama2-70b-4096 - Dataset instruction examples: 16.4k - Format: Chat-template
提供机构:
shujatoor
原始信息汇总

数据集概述

数据集基本信息

  • 名称: Instruction Dataset to Finetune LLM for Named Entity Recognition (NER)- Chat Format
  • 目的: 用于微调模型以进行命名实体识别(NER)
  • 格式: 文本格式,包含系统提示、用户指令和助手消息
  • 示例数量: 16,400个指令示例

数据集内容

  • 原始收据: 约1,500份,主要来自马来西亚、印度、美国、英国、加拿大和巴基斯坦
  • 伪造收据: 约500份,使用llama2-70b-4096生成,涉及多种商业类型

数据处理工具

  • OCR工具: Paddleocr
  • 伪造收据生成工具: llama2-70b-4096

数据集结构

  • 特征:
    • Instruction:
      • content: 字符串类型
      • role: 字符串类型
  • 分割:
    • train:
      • 示例数量: 16,373
      • 数据大小: 10,301,286字节
      • 下载大小: 1,553,901字节

数据集配置

  • 默认配置:
    • 数据文件:
      • 分割: train
      • 路径: data/train-*
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作