Rohit228/govt_id_data_v2

Name: Rohit228/govt_id_data_v2
Creator: Rohit228
Published: 2024-05-06 20:17:05
License: No description available

Hugging Face, updated 2024-05-06, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/Rohit228/govt_id_data_v2

Resource description:

---
size_categories: n<1K
dataset_info:
- config_name: default
  features:
  - name: instruction
    dtype: string
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 216381
    num_examples: 128
  download_size: 104150
  dataset_size: 216381
- config_name: keep_columns
  features:
  - name: instruction
    dtype: string
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 10313
    num_examples: 5
  download_size: 24704
  dataset_size: 10313
- config_name: push_to_argilla
  features:
  - name: Index
    dtype: int64
  - name: instruction
    dtype: string
  - name: Class
    dtype: string
  - name: Subclass
    dtype: string
  - name: Quantization
    dtype: bool
  - name: generation_models
    sequence: string
  - name: generations
    sequence: string
  - name: model_name
    dtype: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 10607
    num_examples: 5
  download_size: 27106
  dataset_size: 10607
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: keep_columns
  data_files:
  - split: train
    path: keep_columns/train-*
- config_name: push_to_argilla
  data_files:
  - split: train
    path: push_to_argilla/train-*
tags:
- synthetic
- distilabel
- rlaif
---

<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset Card for govt_id_data_v2

This dataset has been created with [distilabel](https://distilabel.argilla.io/).

## Dataset Summary

This dataset contains a `pipeline.yaml` which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI:

```console
distilabel pipeline run --config "https://huggingface.co/datasets/Rohit228/govt_id_data_v2/raw/main/pipeline.yaml"
```

or explore the configuration:

```console
distilabel pipeline info --config "https://huggingface.co/datasets/Rohit228/govt_id_data_v2/raw/main/pipeline.yaml"
```

## Dataset structure

The examples have the following structure per configuration:

<details><summary> Configuration: default </summary><hr>

```json
{
    "generation_models": [
        "flamingo"
    ],
    "generations": [
        "Here is a sample document for a USA B-1/B-2 Visa (Business or Tourism Visa):\n\n**Application for Visitor (B-1/B-2) Visa**\n\n**Name:** Sophia Patel\n**Date of Birth:** February 12, 1990\n**Nationality:** Indian\n**Passport Number:** ABC1234567\n**Expiration Date:** August 31, 2025\n\n**Purpose of Visit:** Business Meeting with American Partners to Discuss New Product Line (estimated duration: 10 days)\n**Sponsoring Organization:** TechCorp Inc., 123 Main St, Boston, MA 02101\n**Contact Information:** Sophia Patel, [sophia.patel@techcorp.com](mailto:sophia.patel@techcorp.com), +1-617-555-1234\n\n**Travel Itinerary:**\n\n* Departure from: Mumbai, India (February 15, 2023)\n* Arrival at: John F. Kennedy International Airport, New York City, USA (February 16, 2023)\n* Return to: Mumbai, India (February 25, 2023)\n\n**Accommodation:**\nI will be staying at the Hyatt Regency Boston, 980 Commonwealth Ave, Boston, MA 02215\n\n**Financial Support:** My company, TechCorp Inc., will cover all expenses related to my visit, including travel, accommodation, and meals.\n\n**Health Insurance:** I have health insurance coverage through my employer, TechCorp Inc.\n\n**Biographical Information:**\n\nI am a citizen of India and hold a valid passport. I have traveled extensively throughout Asia and Europe, but this will be my first trip to the United States. I do not have any criminal record or prior immigration violations.\n\n**Certification:** I certify that all information provided is true and accurate to the best of my knowledge. I understand that providing false information may result in denial of my visa application or, if discovered after entry into the United States, possible deportation.\n\n**Signature:** Sophia Patel (original signature)\n\nPlease note that this document is a sample and should not be used for actual purposes. Visa applications require specific forms and documentation, which may vary depending on the type of visa and individual circumstances."
    ],
    "instruction": "Give a sample document of USA B-1/B-2 Visa (Business or Tourism Visa) with fake data. Do not use any placeholder or *. Give just text",
    "ratings": [
        5
    ],
    "rationales": [
        "The text accurately provides a sample document for a USA B-1/B-2 Visa with all necessary information, including applicant details, purpose of visit, sponsoring organization, travel itinerary, accommodation, financial support, health insurance, and biographical information. It is truthful and free of hallucinations. The text follows the instruction given precisely."
    ]
}
```

This subset can be loaded as:

```python
from datasets import load_dataset

ds = load_dataset("Rohit228/govt_id_data_v2", "default")
```

Or simply as follows, since the configuration is named `default`:

```python
from datasets import load_dataset

ds = load_dataset("Rohit228/govt_id_data_v2")
```

</details>
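The record structure shown above stores `generations`, `generation_models`, `ratings`, and `rationales` as parallel lists. A minimal sketch of picking the highest-rated generation from one record follows; it assumes the lists are index-aligned (the sample has one entry in each, so the card does not confirm this), and the helper name `best_generation` and the two-model record are hypothetical.

```python
from typing import Any


def best_generation(record: dict[str, Any]) -> tuple[str, str, int]:
    """Return (model, generation, rating) for the highest-rated generation.

    Assumes generations, generation_models and ratings are index-aligned.
    On a tie, the earliest entry wins.
    """
    generations = record["generations"]
    models = record["generation_models"]
    ratings = record["ratings"]
    if not (len(generations) == len(models) == len(ratings)):
        raise ValueError("generations, generation_models and ratings must align")
    if not ratings:
        raise ValueError("record has no rated generations")
    best = max(range(len(ratings)), key=ratings.__getitem__)
    return models[best], generations[best], ratings[best]


# Hypothetical record with two candidate generations (not taken from the dataset).
example = {
    "instruction": "Give a sample document with fake data.",
    "generation_models": ["model_a", "model_b"],
    "generations": ["first draft", "second draft"],
    "ratings": [3, 5],
    "rationales": ["partly follows the instruction", "follows the instruction"],
}
model, text, rating = best_generation(example)
print(model, rating)
```

The same function can be applied to each row of the `train` split once the dataset is loaded.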

Provider:

Rohit228

Summary of original information

Dataset overview

Basic information

Dataset size: fewer than 1,000 examples (size_categories: n<1K)

Dataset configurations

Configuration names: default, keep_columns, push_to_argilla

Features

default and keep_columns configurations:
- instruction: string
- generations: sequence of strings
- generation_models: sequence of strings
- ratings: sequence of integers (int64)
- rationales: sequence of strings
push_to_argilla configuration:
- Index: integer (int64)
- instruction: string
- Class: string
- Subclass: string
- Quantization: boolean
- generation_models: sequence of strings
- generations: sequence of strings
- model_name: string
- ratings: sequence of integers (int64)
- rationales: sequence of strings
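The feature list above can be turned into a lightweight record check. The sketch below is plain Python rather than the `datasets` library's own schema mechanism; the names `DEFAULT_SCHEMA`, `PUSH_TO_ARGILLA_SCHEMA`, and `schema_errors` are hypothetical, and a "sequence" is modelled as a Python list with a single item type.

```python
# Expected field types, transcribed from the feature list above.
# A one-element list such as [str] stands for "sequence of strings".
DEFAULT_SCHEMA = {
    "instruction": str,
    "generations": [str],
    "generation_models": [str],
    "ratings": [int],
    "rationales": [str],
}

PUSH_TO_ARGILLA_SCHEMA = {
    "Index": int,
    "instruction": str,
    "Class": str,
    "Subclass": str,
    "Quantization": bool,
    "generation_models": [str],
    "generations": [str],
    "model_name": str,
    "ratings": [int],
    "rationales": [str],
}


def _is_type(value, expected):
    # bool is a subclass of int in Python, so reject it where an int is expected.
    if expected is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, expected)


def schema_errors(record, schema):
    """Return a list of human-readable mismatches between record and schema."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        if isinstance(expected, list):
            item_type = expected[0]
            if not isinstance(value, list) or not all(_is_type(v, item_type) for v in value):
                errors.append(f"{name}: expected list of {item_type.__name__}")
        elif not _is_type(value, expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors
```

An empty list from `schema_errors(record, DEFAULT_SCHEMA)` means the record matches the listed fields; extra fields are ignored in this sketch.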

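Dataset splits

Training split (train):
- default configuration:
  - Data size: 216381 bytes
  - Number of examples: 128
  - Download size: 104150 bytes
- keep_columns configuration:
  - Data size: 10313 bytes
  - Number of examples: 5
  - Download size: 24704 bytes
- push_to_argilla configuration:
  - Data size: 10607 bytes
  - Number of examples: 5
  - Download size: 27106 bytes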
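The split figures above allow a quick sanity check of per-example size and of how the download size compares with the data size. The sketch below only does arithmetic on the listed numbers; the names `SPLITS` and `summarize` are hypothetical.

```python
# Figures copied from the split listing above.
SPLITS = {
    "default": {"num_bytes": 216381, "num_examples": 128, "download_size": 104150},
    "keep_columns": {"num_bytes": 10313, "num_examples": 5, "download_size": 24704},
    "push_to_argilla": {"num_bytes": 10607, "num_examples": 5, "download_size": 27106},
}


def summarize(stats):
    """Compute average bytes per example and the download/data size ratio."""
    return {
        "bytes_per_example": stats["num_bytes"] / stats["num_examples"],
        "download_ratio": stats["download_size"] / stats["num_bytes"],
    }


for name, stats in SPLITS.items():
    s = summarize(stats)
    print(f"{name}: {s['bytes_per_example']:.1f} bytes/example, "
          f"download/data = {s['download_ratio']:.2f}")
```

For the two five-example configurations the download size exceeds the data size. This may reflect fixed per-file overhead in the stored files, although the source does not say so.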

Data file paths

default configuration: data/train-*
keep_columns configuration: keep_columns/train-*
push_to_argilla configuration: push_to_argilla/train-*
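Each configuration is tied to a glob pattern over the repository's files. As a rough illustration of how such patterns partition files, the sketch below matches paths with Python's `fnmatch`. This is an approximation, since the Hub's own pattern resolution may differ in detail, and the shard file names are hypothetical examples, not names confirmed by the source.

```python
from fnmatch import fnmatchcase

# Patterns copied from the data file paths listed above.
DATA_FILES = {
    "default": "data/train-*",
    "keep_columns": "keep_columns/train-*",
    "push_to_argilla": "push_to_argilla/train-*",
}


def config_for(path):
    """Return the first configuration whose pattern matches path, else None."""
    for config, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return config
    return None


# Hypothetical file names, used only to exercise the patterns.
for path in [
    "data/train-00000-of-00001.parquet",
    "keep_columns/train-00000-of-00001.parquet",
    "README.md",
]:
    print(path, "->", config_for(path))
```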