Rohit228/govt_id_data_v2
Source: Hugging Face. Updated 2024-05-06; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Rohit228/govt_id_data_v2
Resource description:
---
size_categories: n<1K
dataset_info:
- config_name: default
  features:
  - name: instruction
    dtype: string
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 216381
    num_examples: 128
  download_size: 104150
  dataset_size: 216381
- config_name: keep_columns
  features:
  - name: instruction
    dtype: string
  - name: generations
    sequence: string
  - name: generation_models
    sequence: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 10313
    num_examples: 5
  download_size: 24704
  dataset_size: 10313
- config_name: push_to_argilla
  features:
  - name: Index
    dtype: int64
  - name: instruction
    dtype: string
  - name: Class
    dtype: string
  - name: Subclass
    dtype: string
  - name: Quantization
    dtype: bool
  - name: generation_models
    sequence: string
  - name: generations
    sequence: string
  - name: model_name
    dtype: string
  - name: ratings
    sequence: int64
  - name: rationales
    sequence: string
  splits:
  - name: train
    num_bytes: 10607
    num_examples: 5
  download_size: 27106
  dataset_size: 10607
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: keep_columns
  data_files:
  - split: train
    path: keep_columns/train-*
- config_name: push_to_argilla
  data_files:
  - split: train
    path: push_to_argilla/train-*
tags:
- synthetic
- distilabel
- rlaif
---
<p align="left">
<a href="https://github.com/argilla-io/distilabel">
<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
</a>
</p>
# Dataset Card for govt_id_data_v2
This dataset has been created with [distilabel](https://distilabel.argilla.io/).
## Dataset Summary
This dataset contains a `pipeline.yaml` file that can be used to reproduce the pipeline that generated it, using the `distilabel` CLI:
```console
distilabel pipeline run --config "https://huggingface.co/datasets/Rohit228/govt_id_data_v2/raw/main/pipeline.yaml"
```
or explore the configuration:
```console
distilabel pipeline info --config "https://huggingface.co/datasets/Rohit228/govt_id_data_v2/raw/main/pipeline.yaml"
```
## Dataset structure
The examples have the following structure per configuration:
<details><summary> Configuration: default </summary><hr>
```json
{
    "generation_models": [
        "flamingo"
    ],
    "generations": [
        "Here is a sample document for a USA B-1/B-2 Visa (Business or Tourism Visa):\n\n**Application for Visitor (B-1/B-2) Visa**\n\n**Name:** Sophia Patel\n**Date of Birth:** February 12, 1990\n**Nationality:** Indian\n**Passport Number:** ABC1234567\n**Expiration Date:** August 31, 2025\n\n**Purpose of Visit:** Business Meeting with American Partners to Discuss New Product Line (estimated duration: 10 days)\n**Sponsoring Organization:** TechCorp Inc., 123 Main St, Boston, MA 02101\n**Contact Information:** Sophia Patel, [sophia.patel@techcorp.com](mailto:sophia.patel@techcorp.com), +1-617-555-1234\n\n**Travel Itinerary:**\n\n* Departure from: Mumbai, India (February 15, 2023)\n* Arrival at: John F. Kennedy International Airport, New York City, USA (February 16, 2023)\n* Return to: Mumbai, India (February 25, 2023)\n\n**Accommodation:**\nI will be staying at the Hyatt Regency Boston, 980 Commonwealth Ave, Boston, MA 02215\n\n**Financial Support:** My company, TechCorp Inc., will cover all expenses related to my visit, including travel, accommodation, and meals.\n\n**Health Insurance:** I have health insurance coverage through my employer, TechCorp Inc.\n\n**Biographical Information:**\n\nI am a citizen of India and hold a valid passport. I have traveled extensively throughout Asia and Europe, but this will be my first trip to the United States. I do not have any criminal record or prior immigration violations.\n\n**Certification:** I certify that all information provided is true and accurate to the best of my knowledge. I understand that providing false information may result in denial of my visa application or, if discovered after entry into the United States, possible deportation.\n\n**Signature:** Sophia Patel (original signature)\n\nPlease note that this document is a sample and should not be used for actual purposes. Visa applications require specific forms and documentation, which may vary depending on the type of visa and individual circumstances."
    ],
    "instruction": "Give a sample document of USA B-1/B-2 Visa (Business or Tourism Visa) with fake data. Do not use any placeholder or *. Give just text",
    "ratings": [
        5
    ],
    "rationales": [
        "The text accurately provides a sample document for a USA B-1/B-2 Visa with all necessary information, including applicant details, purpose of visit, sponsoring organization, travel itinerary, accommodation, financial support, health insurance, and biographical information. It is truthful and free of hallucinations. The text follows the instruction given precisely."
    ]
}
```
This subset can be loaded as:
```python
from datasets import load_dataset
ds = load_dataset("Rohit228/govt_id_data_v2", "default")
```
Or simply as follows, since the configuration named `default` is the one loaded when no configuration name is given:
```python
from datasets import load_dataset
ds = load_dataset("Rohit228/govt_id_data_v2")
```
</details>
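Each record stores `generation_models`, `generations`, `ratings`, and `rationales` as parallel sequences. The following is a minimal sketch of turning one record into per-generation rows and picking the highest-rated one. It assumes the four sequences are index-aligned, which the card does not state explicitly. The helper names `pair_generations` and `best_generation` and the `min_rating` threshold are hypothetical, and the sample record is a shortened stand-in for the example above.

```python
from typing import Any, Optional


def pair_generations(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Zip the parallel sequences of one record into per-generation rows.

    Assumption: the i-th entry of each sequence describes the same generation.
    """
    return [
        {"model": model, "text": text, "rating": rating, "rationale": rationale}
        for model, text, rating, rationale in zip(
            record["generation_models"],
            record["generations"],
            record["ratings"],
            record["rationales"],
        )
    ]


def best_generation(record: dict[str, Any], min_rating: int = 4) -> Optional[dict[str, Any]]:
    """Return the highest-rated row at or above min_rating, or None if there is none."""
    rows = [row for row in pair_generations(record) if row["rating"] >= min_rating]
    return max(rows, key=lambda row: row["rating"], default=None)


# Shortened stand-in for a record of the `default` configuration.
record = {
    "instruction": "Give a sample document of USA B-1/B-2 Visa (Business or Tourism Visa) with fake data.",
    "generation_models": ["flamingo"],
    "generations": ["Here is a sample document for a USA B-1/B-2 Visa (Business or Tourism Visa):"],
    "ratings": [5],
    "rationales": ["The text follows the instruction given precisely."],
}

best = best_generation(record)
print(best["model"], best["rating"])
```

The same helpers can be applied to each row of a split obtained from `load_dataset`, since every row is a plain dictionary with these keys.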
Provider:
Rohit228
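Summary of original information
Dataset overview
Basic information
- Dataset size: fewer than 1,000 examples (`size_categories: n<1K`)
Dataset configurations
- Configuration names: `default`, `keep_columns`, `push_to_argilla`
Feature information
- `default` and `keep_columns` configurations:
  - `instruction`: string
  - `generations`: sequence of strings
  - `generation_models`: sequence of strings
  - `ratings`: sequence of integers (int64)
  - `rationales`: sequence of strings
- `push_to_argilla` configuration:
  - `Index`: integer (int64)
  - `instruction`: string
  - `Class`: string
  - `Subclass`: string
  - `Quantization`: boolean
  - `generation_models`: sequence of strings
  - `generations`: sequence of strings
  - `model_name`: string
  - `ratings`: sequence of integers (int64)
  - `rationales`: sequence of strings
Dataset splits
- Training split (`train`):
  - `default` configuration:
    - Data size: 216381 bytes
    - Number of examples: 128
    - Download size: 104150 bytes
  - `keep_columns` configuration:
    - Data size: 10313 bytes
    - Number of examples: 5
    - Download size: 24704 bytes
  - `push_to_argilla` configuration:
    - Data size: 10607 bytes
    - Number of examples: 5
    - Download size: 27106 bytes
Data file paths
- `default` configuration: `data/train-*`
- `keep_columns` configuration: `keep_columns/train-*`
- `push_to_argilla` configuration: `push_to_argilla/train-*`
Tags
- Dataset tags: `synthetic`, `distilabel`, `rlaif`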
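
The split statistics listed above can be compared across configurations with a little arithmetic. The sketch below copies the `num_bytes`, `num_examples`, and `download_size` values from the card metadata and computes the average data size per example. The names `SPLIT_STATS` and `bytes_per_example` are hypothetical and used only for illustration.

```python
# Train-split statistics copied from the dataset card metadata.
SPLIT_STATS = {
    "default": {"num_bytes": 216381, "num_examples": 128, "download_size": 104150},
    "keep_columns": {"num_bytes": 10313, "num_examples": 5, "download_size": 24704},
    "push_to_argilla": {"num_bytes": 10607, "num_examples": 5, "download_size": 27106},
}


def bytes_per_example(stats: dict) -> float:
    """Average data size per example, in bytes."""
    return stats["num_bytes"] / stats["num_examples"]


for name, stats in SPLIT_STATS.items():
    # Flag configurations whose download is larger than the data itself.
    larger_download = stats["download_size"] > stats["num_bytes"]
    print(f"{name}: {bytes_per_example(stats):.1f} bytes per example, "
          f"download larger than data: {larger_download}")
```

For the two five-example configurations the download size exceeds the data size, which is plausibly due to fixed file-format overhead dominating at this scale, though the card does not say so.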



