Xhaheen/Alpaca_urdu_2024_1
Source: Hugging Face | Updated: 2024-03-06 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/Xhaheen/Alpaca_urdu_2024_1
Resource description:
---
dataset_info:
  features:
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: answer_lengths
    dtype: 'null'
  splits:
  - name: train
    num_bytes: 51251741
    num_examples: 45622
  download_size: 24545189
  dataset_size: 51251741
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-generation
language:
- ur
size_categories:
- 10K<n<100K
---
Description
Alpaca Urdu 🦙 is an Urdu translation of the Alpaca Cleaned dataset. It is part of the Alpaca project and is intended for NLP tasks such as text generation. 🌐
Dataset Information
Size: The translated dataset contains 45,622 samples, all in the train split.
Languages: Urdu
License: cc-by-4.0 (the card metadata declares apache-2.0 instead)
Original Dataset: Alpaca Cleaned dataset
Columns
The translated dataset includes the following columns:
input: input text in Urdu.
output: translated output in Urdu.
answer_lengths: lengths of the answers. The card metadata declares this column with dtype 'null', so it carries no values as published.
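Because answer_lengths is declared with a null dtype, anyone who needs answer lengths would likely have to compute them from the output column. The following is a minimal sketch of one way to do that, not the author's method. The rows are hypothetical English placeholders standing in for real records, the helper name `with_answer_length` is an assumption, and "length" is taken to mean the character count.

```python
# Hypothetical stand-in rows with the three documented columns.
# Real rows would come from dataset["train"]; these are placeholders, not dataset content.
rows = [
    {"input": "placeholder instruction 1", "output": "short answer", "answer_lengths": None},
    {"input": "placeholder instruction 2", "output": "a somewhat longer answer", "answer_lengths": None},
]


def with_answer_length(row):
    """Return a copy of the row with answer_lengths set to the character count of output."""
    return {**row, "answer_lengths": len(row["output"])}


filled = [with_answer_length(row) for row in rows]

for row in filled:
    print(row["answer_lengths"], row["output"])
```

The same function could in principle be passed to the `map` method of the loaded train split. Whether the existing null-typed column accepts integers is untested here, so writing the result to a new column name may be the safer choice.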
Example Usage
from datasets import load_dataset
# Load the translated dataset
dataset = load_dataset("Xhaheen/Alpaca_urdu_2024_1")
# Access a sample
sample = dataset["train"][0]
print(sample)
# Optional: export the train split to CSV with pandas
import pandas as pd
# Build a DataFrame from the "train" split
df = pd.DataFrame(dataset["train"])
# Save the DataFrame to a CSV file named "alpaca_urdu.csv"
df.to_csv("alpaca_urdu.csv", index=False)
Provided by:
Xhaheen
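Summary of original information
Dataset overview
Basic information
- Dataset name: Alpaca Urdu 🦙
- Language: Urdu
- License: Apache-2.0
- Task category: text generation
- Size category: 10K<n<100K
Data structure
- Features:
  - input: input text, string
  - output: output text, string
  - answer_lengths: answer lengths, null dtype
Data splits
- Train split:
  - File name: data/train-*
  - Number of examples: 45622
  - Size in bytes: 51251741
Download information
- Download size: 24545189
- Dataset size: 51251741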
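The raw byte counts above are easier to interpret as derived quantities. As a rough illustration, the sketch below computes the average size per example and the ratio of download size to dataset size from the figures listed. The variable names are assumptions chosen for readability; only the three numbers come from the card.

```python
# Figures copied from the card metadata.
num_examples = 45622
dataset_size = 51251741    # bytes, size of the dataset once loaded
download_size = 24545189   # bytes, size of the files as downloaded

# Average number of bytes per example in the train split.
avg_bytes_per_example = dataset_size / num_examples

# Fraction of the loaded size that has to be transferred.
download_ratio = download_size / dataset_size

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size / dataset size: {download_ratio:.3f}")
```

By this estimate each example averages a little over 1 KB, and the download is a bit under half the size of the loaded dataset.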
Example usage
from datasets import load_dataset
# Load the translated dataset
dataset = load_dataset("Xhaheen/Alpaca_urdu_2024_1")
# Access a sample
sample = dataset["train"][0]
print(sample)



