parsak/alpagasus-9k-tr
Source: Hugging Face. Updated 2024-03-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/parsak/alpagasus-9k-tr
Resource summary:
---
license: mit
dataset_info:
  features:
  - name: og_id
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4345803
    num_examples: 9181
  download_size: 2695286
  dataset_size: 4345803
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Dataset Card for alpagasus-9k-tr
<!-- Provide a quick summary of the dataset. -->
This dataset is the high-quality [AlpaGasus](https://lichang-chen.github.io/AlpaGasus/) subset mapped onto [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions).
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
According to the [AlpaGasus](https://lichang-chen.github.io/AlpaGasus/) paper, fine-tuning on a subset of higher-quality instruction-answer pairs from the original Alpaca dataset resulted in higher-quality fine-tuned models.
In April 2023, a Turkish translation of the Alpaca dataset was released by Merve ([merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)).
However, its indexing was shuffled, so the AlpaGasus-filtered dataset could not be mapped directly onto the Turkish dataset.
My task was to find the parallel sentences in the original and translated versions of the dataset. I encoded the English and Turkish sentences and calculated the cosine similarity between their embedding vectors. The sentence pairs with the highest similarity scores were treated as parallel sentences.
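The matching step described above can be illustrated with a minimal sketch. It is not the code used to build this dataset: the helper names `cosine` and `best_matches` are hypothetical, and the three-dimensional toy vectors are invented stand-ins for real sentence embeddings, which would normally come from a multilingual sentence-embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_matches(src_vecs, tgt_vecs):
    """For each source vector, return (target_index, similarity) of its most similar target."""
    matches = []
    for u in src_vecs:
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((j, scores[j]))
    return matches

# Toy vectors (assumed for illustration only, not real embeddings).
english = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
turkish = [[0.1, 0.9, 0.2], [0.9, 0.1, 0.1]]  # same items, shuffled order

matches = best_matches(english, turkish)
for i, (j, score) in enumerate(matches):
    print(f"english[{i}] -> turkish[{j}] (cosine similarity {score:.3f})")
```

With real data, each English row would be paired with the index of its best-scoring Turkish row, which recovers the mapping that the shuffled indexing lost.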
The semantic similarity between the original and translated versions of the dataset was calculated with [SBert](https://www.sbert.net/index.html)'s SentenceTransformers library.
(Inspired by [Margin-Based Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html#marging-based-mining); see [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1808.08745.pdf).)
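For readers unfamiliar with margin-based mining, the sketch below shows the ratio-margin score from Artetxe and Schwenk: the cosine similarity of a pair divided by the average similarity of each side to its k nearest neighbours. This card only says the approach was inspired by that technique, so whether this exact scoring was used here is an assumption. The function name `margin_scores`, the choice `k=2`, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def margin_scores(src_vecs, tgt_vecs, k=2):
    """Ratio-margin score for every (source, target) pair."""
    sim = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]
    # Mean similarity of each source to its k nearest targets.
    src_knn = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    # Mean similarity of each target to its k nearest sources.
    tgt_knn = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sim)]
    return [
        [sim[i][j] / ((src_knn[i] + tgt_knn[j]) / 2) for j in range(len(tgt_vecs))]
        for i in range(len(src_vecs))
    ]

# Toy vectors (assumed for illustration only, not real embeddings).
english = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2], [0.1, 0.1, 1.0]]
turkish = [[0.1, 0.9, 0.2], [0.2, 0.1, 0.9], [0.9, 0.1, 0.1]]  # shuffled order

scores = margin_scores(english, turkish, k=2)
# Index of the best-scoring Turkish candidate for each English item.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)
```

Compared with raw cosine similarity, the margin score penalises candidates that are close to many sentences at once, which helps when several translations look alike.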
- **Curated by:** [ParsaK](https://huggingface.co/parsak) at [Cosmos](https://huggingface.co/ytu-ce-cosmos)
- **Language(s) (NLP):** Turkish
- **License:** [MIT](https://opensource.org/license/mit)
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **The Original Dataset:** [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
- **Filtered Dataset:** [gpt4life's unofficial dataset release](https://github.com/gpt4life/alpagasus/blob/main/data/filtered/chatgpt_9k.json)
- **The Turkish Translations:** [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Dataset Card Contact
[ParsaK](https://huggingface.co/parsak)
Provider: parsak
Summary of original information
Dataset card
Dataset description
Dataset features
- og_id: int64
- instruction: string
- input: string
- output: string
Dataset splits
- train: 9181 examples, 4345803 bytes in total
Dataset size
- Download size: 2695286 bytes
- Dataset size: 4345803 bytes
Configuration
- default: data files at data/train-*
Language
- Turkish
License
- MIT
Dataset sources
- Original dataset: tatsu-lab/alpaca
- Filtered dataset: gpt4life's unofficial dataset release
- Turkish translations: merve/turkish_instructions



