parsak/alpagasus-9k-tr

Name: parsak/alpagasus-9k-tr
Creator: parsak
Published: 2024-03-01 08:43:06
License: MIT

Hugging Face | Updated 2024-03-01 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/parsak/alpagasus-9k-tr

Description:

---
license: mit
dataset_info:
  features:
  - name: og_id
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4345803
    num_examples: 9181
  download_size: 2695286
  dataset_size: 4345803
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for parsak/alpagasus-9k-tr

This dataset is an [AlpaGasus](https://lichang-chen.github.io/AlpaGasus/) high-quality subset mapped onto [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions).

## Dataset Details

### Dataset Description

According to the [AlpaGasus](https://lichang-chen.github.io/AlpaGasus/) paper, fine-tuning on a subset of higher-quality instruction-answer pairs from the original Alpaca dataset results in higher-quality fine-tuned models. In April 2023, Merve released a Turkish translation of the Alpaca dataset ([merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)). However, its indexing was shuffled, so the AlpaGasus-filtered dataset could not be mapped directly onto the Turkish dataset.

My task was to find the parallel sentences in the original and translated versions of the dataset. I encoded the English and Turkish sentences and calculated the cosine similarity between their embedding vectors. The sentence pairs with the highest similarity scores were treated as parallel sentences. The semantic similarity between the original and translated versions was calculated with [SBert](https://www.sbert.net/index.html)'s SentenceTransformers library. (Inspired by [Margin Based Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html#marging-based-mining) - [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1808.08745.pdf))

- **Curated by:** [ParsaK](https://huggingface.co/parsak) at [Cosmos](https://huggingface.co/ytu-ce-cosmos)
- **Language(s) (NLP):** Turkish
- **License:** [MIT](https://opensource.org/license/mit)

### Dataset Sources

- **The Original Dataset:** [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
- **Filtered Dataset:** [gpt4life's unofficial dataset release](https://github.com/gpt4life/alpagasus/blob/main/data/filtered/chatgpt_9k.json)
- **The Turkish Translations:** [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Dataset Card Contact

[ParsaK](https://huggingface.co/parsak)
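
The matching step described under Dataset Description (embed both sides, then pair each English sentence with the Turkish sentence of highest cosine similarity) can be illustrated with a minimal sketch. This is not the author's actual code. The function names `cosine` and `match_parallel` and the toy 3-dimensional vectors are hypothetical. In practice the vectors would come from a multilingual sentence embedding model, for example through the `encode` method of a SentenceTransformers model. That call is left out so the sketch runs without any downloads. Margin-based scoring, which the card cites as inspiration, is also not implemented here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_parallel(en_vecs, tr_vecs):
    """For each English vector, return (best Turkish index, similarity)."""
    matches = []
    for u in en_vecs:
        scores = [cosine(u, v) for v in tr_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy vectors standing in for real sentence embeddings (assumption).
en_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
# The "translated" side holds the same items in shuffled order.
tr_vecs = [[0.0, 0.9, 0.1], [0.2, 0.0, 0.8], [0.9, 0.0, 0.1]]

matches = match_parallel(en_vecs, tr_vecs)
mapping = [j for j, _ in matches]  # English index i -> Turkish index mapping[i]
print(mapping)  # [2, 0, 1]
```

A plain argmax like this can map two English sentences to the same Turkish sentence. A real pipeline would likely need a similarity threshold or a one-to-one constraint, which this sketch leaves out.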

Provider:

parsak

Summary of Original Information