
parsak/alpagasus-9k-tr

Source: Hugging Face. Updated 2024-03-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/parsak/alpagasus-9k-tr
Resource description:
---
license: mit
dataset_info:
  features:
  - name: og_id
    dtype: int64
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4345803
    num_examples: 9181
  download_size: 2695286
  dataset_size: 4345803
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for alpagasus-9k-tr

This dataset is an [Alpagasus](https://lichang-chen.github.io/AlpaGasus/) high-quality subset mapped onto [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions).

## Dataset Details

### Dataset Description

According to the [Alpagasus](https://lichang-chen.github.io/AlpaGasus/) paper, fine-tuning on a subset of higher-quality instruction-answer pairs from the original Alpaca dataset results in higher-quality fine-tuned models. In April 2023, Merve released a Turkish translation of the Alpaca dataset ([merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)). However, its indexing was shuffled, so the Alpagasus filtered dataset could not be mapped directly onto the Turkish dataset.

My task was to find the parallel sentences in the original and translated versions of the dataset. I encoded the English and Turkish sentences and calculated the cosine similarity between their embedding vectors. The sentence pairs with the highest similarity scores were treated as parallel sentences. Using [SBert](https://www.sbert.net/index.html)'s SentenceTransformers library, we can calculate the semantic similarity between the original and translated versions of the dataset.

(Inspired by [Margin-Based Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html#marging-based-mining) - [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1808.08745.pdf))

- **Curated by:** [ParsaK](https://huggingface.co/parsak) at [Cosmos](https://huggingface.co/ytu-ce-cosmos)
- **Language(s) (NLP):** Turkish
- **License:** [MIT](https://opensource.org/license/mit)

### Dataset Sources

- **The Original Dataset:** [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
- **Filtered Dataset:** [gpt4life's unofficial dataset release](https://github.com/gpt4life/alpagasus/blob/main/data/filtered/chatgpt_9k.json)
- **The Turkish Translations:** [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions)

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Dataset Card Contact

[ParsaK](https://huggingface.co/parsak)
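The matching step described in the card (embed both sides, then pair each English sentence with the Turkish sentence whose embedding has the highest cosine similarity) can be illustrated with a minimal sketch. This is not the author's actual code: the function names `cosine` and `best_matches` are hypothetical, and the short hand-written vectors stand in for real sentence embeddings, which in practice would come from a multilingual SentenceTransformers model. The margin-based scoring that the card cites as inspiration is not reproduced here; this shows only plain cosine matching.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matches(src_vecs, tgt_vecs):
    """For each source vector, return (target_index, score) of its most similar target."""
    matches = []
    for u in src_vecs:
        scores = [cosine(u, v) for v in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy 3-d vectors standing in for real sentence embeddings (hypothetical values).
english_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.1, 1.0]]
# The "Turkish" side: the same items in shuffled order, with a little noise.
turkish_vecs = [[0.1, 0.0, 0.9], [0.9, 0.1, 0.1], [0.0, 0.8, 0.1]]

matches = best_matches(english_vecs, turkish_vecs)
mapping = [idx for idx, _ in matches]  # English index -> matched Turkish index
print(mapping)
```

With real data, each matched index would let a row from the filtered English subset be joined to its Turkish translation despite the shuffled ordering. A similarity threshold on the score could be used to discard uncertain pairs, though the card does not say whether one was applied.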
Provided by:
parsak
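Summary of original information

Dataset card

Dataset description

Dataset features

  • og_id: data type int64
  • instruction: data type string
  • input: data type string
  • output: data type string

Dataset splits

  • train: 9181 examples, 4345803 bytes in total

Dataset size

  • Download size: 2695286 bytes
  • Dataset size: 4345803 bytes

Configuration

  • default: data file path data/train-*

Language

  • Turkish

License

  • MIT

Dataset sources

Dataset card contact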
