
alban-labs/databricks-dolly-15k-sq

Source: Hugging Face. Updated 2024-09-01. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/alban-labs/databricks-dolly-15k-sq
Resource description:
---
dataset_info:
  features:
  - name: instruction_en
    dtype: string
  - name: context_en
    dtype: string
  - name: response_en
    dtype: string
  - name: category
    dtype: string
  - name: instruction_sq
    dtype: string
  - name: context_sq
    dtype: string
  - name: response_sq
    dtype: string
  splits:
  - name: train
    num_bytes: 26489107
    num_examples: 15011
  download_size: 16705533
  dataset_size: 26489107
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- sq
size_categories:
- 10K<n<100K
---

# alban-labs/databricks-dolly-15k-sq

## Summary

`alban-labs/databricks-dolly-15k-sq` is a machine-translated version of the `databricks/databricks-dolly-15k` dataset into Albanian. The original dataset, created by Databricks employees, consists of instruction-following records across various behavioral categories such as brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This Albanian translation has been generated using the LLaMA 3.1 405B model.

## Supported Tasks

- Training LLMs
- Synthetic Data Generation
- Data Augmentation

## Languages

- Albanian

## Version

1.0

## Dataset Overview

The `databricks-dolly-15k` dataset, originally in English, contains over 15,000 records generated by Databricks employees. It is designed to help large language models exhibit interactive behavior similar to ChatGPT. This dataset includes prompt/response pairs across eight instruction categories. The translation into Albanian allows for broader accessibility and usability in Albanian-speaking contexts.

The original dataset was generated with strict guidelines to avoid web-based information (except Wikipedia) and generative AI in creating prompts and responses. This translation retains the integrity of the original dataset while making it available for Albanian speakers.

## Intended Uses

The translated dataset is useful for:

- **Fine-tuning Language Models**: Use this dataset to train models on Albanian instructions and responses.
- **Synthetic Data Generation**: Generate additional instruction-response pairs using the Albanian translation.
- **Data Augmentation**: Employ the dataset to augment training data with translated examples.

## Dataset

### Purpose of Collection

This dataset is part of an initiative to make high-quality instruction-following data available in multiple languages. By translating `databricks/databricks-dolly-15k` into Albanian, we aim to support the development and fine-tuning of language models for Albanian language applications.

### Sources

- **Human-Generated Data**: The dataset was translated from the English version, retaining the original structure and categories.
- **Translation Model**: The translation was performed using the LLaMA 3.1 405B model.

### Annotation Guidelines

The translation maintains the original annotation categories and guidelines, including:

- **Creative Writing**
- **Closed QA**
- **Open QA**
- **Summarization**
- **Information Extraction**
- **Classification**
- **Brainstorming**

## Language

- Albanian

## Known Limitations

- The dataset may inherit biases and factual errors from the original dataset and the translation model.
- The quality of translation may vary based on the nuances of the Albanian language and the model's performance.

## Citation

If you use this dataset, please cite the original dataset and the translation work as follows:

```bibtex
@online{DatabricksBlog2023DollyV2,
  author  = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
  title   = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
  year    = {2023},
  url     = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
  urldate = {2023-06-30}
}

@misc{LLaMA3.1,
  author = {Meta AI},
  title  = {LLaMA 3.1 405B},
  year   = {2024},
  url    = {https://ai.meta.com/llama}
}
```
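## Example: Building Fine-Tuning Pairs

The fine-tuning use listed under Intended Uses could be sketched as follows. This is a minimal illustration, not part of the dataset's tooling: the column names come from the `dataset_info` block of the card, while the function name `build_example` and the `### Instruction` / `### Context` / `### Response` prompt template are assumptions made for this example. The sketch treats an empty `context_*` field as "no context" and omits that section from the prompt.

```python
from typing import Mapping, Optional

# Column names taken from the `dataset_info.features` block of the card.
EXPECTED_COLUMNS = (
    "instruction_en", "context_en", "response_en", "category",
    "instruction_sq", "context_sq", "response_sq",
)


def build_example(record: Mapping[str, Optional[str]], lang: str = "sq") -> dict:
    """Turn one record into a prompt/completion pair.

    The prompt template is a hypothetical choice for illustration;
    the dataset card does not prescribe one.
    """
    missing = [c for c in EXPECTED_COLUMNS if c not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")

    instruction = (record[f"instruction_{lang}"] or "").strip()
    context = (record[f"context_{lang}"] or "").strip()
    response = (record[f"response_{lang}"] or "").strip()

    # The context field may be empty, so include it only when present.
    parts = [f"### Instruction:\n{instruction}"]
    if context:
        parts.append(f"### Context:\n{context}")
    parts.append("### Response:\n")

    return {
        "prompt": "\n\n".join(parts),
        "completion": response,
        "category": record["category"],
    }


# Placeholder record (not real dataset content) showing the expected shape.
sample = {
    "instruction_en": "<English instruction>",
    "context_en": "",
    "response_en": "<English response>",
    "category": "open_qa",
    "instruction_sq": "<Albanian instruction>",
    "context_sq": "",
    "response_sq": "<Albanian response>",
}
pair = build_example(sample)
print(pair["prompt"] + pair["completion"])
```

With the Hugging Face `datasets` library installed, real records could presumably be obtained with `load_dataset("alban-labs/databricks-dolly-15k-sq", split="train")` and passed to `build_example` one at a time. Passing `lang="en"` builds the pair from the original English columns instead, which may be convenient for spot-checking a translation against its source.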
Provider:
alban-labs