five

datatab/open-orca-slim-serbian-mistral-prepared

收藏
Hugging Face2024-02-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/datatab/open-orca-slim-serbian-mistral-prepared
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: conversations list: - name: from dtype: string - name: value dtype: string - name: weight dtype: float64 splits: - name: train num_bytes: 900437837 num_examples: 514386 download_size: 491241498 dataset_size: 900437837 configs: - config_name: default data_files: - split: train path: data/train-* license: mit task_categories: - text-classification - token-classification - table-question-answering language: - sr pretty_name: Serbian Slim Orca size_categories: - 100K<n<1M --- # Overview > <b>Ovaj dataset je prilagođen treningu Mistral modela</b> Ovaj segment OpenOrca kolekcije predstavlja pažljivo odabranu selekciju koja omogućava ostvarivanje visokih performansi sličnih onima koje se dobijaju korišćenjem obimnijih delova našeg skupa podataka. Ovog puta, fokus je na kompaktnom skupu od približno 500.000 GPT-4 odgovora. Inovacija koja izdvaja ovu verziju jeste detaljan proces revizije uz pomoć GPT-4, gde su izdvojeni i odbačeni odgovori koji nisu u skladu sa standardima kvaliteta utvrđenim na osnovu ljudskih ocena iz FLAN skupa podataka. Zahvaljujući ovom pristupu, veličina skupa podataka svedena je na oko 500.000 stavki, čime se postiže kvalitet sličan ranijim izdanjima, ali sa značajno manjim zahtevima za računarskim resursima, odnosno samo dve trećine uobičajenih potreba. Originalan dataset na engleskom jeziku dostupan je na adresi: https://huggingface.co/datasets/Open-Orca/SlimOrca, gde zainteresovani mogu pristupiti i pregledati izvorni skup podataka. Ova adresa služi kao referentna tačka za one koji žele da uporede originalne podatke sa ovom prilagođenom verzijom. # Demo Models * Uskoro # Note > <i>Dataset je reprezentacija mainško-sisntetičkog prevoda</i> # Code to reproduce this dataset from: https://huggingface.co/datasets/datatab/open-orca-slim-serbian ```terminal pip install datasets ``` ```python from datasets import load_dataset, DatasetDict dataset = load_dataset("datatab/open-orca-slim-serbian", split = "train") mapper = {"system": "SYSTEM:", "human": "USER:", "gpt": "ASSISTANT:"} end_mapper = {"system": "\n\n", "human": "\n", "gpt": "</s>\n"} def formatting_prompts_func(batch): batch_texts = [] # This will store texts for each batch for convos in batch["conversations"]: # Iterate over each batch texts = [] # Store formatted texts for each conversation in the batch for convo in convos: # Iterate over each conversation in the batch # Check if convo is a dictionary if isinstance(convo, dict): turn = convo["from"] text = mapper[turn] + " " + convo["value"] + end_mapper[turn] texts.append(text) else: pass batch_texts.append("".join(texts)) # Concatenate all texts for a single batch item return {"text": batch_texts} # Apply the formatting_prompts_func to the dataset dataset = dataset.map(formatting_prompts_func, batched=True) # split twice and combine train_dev = dataset.train_test_split(shuffle = True, seed = 200, test_size=0.03) test_dev = train_dev['test'].train_test_split(shuffle = True, seed = 200, test_size=0.90) dataset = DatasetDict({ 'train': train_dev['train'], 'test': test_dev['test'], 'eval': test_dev['train'] }) dataset_train = dataset['train'] dataset_test = dataset['test'] dataset_eval = dataset['eval'] ``` --- # Citation ```bibtex @misc{SlimOrca, title = {SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification}, author = {Wing Lian and Guan Wang and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, url = {https://https://huggingface.co/Open-Orca/SlimOrca} } ``` ```bibtex @misc{mukherjee2023orca, title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah}, year={2023}, eprint={2306.02707}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ```bibtex @misc{longpre2023flan, title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning}, author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts}, year={2023}, eprint={2301.13688}, archivePrefix={arXiv}, primaryClass={cs.AI} } ```
提供机构:
datatab
原始信息汇总

数据集概述

数据集信息

  • 特征:
    • conversations:
      • from: 字符串类型
      • value: 字符串类型
      • weight: 浮点数类型 (float64)
  • 分割:
    • train:
      • 字节数: 900437837
      • 样本数: 514386
  • 下载大小: 491241498 字节
  • 数据集大小: 900437837 字节

配置

  • 配置名称: default
    • 数据文件:
      • train: data/train-*

许可证

  • MIT

任务类别

  • 文本分类
  • 标记分类
  • 表格问答

语言

  • 塞尔维亚语 (sr)

数据集名称

  • Serbian Slim Orca

数据集大小类别

  • 100K < n < 1M
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作