
CIIRC-NLP/gsm8k-cs

Source: Hugging Face. Updated 2024-09-03. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/CIIRC-NLP/gsm8k-cs
Resource overview:
---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: int64
  - name: thoughts
    dtype: string
  splits:
  - name: train
    num_bytes: 3815742
    num_examples: 7473
  - name: test
    num_bytes: 690959
    num_examples: 1319
  download_size: 2747390
  dataset_size: 4506701
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
language:
- cs
pretty_name: Czech GSM8K
license: mit
size_categories:
- 1K<n<10K
---

# Czech GSM8K

This is a Czech translation of the original [GSM8K](https://huggingface.co/datasets/gsm8k) dataset, created using the [WMT 21 En-X](https://huggingface.co/facebook/wmt21-dense-24-wide-en-x) model. The socratic variant of the original dataset is not included.

The translation was completed for use within the [Czech-Bench](https://gitlab.com/jirkoada/czech-bench) evaluation framework. The script used for translation can be reviewed [here](https://gitlab.com/jirkoada/czech-bench/-/blob/main/benchmarks/dataset_translation.py?ref_type=heads).

## Dataset structure

The dataset structure has been altered to ease the translation process. The original 'answer' field has been split into two fields: 'thoughts', which contains the thought process string stripped of the calculation commands enclosed in '<<>>', and 'answer', which contains the final numerical answer as an integer.

This process can be replicated with the original data using the following map function:

```python
import re

def map_fn(example):
    # Everything before the '####' marker is the reasoning text.
    match = re.search(r'^(.*)####', example['answer'], re.DOTALL)
    if match:
        # Strip the calculation commands enclosed in '<<>>'.
        example['thoughts'] = re.sub(r'<<.*?>>', '', match.group(1))
    else:
        example['thoughts'] = ""
    # The final answer follows '#### ' at the end of the string.
    match = re.search(r'#### ([\d,-]+)$', example['answer'])
    ans = match.group(1)
    # Drop thousands separators before converting to an integer.
    ans = ans.replace(',', '')
    example['answer'] = int(ans)
    return example
```

## Citation

Original dataset:

```bibtex
@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
```

Czech translation:

```bibtex
@masterthesis{jirkovsky-thesis,
  author = {Jirkovský, Adam},
  title = {Benchmarking Techniques for Evaluation of Large Language Models},
  school = {Czech Technical University in Prague, Faculty of Electrical Engineering},
  year = 2024,
  URL = {https://dspace.cvut.cz/handle/10467/115227}
}
```
Provided by:
CIIRC-NLP
Summary of original information

Dataset overview

Dataset name

Czech GSM8K

Dataset features

  • question: string
  • answer: integer (int64)
  • thoughts: string

Dataset splits

  • train: 7473 examples, 3815742 bytes in total
  • test: 1319 examples, 690959 bytes in total

Dataset size

  • Download size: 2747390 bytes
  • Total dataset size: 4506701 bytes
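
As a quick consistency check, the split figures listed above can be summed and compared with the reported total. This is a minimal sketch that only restates numbers from the dataset card; the `splits` dictionary and variable names are illustrative and not part of any library.

```python
# Figures copied from the dataset card metadata.
splits = {
    "train": {"num_bytes": 3815742, "num_examples": 7473},
    "test": {"num_bytes": 690959, "num_examples": 1319},
}
reported_dataset_size = 4506701

# Sum the per-split values.
total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# The per-split byte counts add up to the reported dataset size.
size_matches = total_bytes == reported_dataset_size
# The combined example count falls in the declared 1K<n<10K category.
in_size_category = 1_000 < total_examples < 10_000

print(total_bytes, total_examples, size_matches, in_size_category)
# → 4506701 8792 True True
```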

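Language

  • Czech (cs)

License

  • MIT

Size category

  • 1K<n<10K

Dataset structure

  • The original answer field is split into two fields, thoughts and answer. thoughts contains the thought process string with the calculation commands removed; answer contains the final numerical answer.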
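
The split can be illustrated on a single record. The sketch below is not the dataset authors' code: it uses `str.rpartition` instead of the regular expressions in the dataset card, the helper name `split_answer` is hypothetical, and the sample record is hand-written in the GSM8K answer style rather than taken from the dataset.

```python
import re

# Hypothetical record in the original GSM8K answer format (illustration only).
sample_answer = (
    "Tom starts with 3*4=<<3*4=12>>12 apples.\n"
    "After eating 2 he has 12-2=<<12-2=10>>10 apples.\n"
    "#### 10"
)

def split_answer(raw):
    """Split a raw GSM8K-style answer into (thoughts, final integer answer)."""
    # Text before the last '####' is the reasoning; text after it is the result.
    reasoning, marker, final = raw.rpartition("####")
    if not marker:
        raise ValueError("answer has no '####' marker")
    # Remove the calculation commands enclosed in '<<>>'.
    thoughts = re.sub(r"<<.*?>>", "", reasoning)
    # Drop thousands separators before converting to int.
    return thoughts, int(final.strip().replace(",", ""))

thoughts, answer = split_answer(sample_answer)
print(repr(thoughts))
print(answer)  # → 10
```

Raising an error when the marker is missing is a design choice of this sketch, made so that malformed records surface early.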

