CIIRC-NLP/gsm8k-cs
Source: Hugging Face. Updated 2024-09-03; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/CIIRC-NLP/gsm8k-cs
Resource description:
---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: int64
  - name: thoughts
    dtype: string
  splits:
  - name: train
    num_bytes: 3815742
    num_examples: 7473
  - name: test
    num_bytes: 690959
    num_examples: 1319
  download_size: 2747390
  dataset_size: 4506701
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
language:
- cs
pretty_name: Czech GSM8K
license: mit
size_categories:
- 1K<n<10K
---
# Czech GSM8K
This is a Czech translation of the original [GSM8K](https://huggingface.co/datasets/gsm8k) dataset, created using the [WMT 21 En-X](https://huggingface.co/facebook/wmt21-dense-24-wide-en-x) model.
The socratic variant of the original dataset is not included.
The translation was completed for use within the [Czech-Bench](https://gitlab.com/jirkoada/czech-bench) evaluation framework.
The script used for translation can be reviewed [here](https://gitlab.com/jirkoada/czech-bench/-/blob/main/benchmarks/dataset_translation.py?ref_type=heads).
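As a minimal sketch of how the dataset might be loaded, assuming the Hugging Face `datasets` library is installed and the repository id `CIIRC-NLP/gsm8k-cs` from the page title is used: the helper names `load_czech_gsm8k` and `split_sizes_match` are hypothetical and only serve this illustration. The expected example counts are copied from the card metadata.

```python
# Expected example counts, taken from the dataset card metadata.
EXPECTED_EXAMPLES = {"train": 7473, "test": 1319}


def load_czech_gsm8k():
    """Download both splits from the Hugging Face Hub (hypothetical helper)."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("CIIRC-NLP/gsm8k-cs")


def split_sizes_match(splits) -> bool:
    """Check that every split has the number of examples the card reports."""
    return all(len(splits[name]) == count for name, count in EXPECTED_EXAMPLES.items())


if __name__ == "__main__":
    data = load_czech_gsm8k()  # requires network access
    print(split_sizes_match(data))
    print(data["train"][0]["question"])
```

Each record should expose the `question`, `answer`, and `thoughts` fields listed in the metadata.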
## Dataset structure
The dataset structure has been altered to ease the translation process.
The original 'answer' field has been split into two fields: 'thoughts', which contains the reasoning text with the calculation commands enclosed in '<<...>>' removed, and 'answer', which contains the final numerical answer as an integer.
This process can be replicated with the original data using the following map function:
```python
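import re

def map_fn(example):
    # Everything before the '####' marker is the reasoning text.
    match = re.search(r'^(.*)####', example['answer'], re.DOTALL)
    if match:
        # Drop the calculation commands enclosed in '<<...>>'.
        example['thoughts'] = re.sub(r'<<.*?>>', '', match.group(1))
    else:
        example['thoughts'] = ""
    # The final answer follows '#### ' and may contain thousands separators.
    match = re.search(r'#### ([\d,-]+)$', example['answer'])
    ans = match.group(1)
    ans = ans.replace(',', '')
    example['answer'] = int(ans)
    return example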
```
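To make the split concrete, here is a small self-contained sketch applied to one English record in the original GSM8K answer format. `split_answer` is a hypothetical helper, not part of the author's translation script. It uses `str.rpartition` instead of the regular expressions above and raises an error when the `####` marker is missing; this is an assumption made for the illustration.

```python
import re


def split_answer(raw: str) -> tuple[str, int]:
    """Split an original-format answer string into (thoughts, final answer)."""
    # Text before the last '####' is the reasoning; text after it is the result.
    head, marker, tail = raw.rpartition("####")
    if not marker:
        raise ValueError("no '####' marker found in the answer string")
    # Remove the calculation commands enclosed in '<<...>>'.
    thoughts = re.sub(r"<<.*?>>", "", head)
    # Strip whitespace and thousands separators before converting to int.
    return thoughts, int(tail.strip().replace(",", ""))


# Illustrative record in the original (English) answer format.
sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

thoughts, answer = split_answer(sample)
print(thoughts)
print(answer)  # 72
```

The reasoning text keeps its trailing newline, as it would with the regular-expression version, so callers that need a clean string may want to strip it.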
## Citation
Original dataset:
```bibtex
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
```
Czech translation:
```bibtex
@masterthesis{jirkovsky-thesis,
author = {Jirkovský, Adam},
title = {Benchmarking Techniques for Evaluation of Large Language Models},
school = {Czech Technical University in Prague, Faculty of Electrical Engineering},
year = 2024,
URL = {https://dspace.cvut.cz/handle/10467/115227}
}
```
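Provider:
CIIRC-NLP
Summary of original information
Dataset overview
Dataset name
Czech GSM8K
Dataset features
- question: string
- answer: integer (int64)
- thoughts: string
Dataset splits
- train: 7473 examples, 3815742 bytes in total
- test: 1319 examples, 690959 bytes in total
Dataset size
- Download size: 2747390 bytes
- Total dataset size: 4506701 bytes
Language
- Czech (cs)
License
- MIT
Size category
- 1K<n<10K
Dataset structure
- The original answer field is split into two fields, thoughts and answer. thoughts contains the reasoning string with the calculation commands removed; answer contains the final numerical answer.
Citation
- Original dataset: Cobbe et al., 2021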



