language-decoded-experiments

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-24

Download link:

https://huggingface.co/datasets/legesher/language-decoded-experiments

Description:

This dataset is the experiment-tracking hub for the "Language Decoded" project. It contains training logs, configurations, evaluation results, and analysis. The research question is whether fine-tuning on non-English code (Python with translated keywords) improves multilingual reasoning as much as fine-tuning on English code does. The target languages are Chinese (zh), Spanish (es), and Urdu (ur). The dataset is organized into several experimental conditions, each isolating one further variable, so that the effects of code structure and of keyword language can be studied. The experiments use CohereLabs/tiny-aya-base as the base model, trained with 4-bit QLoRA. The evaluation benchmarks are MGSM (mathematical reasoning), X-CSQA (commonsense reasoning), and XNLI (natural language inference). The dataset includes the training data and results for each condition and is suited to research on multilingual code processing and reasoning tasks.
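To make "Python with translated keywords" concrete, the sketch below rewrites Python keywords using a lookup table while leaving identifiers, strings, and comments untouched. It is only an illustration: the `ES_KEYWORDS` mapping and the `translate_keywords` helper are hypothetical and are not taken from the project, whose actual keyword tables and tooling are not described in this listing.

```python
import io
import keyword
import tokenize

# Hypothetical Spanish keyword table, for illustration only.
# The project's real mapping is not given in this listing.
ES_KEYWORDS = {
    "def": "definir",
    "return": "devolver",
    "if": "si",
    "else": "sino",
    "for": "para",
    "in": "en",
    "while": "mientras",
}


def translate_keywords(source, mapping):
    """Replace Python keywords found in `mapping`; leave all other text as-is."""
    lines = source.splitlines(keepends=True)
    edits = []
    # Tokenizing (rather than using regex) keeps strings and comments intact.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (
            tok.type == tokenize.NAME
            and keyword.iskeyword(tok.string)
            and tok.string in mapping
        ):
            row, start_col = tok.start
            end_col = tok.end[1]
            edits.append((row, start_col, end_col, mapping[tok.string]))
    # Apply edits right-to-left so earlier column offsets stay valid.
    for row, start_col, end_col, new in sorted(edits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:start_col] + new + line[end_col:]
    return "".join(lines)


english = (
    "def double(xs):\n"
    "    for x in xs:\n"
    "        if x:\n"
    "            return x * 2  # if x is truthy\n"
)
print(translate_keywords(english, ES_KEYWORDS))
```

Because only keyword tokens are replaced, the program's structure is identical in both versions. That is the property that lets an experimental condition vary keyword language while holding code structure fixed.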

Created:

2026-03-13
