legesher/language-decoded-data
Source: Hugging Face. Updated 2026-03-31; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/legesher/language-decoded-data
Resource description:
---
language:
- en
- zh
- es
- ur
license: apache-2.0
task_categories:
- text-generation
tags:
- code
- multilingual
- legesher
- transpilation
- tiny-aya-expedition
- language-decoded
pretty_name: Language Decoded Data
size_categories:
- 100K<n<1M
configs:
- config_name: condition-1-en-32k
  data_files:
  - split: train
    path: data/condition-1-en-32k/train-*
  - split: validation
    path: data/condition-1-en-32k/validation-*
- config_name: condition-1-en-5k
  data_files:
  - split: train
    path: data/condition-1-en-5k/train-*
  - split: validation
    path: data/condition-1-en-5k/validation-*
- config_name: condition-2-es-32k
  data_files:
  - split: train
    path: data/condition-2-es-32k/train-*
  - split: validation
    path: data/condition-2-es-32k/validation-*
- config_name: condition-2-es-5k
  data_files:
  - split: train
    path: data/condition-2-es-5k/train-*
  - split: validation
    path: data/condition-2-es-5k/validation-*
- config_name: condition-2-ur-32k
  data_files:
  - split: train
    path: data/condition-2-ur-32k/train-*
  - split: validation
    path: data/condition-2-ur-32k/validation-*
- config_name: condition-2-ur-5k
  data_files:
  - split: train
    path: data/condition-2-ur-5k/train-*
  - split: validation
    path: data/condition-2-ur-5k/validation-*
- config_name: condition-2-zh-32k
  data_files:
  - split: train
    path: data/condition-2-zh-32k/train-*
  - split: validation
    path: data/condition-2-zh-32k/validation-*
- config_name: condition-2-zh-5k
  data_files:
  - split: train
    path: data/condition-2-zh-5k/train-*
  - split: validation
    path: data/condition-2-zh-5k/validation-*
- config_name: condition-3-zh-5k
  data_files:
  - split: train
    path: data/condition-3-zh-5k/train-*
  - split: validation
    path: data/condition-3-zh-5k/validation-*
- config_name: condition-4-zh-5k
  data_files:
  - split: train
    path: data/condition-4-zh-5k/train-*
  - split: validation
    path: data/condition-4-zh-5k/validation-*
dataset_info:
- config_name: condition-1-en-32k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 403718262
    num_examples: 31818
  - name: validation
    num_bytes: 42626910
    num_examples: 3536
  download_size: 164619518
  dataset_size: 446345172
- config_name: condition-1-en-5k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 55261555
    num_examples: 4500
  - name: validation
    num_bytes: 6365959
    num_examples: 500
  download_size: 22897728
  dataset_size: 61627514
- config_name: condition-2-es-32k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 408041994
    num_examples: 31818
  - name: validation
    num_bytes: 43090956
    num_examples: 3536
  download_size: 166000000
  dataset_size: 451132950
- config_name: condition-2-es-5k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 55864731
    num_examples: 4500
  - name: validation
    num_bytes: 6432095
    num_examples: 500
  download_size: 23031674
  dataset_size: 62296826
- config_name: condition-2-ur-32k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 415552907
    num_examples: 31818
  - name: validation
    num_bytes: 43879443
    num_examples: 3536
  download_size: 166000000
  dataset_size: 459432350
- config_name: condition-2-ur-5k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 56906247
    num_examples: 4500
  - name: validation
    num_bytes: 6545730
    num_examples: 500
  download_size: 23158039
  dataset_size: 63451977
- config_name: condition-2-zh-32k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 405515831
    num_examples: 31818
  - name: validation
    num_bytes: 45065811
    num_examples: 3536
  download_size: 165387142
  dataset_size: 450581642
- config_name: condition-2-zh-5k
  features:
  - name: file_path
    dtype: string
  - name: code
    dtype: string
  - name: code_en
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: token_count
    dtype: int32
  splits:
  - name: train
    num_bytes: 55793642
    num_examples: 4500
  - name: validation
    num_bytes: 6422792
    num_examples: 500
  download_size: 22978834
  dataset_size: 62216434
- config_name: condition-3-zh-5k
  features:
  - name: file_path
    dtype: large_string
  - name: code
    dtype: large_string
  - name: code_en
    dtype: string
  - name: language
    dtype: large_string
  - name: license
    dtype: large_string
  - name: token_count
    dtype: int64
  - name: source_type
    dtype: large_string
  splits:
  - name: train
    num_bytes: 40782466
    num_examples: 4500
  - name: validation
    num_bytes: 4531385
    num_examples: 500
  download_size: 17299185
  dataset_size: 45313851
- config_name: condition-4-zh-5k
  features:
  - name: filename
    dtype: string
  - name: content
    dtype: string
  - name: extension
    dtype: string
  - name: source
    dtype: string
  - name: quality_tier
    dtype: string
  - name: sha256
    dtype: string
  - name: byte_size
    dtype: int64
  - name: total_lines
    dtype: int64
  - name: cjk_ratio
    dtype: float64
  - name: has_cjk
    dtype: bool
  splits:
  - name: train
    num_bytes: 44246508
    num_examples: 6553
  - name: validation
    num_bytes: 7522476
    num_examples: 729
  download_size: 18300000
  dataset_size: 51768984
---
# Language Decoded | Multilingual Code Dataset
Multilingual Python code datasets for the **Language Decoded** project (part of [Cohere's Tiny Aya Expedition](https://aya.for.ai)), investigating whether code's reasoning benefit for language models is **language-dependent** or **structure-dependent**.
## Research Question
> Does fine-tuning on non-English code (Python with translated keywords) improve multilingual reasoning as much as English code does?
Prior work ([Aryabumi et al., 2024 -- "To Code or Not to Code"](https://arxiv.org/abs/2408.10914)) demonstrated that including English code in pre-training data improves downstream reasoning performance by approximately 8%. However, that study only tested English code. This dataset enables the natural follow-up: does the reasoning benefit come from the _structure_ of code, or from the _language_ of its keywords?
## Dataset Description
This dataset provides filtered, quality-controlled Python source code in multiple configurations: the original English, three keyword-swapped variants (Chinese, Spanish, Urdu), a blended native+transpiled mix, and strictly native Chinese code. The source data is drawn from [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup) (Python subset), filtered for quality using the following criteria:
- AST-valid Python only (must parse without errors)
- Permissive licenses only (MIT, Apache-2.0, BSD, etc.)
- 10--1000 lines of code
- Minimum 21 GitHub stars
- No autogenerated files
- SHA-256 deduplication
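The content-level checks in this list (AST validity, line count, SHA-256 deduplication) can be sketched with the standard library. This is a minimal illustration, not the project's actual pipeline: the function name `passes_content_filters`, the choice to count every line (including blanks and comments), and hashing the raw UTF-8 text are all assumptions. The license, star-count, and autogenerated-file checks depend on repository metadata and are left out.

```python
import ast
import hashlib


def passes_content_filters(source, seen_hashes, min_lines=10, max_lines=1000):
    """Return True if `source` passes the content-only checks.

    Hypothetical helper: the name and the exact definitions of "lines" and
    "duplicate" are assumptions, not taken from the dataset's pipeline.
    """
    # Line-count window (assumption: all lines count, including blanks).
    n_lines = len(source.splitlines())
    if not (min_lines <= n_lines <= max_lines):
        return False

    # AST validity: the file must parse without errors.
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False

    # Exact-duplicate removal via SHA-256 of the UTF-8 bytes.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

Passing the same `seen_hashes` set across calls is what makes the second copy of an identical file fail the check.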
Keyword-swapped variants are produced using [Legesher](https://github.com/legesher/legesher) v0.7.3, which translates Python reserved words (37 keywords, 72 builtins, 66 exceptions) into the target language while preserving code structure and semantics.
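To make the idea of keyword swapping concrete, here is a toy sketch built on Python's `tokenize` module. It is not Legesher's API or implementation: the function `swap_keywords` and the small Spanish mapping `TOY_ES` are hypothetical, and only keywords are handled (no builtins or exceptions). It shows why structure is preserved: only keyword tokens are rewritten, while strings, comments, and identifiers are left alone.

```python
import io
import keyword
import tokenize

# Hypothetical mapping for illustration only; not Legesher's vocabulary.
TOY_ES = {"def": "definir", "return": "retornar", "if": "si", "else": "sino"}


def swap_keywords(source: str, mapping: dict[str, str]) -> str:
    """Replace Python keyword tokens using `mapping`, leaving all else as-is."""
    lines = source.splitlines(keepends=True)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Walk tokens right-to-left so earlier column offsets stay valid
    # after a replacement changes the length of a line.
    for tok in reversed(tokens):
        is_keyword = tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
        if is_keyword and tok.string in mapping:
            row, col = tok.start
            end_col = tok.end[1]
            line = lines[row - 1]
            lines[row - 1] = line[:col] + mapping[tok.string] + line[end_col:]
    return "".join(lines)
```

Because the tokenizer classifies `"if"` inside a string literal as a `STRING` token and `# return` as a `COMMENT` token, neither is touched, which matches the keyword-only behavior described in the Limitations section.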
## Available Configs
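Conditions 1 and 2 are available in two sizes: `-32k` (full filtered corpus, 31,818 train + 3,536 validation) and `-5k` (stratified subset, 4,500 train + 500 validation). Conditions 3 and 4 are published only as `-5k` configs; note that `condition-4-zh-5k` is larger than its name suggests (6,553 train + 729 validation). The `-5k` subsets are used for QLoRA fine-tuning on consumer GPUs.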
| Config | Condition | Language | Description | Train | Val |
| -------------------- | ----------- | -------- | ------------------------------------------------------------ | ------ | ----- |
| `condition-1-en-32k` | 1 (control) | English | Unmodified filtered Python from The Stack Dedup | 31,818 | 3,536 |
| `condition-1-en-5k` | 1 (control) | English | Stratified 5k subset of condition-1 | 4,500 | 500 |
| `condition-2-zh-32k` | 2 | Chinese | Keyword-swapped Python via Legesher v0.7.3 | 31,818 | 3,536 |
| `condition-2-zh-5k` | 2 | Chinese | Stratified 5k subset of condition-2-zh | 4,500 | 500 |
| `condition-2-es-32k` | 2 | Spanish | Keyword-swapped Python via Legesher v0.7.3 | 31,818 | 3,536 |
| `condition-2-es-5k` | 2 | Spanish | Stratified 5k subset of condition-2-es | 4,500 | 500 |
| `condition-2-ur-32k` | 2 | Urdu | Keyword-swapped Python via Legesher v0.7.3 | 31,818 | 3,536 |
| `condition-2-ur-5k` | 2 | Urdu | Stratified 5k subset of condition-2-ur | 4,500 | 500 |
| `condition-3-zh-5k` | 3 | Chinese | Blended: 3,486 native Chinese code + 1,514 transpiled Python | 4,500 | 500 |
| `condition-4-zh-5k` | 4 | Chinese | Strictly native Chinese code (no transpiled code) | 6,553 | 729 |
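The config names in the table follow a `condition-<n>-<lang>-<size>` pattern, which is convenient when looping over conditions in an experiment script. The parser below is a small sketch; `ConfigName` and `parse_config_name` are hypothetical helpers, and the pattern is inferred from the ten names listed here rather than from any published naming specification.

```python
import re
from typing import NamedTuple


class ConfigName(NamedTuple):
    condition: int  # 1 through 4
    language: str   # two-letter code such as "en" or "zh"
    size: str       # size suffix such as "5k" or "32k"


# Pattern inferred from the config names in the table (an assumption).
_CONFIG_RE = re.compile(r"^condition-(\d+)-([a-z]{2})-(\d+k)$")


def parse_config_name(name: str) -> ConfigName:
    """Split a config name into condition number, language, and size."""
    match = _CONFIG_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name!r}")
    return ConfigName(int(match.group(1)), match.group(2), match.group(3))
```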
## Schema
### Conditions 1--2
Used by: `condition-1-en-*`, `condition-2-zh-*`, `condition-2-es-*`, `condition-2-ur-*`
| Column | Type | Description |
| ------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `code` | string | Python source code. For condition-2 configs, this is the transpiled (keyword-swapped) version. For condition-1, this is the original English source. |
| `code_en` | string | Original English Python source code. Identical to `code` for condition-1-en. |
| `language` | string | ISO 639-1 language code: `en`, `ur`, `zh`, or `es`. |
| `file_path` | string | Original file path in The Stack Dedup. |
| `license` | string | SPDX license identifier for the source file. |
| `token_count` | int32 | Token count computed using the CohereLabs/tiny-aya-base tokenizer. |
### Condition 3
Used by: `condition-3-zh-5k`
Condition 3 blends native Chinese code with transpiled code and adds a `source_type` column to distinguish them. `code_en` is populated for transpiled rows (keeping them in sync with conditions 1--2) but null for native code rows, which have no English equivalent.
| Column | Type | Description |
| ------------- | ----------- | ---------------------------------------------------------------------------------- |
| `file_path` | string | File identifier (native filename or transpiled file path) |
| `code` | string | The code content (native or transpiled) |
| `code_en` | string/null | English original -- populated for transpiled rows, null for native code rows |
| `language` | string | ISO 639-1 language code (`zh`) |
| `license` | string | Source license (SPDX identifier, `UNKNOWN`, or `varies`) |
| `token_count` | int64 | Token count computed using the CohereLabs/tiny-aya-base tokenizer |
| `source_type` | string | `"native"` (natively Chinese-authored) or `"transpiled"` (keyword-swapped English) |
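Because `code_en` is null for native rows, code that pairs each sample with its English original has to skip those rows. The sketch below works on plain dictionaries shaped like the table above; the helper names `partition_by_source` and `aligned_pairs` are hypothetical and not part of the dataset or the `datasets` library.

```python
def partition_by_source(rows):
    """Split condition-3 rows into (native, transpiled) lists."""
    native, transpiled = [], []
    for row in rows:
        if row["source_type"] == "native":
            native.append(row)
        elif row["source_type"] == "transpiled":
            transpiled.append(row)
        else:
            raise ValueError(f"unexpected source_type: {row['source_type']!r}")
    return native, transpiled


def aligned_pairs(rows):
    """Return (code, code_en) pairs for rows that have an English original."""
    return [(r["code"], r["code_en"]) for r in rows if r["code_en"] is not None]
```

Checking `code_en is not None` rather than `source_type` keeps `aligned_pairs` usable on condition 1 and 2 rows as well, where `code_en` is always populated.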
### Condition 4
Used by: `condition-4-zh-5k`
Condition 4 contains strictly native Chinese code -- code written by developers who think and code in Chinese. This uses the same schema as the [language-decoded-community](https://huggingface.co/datasets/legesher/language-decoded-community) dataset rather than the transpilation schema, since there is no English original to reference.
| Column | Type | Description |
| -------------- | ------- | -------------------------------------------------------------- |
| `filename` | string | Original filename |
| `content` | string | The code content |
| `extension` | string | File extension (e.g., `.py`, `.c`, `.wenyan`) |
| `source` | string | Data source (e.g., `thestack`, `wenyan`, `program_in_chinese`) |
| `quality_tier` | string | Quality rating: `A` (highest) through `D` (lowest) |
| `sha256` | string | SHA-256 hash for deduplication |
| `byte_size` | int64 | File size in bytes |
| `total_lines` | int64 | Total line count |
| `cjk_ratio` | float64 | Ratio of CJK characters in the file |
| `has_cjk` | bool | Whether the file contains CJK characters |
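The per-file metadata columns can be approximated as follows. This is a sketch under stated assumptions: the source does not define `cjk_ratio` precisely, so the example counts only the CJK Unified Ideographs block (U+4E00 to U+9FFF) and divides by the total number of characters. The function name `describe_file` is hypothetical.

```python
import hashlib


def is_cjk(ch: str) -> bool:
    # Assumption: only the CJK Unified Ideographs block counts as CJK.
    return "\u4e00" <= ch <= "\u9fff"


def describe_file(content: str) -> dict:
    """Compute condition-4 style metadata for one file's text."""
    data = content.encode("utf-8")
    cjk_chars = sum(1 for ch in content if is_cjk(ch))
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "byte_size": len(data),
        "total_lines": len(content.splitlines()),
        # Assumption: ratio is CJK characters over all characters.
        "cjk_ratio": cjk_chars / len(content) if content else 0.0,
        "has_cjk": cjk_chars > 0,
    }
```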
## Experimental Conditions
The Language Decoded experiment uses a ladder of conditions to isolate the mechanism behind code's reasoning benefit:
| Condition | Name | Purpose |
| ----------- | -------------------- | ----------------------------------------------------------------------------------------- |
| Baseline | No fine-tuning | Establishes the performance floor |
| Condition 1 | English code | Tests whether code fine-tuning helps at all (replicates Aryabumi et al.) |
| Condition 2 | Keyword-swapped code | Tests whether the _language_ of keywords matters for the reasoning benefit |
| Condition 3 | Mixed native sources | Tests whether diverse native-language code adds value beyond keyword swapping |
| Condition 4 | Strictly native code | Tests whether code authored by native speakers carries unique signal beyond transpilation |
### The Experimental Ladder
- **Baseline --> 1**: Does code help at all?
- **1 --> 2**: Does the language of keywords matter?
- **2 --> 3**: Does diversity of native-language sources add value beyond keyword swap?
- **3 --> 4**: Does code written in the cultural context of a language carry something that transpiled+mixed can't?
## Usage
```python
from datasets import load_dataset

REPO = "legesher/language-decoded-data"

# Load full-size English code (control)
ds_en_32k = load_dataset(REPO, "condition-1-en-32k")
# Load 5k subset (for QLoRA fine-tuning)
ds_en_5k = load_dataset(REPO, "condition-1-en-5k")

# Load keyword-swapped variants
ds_zh = load_dataset(REPO, "condition-2-zh-5k")
ds_es = load_dataset(REPO, "condition-2-es-5k")
ds_ur = load_dataset(REPO, "condition-2-ur-5k")

# Load blended native + transpiled (condition 3)
ds_c3 = load_dataset(REPO, "condition-3-zh-5k")
# Load strictly native code (condition 4)
ds_c4 = load_dataset(REPO, "condition-4-zh-5k")

# Access splits (every config has "train" and "validation")
train = ds_c3["train"]
val = ds_c3["validation"]

# Filter condition-3 by source type (only condition 3 has `source_type`)
native_only = train.filter(lambda x: x["source_type"] == "native")
```
## Technical Details
| Parameter | Value |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------ |
| Source dataset | [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup) (Python subset) |
| Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 (legesher-core, legesher-i18n) |
| Tokenizer | CohereLabs/tiny-aya-base |
| Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
| Train/validation split | 90% / 10% (seed 42) |
| File format | Parquet (snappy compression) |
| Filtering criteria | AST-valid, permissive licenses, 10--1000 lines, min 21 GitHub stars, no autogenerated files, SHA-256 deduplication |
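The split sizes in the config table are consistent with a 90/10 cut that rounds the training share down: 35,354 files give 31,818 + 3,536, and 7,282 files give 6,553 + 729. The sketch below reproduces those counts with a seeded shuffle. It is an assumption about the procedure, since the source documents only the ratio and the seed, and `split_indices` is a hypothetical helper; the actual row assignment produced by the authors' tooling will differ.

```python
import random


def split_indices(n: int, seed: int = 42):
    """Shuffle range(n) with a fixed seed and cut it 90% / 10%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids float rounding; the train share rounds down.
    n_train = n * 9 // 10
    return indices[:n_train], indices[n_train:]
```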
## Limitations
- **Source bias**: The Stack Dedup skews toward popular, well-starred GitHub repositories, which may not represent the full diversity of Python code in the wild.
- **Keyword-only transpilation**: Legesher translates Python reserved words (keywords, builtins, exceptions) but leaves comments, docstrings, string literals, and variable/function names in their original language (typically English). This means condition-2 code is a hybrid of translated keywords and English identifiers.
- **Token count variation**: Transpiled code may have different token counts than the English original due to multi-byte characters (especially for Chinese and Urdu), even though the code structure is identical.
- **Single programming language**: The transpilation pipeline currently covers only Python. Results may not generalize to other programming languages.
- **Condition 4 scope**: Native Chinese code is limited to publicly available sources (The Stack, Wenyan, Program-in-Chinese, Qi, Mulan) and may not represent the full spectrum of Chinese-language programming.
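The token count variation noted above can be illustrated without a tokenizer by comparing character and UTF-8 byte lengths. This is only a rough proxy: actual counts depend on the CohereLabs/tiny-aya-base tokenizer, which is not loaded here. The example words are ordinary translations of `if`, chosen for illustration, and are not necessarily the ones Legesher uses.

```python
def char_and_byte_len(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


# Illustrative words only (assumption: not Legesher's actual vocabulary).
for label, word in [("English", "if"), ("Chinese", "如果")]:
    chars, nbytes = char_and_byte_len(word)
    print(f"{label}: {chars} characters, {nbytes} UTF-8 bytes")
```

A keyword of the same visible length can take three times as many bytes in Chinese, so a tokenizer that falls back to byte-level pieces may emit more tokens for structurally identical code.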
## Citation
```bibtex
@misc{language-decoded-2026,
title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/legesher/language-decoded-data}
}
```
## Links
- [Legesher on GitHub](https://github.com/legesher/legesher)
- [Tiny Aya Expedition](https://aya.for.ai)
- [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)
- [Language Decoded Community (native code)](https://huggingface.co/datasets/legesher/language-decoded-community)
- [Language Decoded Experiments (tracking)](https://huggingface.co/datasets/legesher/language-decoded-experiments)
- [Language Decoded LoRA (model hub)](https://huggingface.co/legesher/language-decoded-lora)
## License
Apache 2.0
Provider:
legesher
Collected summary
Dataset introduction

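Construction
Against the backdrop of research into how code improves the reasoning ability of language models, the Language Decoded dataset was built through a deliberately designed pipeline. Its core source data comes from the Python subset of bigcode/the-stack-dedup, filtered by strict quality criteria: the code must parse into a valid abstract syntax tree, carry a permissive license, contain between 10 and 1000 lines, come from a project with at least 21 GitHub stars, and not be autogenerated. SHA-256 hashing then removes duplicates. To study the effect of keyword language, the dataset uses the Legesher tool to translate Python's keywords, builtins, and exception classes into Chinese, Spanish, and Urdu, producing code variants with identical structure but different keyword languages.
Features
The dataset sits at the intersection of code semantic understanding and multilingual processing. Its core feature is four keyword-language variants: the original English code as the control, plus keyword-swapped Chinese, Spanish, and Urdu versions. Together these give a direct basis for testing whether the reasoning benefit of code comes from its structure or from the language of its keywords. The dataset is moderate in size, with tens of thousands of samples, and is strictly divided into training and validation splits. Notably, its Condition 3 config blends native Chinese code with transpiled code and labels the source type of each row, which makes it a useful resource for studying mixed multilingual code.
Usage
For empirical studies of how code improves language model reasoning, the dataset offers a clear path. Researchers can load each experimental condition through the Hugging Face `datasets` library by passing the corresponding config name (for example `condition-1-en-5k` or `condition-2-zh-5k`). Once loaded, the training and validation splits are available for fine-tuning. For mixed-source data such as Condition 3, the library's `filter` method can separate native code from transpiled code by the `source_type` field for finer-grained analysis. The dataset directly serves the Language Decoded project, which uses controlled experiments to identify where the benefit of code comes from.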
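Background and challenges
Background overview
How code data improves the reasoning ability of language models is an open question at the intersection of natural language processing and code intelligence. The Language Decoded dataset was created by the Legesher team in 2026 as part of Cohere's Tiny Aya Expedition, to study systematically whether the reasoning gain from code stems from its language or from its structure. Built on the Python subset of bigcode/the-stack-dedup, it applies strict quality filtering and multilingual keyword swapping to produce a parallel corpus in four versions: English, Chinese, Spanish, and Urdu. Its central question is whether non-English code improves the multilingual reasoning of language models as effectively as English code does.
Current challenges
The dataset targets a core problem in code-enhanced reasoning: distinguishing whether the gain depends on the linguistic symbols of the keywords or on the abstract logical structure of the code. The construction challenges lie mainly in creating and quality-controlling multilingual code. First, keyword-swapped code must remain strictly equivalent to the original in syntax and semantics, which depends on the accuracy and robustness of the Legesher toolchain. Second, selecting high-quality, permissively licensed, sufficiently popular samples from a large body of open-source code requires static analysis and deduplication. Third, studying the difference between native and transpiled code requires integrating a small amount of genuinely native code, whose collection, labeling, and alignment add further complexity.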
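Common scenarios
Typical use
In multilingual research on code-enhanced reasoning, the Language Decoded dataset is used to investigate how code improves model reasoning. By providing Python code in English, Chinese, Spanish, and Urdu keyword variants, it lets researchers design controlled experiments that compare model performance on multilingual reasoning tasks after fine-tuning. Typical uses include fine-tuning a multilingual base model on each variant and comparing how much each one improves logical reasoning and structured problem solving, in order to test whether the gain depends on the keyword language or on the abstract structure of code.
Academic problem addressed
The dataset addresses a core question in natural language processing: whether the reasoning improvement that code data gives language models depends on a particular language. Prior work verified the gain only for English code. By introducing keyword-swapped multilingual variants, this dataset lets researchers separate the effect of language from the effect of structure. This deepens the understanding of how code data works in training, provides an empirical basis for training multilingual models efficiently, and supports research at the intersection of code and natural language.
Derived and related work
Work derived from this dataset mainly explores the role of code data in training multilingual models. For example, research based on the condition-comparison experiments analyzes the balance between structural generalization and language specificity, and has prompted model optimization methods for programming support in low-resource languages. Related research extends to code translation and cross-lingual preservation of code semantics, providing data and evaluation benchmarks for code internationalization tools such as Legesher.
The content above was collected and summarized by 遇见数据集.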



