Diwanshuydv/NL2MANIM_Dataset
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Diwanshuydv/NL2MANIM_Dataset
Description:
---
# For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/datasets-cards
{}
---
# NL2MANIM_Dataset
This dataset maps natural-language scene descriptions and prompts to executable Manim Community Edition (ManimCE) Python code. It was prepared for training or fine-tuning instruction-following models that generate Manim animations from textual specifications.
**Key facts**
- **Curated by:** Repository owner (see project repository)
- **Dataset name:** NL2MANIM_Dataset
- **Size:** 8,352 JSONL examples (training file: `data/dataset_peft_train.jsonl`)
- **Language(s):** English
- **License:** Not specified (please add an explicit license before public redistribution)
## Dataset Details
### Dataset Description
NL2MANIM_Dataset contains examples where a natural-language prompt (often framed as a `<TEXT_SCRIPT>` description) requests a Manim scene, and the target output is an executable Manim Python snippet (typically wrapped in a `<CODE>` block). The examples are formatted as line-delimited JSON (JSONL), intended for instruction tuning / PEFT workflows.
- Typical use cases: training code-generation models, instruction-following fine-tuning, automated generation of educational animations.
- Not designed as a benchmark for natural-language understanding; its scope is limited to code generation tasks.
### Dataset Sources
- **Repository / origin:** Created and maintained inside this project workspace.
## Uses
### Direct Use
- Fine-tuning or instruct-tuning language models to generate Manim animations from textual descriptions.
- Developing systems that convert pedagogical text into visual animations.
### Out-of-Scope Use
- Composing datasets for general NLP tasks unrelated to code or animation generation.
- Using the data without license clearance, since some examples may include third-party copyrighted content.
## Dataset Structure
Each line in the JSONL is a single training example (one JSON object). Observed schema (examples):
- `messages` (array): sequence of chat-style messages. Each message is an object with keys:
  - `role`: one of `system`, `user`, `assistant`.
  - `content`: string containing the message text. For `user`, content typically includes the natural-language scene description (often wrapped in a `<TEXT_SCRIPT>` block). For `assistant`, content is the target Manim code (often wrapped in `<CODE>` and fenced with a Python block).

Example (abbreviated):
```json
{
  "messages": [
    {"role": "system", "content": "You are an expert Manim..."},
    {"role": "user", "content": "<TEXT_SCRIPT>Scene Sequence: 1. Display the title... </TEXT_SCRIPT>"},
    {"role": "assistant", "content": "<CODE>```python\nfrom manim import *\nclass Scene...\n```</CODE>"}
  ]
}
```
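
The schema above can be unpacked with the standard library alone. The following is a minimal sketch, assuming the `<TEXT_SCRIPT>` and `<CODE>` wrappers look like those in the abbreviated example; the helper names `extract_pair` and `_unwrap` are illustrative and not part of the dataset. Because the card says the wrappers are only "often" present, the helper falls back to the raw message content when a wrapper is missing.

```python
import json
import re

# Build the Markdown fence marker programmatically to keep this snippet easy to embed.
FENCE = "`" * 3

# Patterns are assumptions based on the abbreviated example in this card.
TEXT_RE = re.compile(r"<TEXT_SCRIPT>(.*?)</TEXT_SCRIPT>", re.DOTALL)
CODE_RE = re.compile(r"<CODE>(.*?)</CODE>", re.DOTALL)
FENCE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def _unwrap(pattern, text):
    """Return the first captured group, or the text unchanged if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else text


def extract_pair(record):
    """Return (prompt, code) from one parsed JSONL record (hypothetical helper)."""
    # If a role occurs more than once, the last message with that role wins.
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    prompt = _unwrap(TEXT_RE, by_role.get("user", "")).strip()
    code = _unwrap(CODE_RE, by_role.get("assistant", ""))
    code = _unwrap(FENCE_RE, code).strip()
    return prompt, code


# One line of JSONL, shaped like the abbreviated example.
line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are an expert Manim..."},
        {"role": "user", "content": "<TEXT_SCRIPT>Display the title</TEXT_SCRIPT>"},
        {"role": "assistant",
         "content": "<CODE>" + FENCE + "python\nfrom manim import *\n" + FENCE + "</CODE>"},
    ]
})
prompt, code = extract_pair(json.loads(line))
print(prompt)
print(code)
```

To process the real file, the same helper could be applied to each line of `data/dataset_peft_train.jsonl` after `json.loads`.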
### Splits
The repository currently includes a single training JSONL at `data/dataset_peft_train.jsonl` containing 8,352 examples. No official validation or test splits are provided in the repo; you may create held-out splits for evaluation.
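
Since no held-out split ships with the repository, one option is a seeded shuffle so the split is reproducible. This is a sketch only: the function name `split_examples`, the 10% fraction, and the seed are assumptions chosen for illustration, and the list of dictionaries stands in for the parsed JSONL records.

```python
import random


def split_examples(examples, eval_fraction=0.1, seed=0):
    """Shuffle a copy of `examples` and return (train, held_out)."""
    shuffled = list(examples)
    # A dedicated Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for the 8,352 parsed JSONL records.
examples = [{"id": i} for i in range(8352)]
train, held_out = split_examples(examples)
print(len(train), len(held_out))
```

Fixing the seed means the same examples land in the held-out set on every run, which keeps evaluation numbers comparable across experiments.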
## Dataset Creation
### Curation Rationale
The dataset was created to provide pairs of instruction-like prompts and executable Manim code, so that models can learn the mapping from pedagogical natural language to animation-generating code.
### Data Collection and Processing
- Examples appear to be authored/semi-structured: prompts follow a consistent `<TEXT_SCRIPT>` pattern and assistant outputs contain Manim Python code inside `<CODE>` blocks.
- Minimal normalization was performed: content is preserved as text/code blocks, and the code samples were not systematically executed or linted when the JSONL was assembled.
### Who are the source data producers?
- The examples appear authored/curated by repository contributors (project maintainers / dataset curator). No external annotator metadata is present.
## Annotations
- None beyond the contained `messages` chat structure.
## Bias, Risks, and Limitations
- The dataset is narrowly focused on Manim code generation; models trained on it will specialize in this domain and may not generalize to other code tasks.
- Some code examples may contain content that is copyrighted or that embeds third-party code snippets — verify licensing before public release.
- Code in the dataset may include unsafe patterns if executed (e.g., arbitrary Python execution). Do not run untrusted examples without review.
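
One way to review examples without running them is to inspect their imports statically. The sketch below uses Python's `ast` module to list modules imported outside an allow-list; the allow-list contents and the function name `unexpected_imports` are assumptions for illustration. It is a coarse filter only: it does not detect calls such as `exec` or `__import__`, and it raises `SyntaxError` on code that does not parse.

```python
import ast

# Assumption: only these top-level modules are expected in a Manim scene.
ALLOWED_MODULES = {"manim", "numpy", "math"}


def unexpected_imports(source):
    """Return sorted top-level module names imported outside ALLOWED_MODULES."""
    found = set()
    # ast.parse builds a syntax tree without executing the code.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return sorted(found - ALLOWED_MODULES)


sample = "from manim import *\nimport os\nimport subprocess\n"
print(unexpected_imports(sample))
```

Examples flagged this way still need human review; an empty result does not make a sample safe to execute.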
### Recommendations
- Add an explicit license file and dataset license metadata before publishing.
- Create a curated validation split and perform linting/formatting checks on generated code samples if you plan to run them.
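
As a first step toward the checks recommended above, each extracted code sample can be parsed without being executed. This is a minimal sketch, assuming the code has already been pulled out of its `<CODE>` wrapper; `syntax_error` is a hypothetical helper, and a successful parse says nothing about whether the scene renders correctly in Manim.

```python
import ast


def syntax_error(source):
    """Return None if `source` parses as Python, else a short error string."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


samples = [
    # A well-formed scene skeleton.
    "from manim import *\n\nclass Demo(Scene):\n    def construct(self):\n        self.wait()\n",
    # A deliberately broken sample (unclosed parenthesis in the class header).
    "class Broken(Scene:\n    pass\n",
]
failures = [i for i, src in enumerate(samples) if syntax_error(src) is not None]
print(failures)
```

Samples that fail to parse can be dropped or repaired before training, and the remaining ones passed on to a linter or a sandboxed render.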
## Citation
There is no formal paper. If you use this dataset, cite the repository and include a short description of how examples were generated.
## More Information
For the raw training file, see: [data/dataset_peft_train.jsonl](data/dataset_peft_train.jsonl)
## Dataset Card Authors
- Maintainer: repository owner / contributors
## Contact
- Please open an issue in the project repository or contact the dataset maintainer listed in the repository.
Provider:
Diwanshuydv



