Diwanshuydv/NL2MANIM_Dataset

Name: Diwanshuydv/NL2MANIM_Dataset
Creator: Diwanshuydv
Published: 2026-04-20 20:16:02
License: No description available

Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/Diwanshuydv/NL2MANIM_Dataset

Resource description:

---
# For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/datasets-cards
{}
---

# NL2MANIM_Dataset

This dataset maps natural-language scene descriptions and prompts to executable Manim Community Edition (ManimCE) Python code. It was prepared to train/finetune instruction-following models for generating Manim animations from textual specifications.

**Key facts**

- **Curated by:** Repository owner (see project repository)
- **Dataset name:** NL2MANIM_Dataset
- **Size:** 8,352 JSONL examples (training file: `data/dataset_peft_train.jsonl`)
- **Language(s):** English
- **License:** Not specified (please add an explicit license before public redistribution)

## Dataset Details

### Dataset Description

NL2MANIM_Dataset contains examples where a natural-language prompt (often framed as a `<TEXT_SCRIPT>` description) requests a Manim scene, and the target output is an executable Manim Python snippet (typically wrapped in a `<CODE>` block). The examples are formatted as line-delimited JSON (JSONL), intended for instruction tuning / PEFT workflows.

- Typical use cases: training code-generation models, instruction-following fine-tuning, automated generation of educational animations.
- Not designed for natural-language understanding benchmark evaluation beyond code generation tasks.

### Dataset Sources

- **Repository / origin:** Created and maintained inside this project workspace.

## Uses

### Direct Use

- Fine-tuning or instruct-tuning language models to generate Manim animations from textual descriptions.
- Developing systems that convert pedagogical text into visual animations.

### Out-of-Scope Use

- Composing datasets for general NLP tasks unrelated to code or animation generation.
- Using without license clearance if any examples include third-party copyrighted content.

## Dataset Structure

Each line in the JSONL is a single training example (one JSON object). Observed schema (examples):

- `messages` (array): sequence of chat-style messages. Each message is an object with keys:
  - `role`: one of `system`, `user`, `assistant`.
  - `content`: string containing the message text. For `user`, content typically includes the natural-language scene description (often wrapped in a `<TEXT_SCRIPT>` block). For `assistant`, content is the target Manim code (often wrapped in `<CODE>` and fenced with a Python block).

Example (abbreviated)

```json
{
  "messages": [
    {"role":"system","content":"You are an expert Manim..."},
    {"role":"user","content":"<TEXT_SCRIPT>Scene Sequence: 1. Display the title... </TEXT_SCRIPT>"},
    {"role":"assistant","content":"<CODE>```python\nfrom manim import *\nclass Scene...\n```</CODE>"}
  ]
}
```

### Splits

The repository currently includes a single training JSONL at `data/dataset_peft_train.jsonl` containing 8,352 examples. No official validation/test splits are provided in the repo; you may create held-out splits for evaluation.

## Dataset Creation

### Curation Rationale

The dataset was created to provide pairs of instruction-like prompts and executable Manim code to enable models to learn mapping from pedagogical natural language to animation-generating code.

### Data Collection and Processing

- Examples appear to be authored/semi-structured: prompts follow a consistent `<TEXT_SCRIPT>` pattern and assistant outputs contain Manim Python code inside `<CODE>` blocks.
- Minimal normalization performed: content preserved as text/code blocks; no attempt at executing or linting every code sample when assembling the JSONL.

### Who are the source data producers?

- The examples appear authored/curated by repository contributors (project maintainers / dataset curator). No external annotator metadata is present.

## Annotations

- None beyond the contained `messages` chat structure.

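
As a rough sketch of how the `messages` schema and the single training file described under Dataset Structure and Splits might be consumed, the snippet below parses one JSONL line into a prompt/code pair and carves off a held-out set. The helper names (`unwrap`, `parse_example`, `split_examples`), the regular expressions, and the split fraction are illustrative assumptions, not part of the dataset; because the card only says the wrappers are "often" present, both patterns fall back to the raw string.

```python
import json
import random
import re

# Wrapper patterns assumed from the card's abbreviated example.
TEXT_SCRIPT_RE = re.compile(r"<TEXT_SCRIPT>(.*?)</TEXT_SCRIPT>", re.DOTALL)
CODE_RE = re.compile(r"<CODE>\s*```(?:python)?\n(.*?)```\s*</CODE>", re.DOTALL)


def unwrap(pattern, text):
    """Return the wrapped content if the wrapper is present, else the text."""
    match = pattern.search(text)
    return match.group(1).strip() if match else text.strip()


def parse_example(line):
    """Turn one JSONL line into a (prompt, code) pair (hypothetical helper)."""
    record = json.loads(line)
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    prompt = unwrap(TEXT_SCRIPT_RE, by_role.get("user", ""))
    code = unwrap(CODE_RE, by_role.get("assistant", ""))
    return prompt, code


def split_examples(examples, held_out_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, held_out) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


# A synthetic line shaped like the abbreviated example in the card.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are an expert Manim..."},
        {"role": "user",
         "content": "<TEXT_SCRIPT>Display the title.</TEXT_SCRIPT>"},
        {"role": "assistant",
         "content": "<CODE>```python\nfrom manim import *\n```</CODE>"},
    ]
})
prompt, code = parse_example(sample_line)

# Stand-in items; with the real file, pass the list of parsed pairs instead.
train, held_out = split_examples(range(20), held_out_fraction=0.25)
```

With the real file, each line of `data/dataset_peft_train.jsonl` would be passed through `parse_example` before splitting. A fixed seed keeps the held-out set reproducible across runs.
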
## Bias, Risks, and Limitations

- The dataset is narrowly focused on Manim code generation; models trained on it will specialize in this domain and may not generalize to other code tasks.
- Some code examples may contain content that is copyrighted or that embeds third-party code snippets; verify licensing before public release.
- Code in the dataset may include unsafe patterns if executed (e.g., arbitrary Python execution). Do not run untrusted examples without review.

### Recommendations

- Add an explicit license file and dataset license metadata before publishing.
- Create a curated validation split and perform linting/formatting checks on generated code samples if you plan to run them.

## Citation

No formal paper. If you use this dataset, cite the repository and include a short description of how examples were generated.

## More Information

For the raw training file, see: [data/dataset_peft_train.jsonl](data/dataset_peft_train.jsonl)

## Dataset Card Authors

- Maintainer: repository owner / contributors

## Contact

- Please open an issue in the project repository or contact the dataset maintainer listed in the repository.
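
The linting recommendation above can be approximated without executing any sample. The sketch below uses Python's standard `ast` module, which only parses source text and never imports Manim or runs the code. The function names and the "base class name ends with `Scene`" heuristic are assumptions for illustration, not checks defined by the dataset.

```python
import ast


def check_syntax(code):
    """Return None if the code parses as Python, else a short error message."""
    try:
        ast.parse(code)  # builds a syntax tree only; nothing is executed
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def defines_scene_class(code):
    """Heuristic: is there a class whose base name ends with 'Scene'?

    Expects code that already passed check_syntax.
    """
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                # Handles both `Scene` (ast.Name) and `manim.Scene` (ast.Attribute).
                name = getattr(base, "id", None) or getattr(base, "attr", None)
                if name and name.endswith("Scene"):
                    return True
    return False


good = (
    "from manim import *\n\n"
    "class Demo(Scene):\n"
    "    def construct(self):\n"
    "        self.wait()\n"
)
bad = "class Demo(Scene):\n    def construct(self)\n        pass\n"  # missing colon

good_error = check_syntax(good)
bad_error = check_syntax(bad)
has_scene = defines_scene_class(good)
```

A syntax check of this kind catches malformed samples but says nothing about whether a scene renders correctly or is safe to run, so the card's warning about reviewing untrusted code still applies.
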

Provided by:

Diwanshuydv
