mohdelgaar/LingGen
Source: Hugging Face. Updated 2026-03-07; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mohdelgaar/LingGen
Resource description:
---
language:
- en
pretty_name: LingGen
task_categories:
- text-generation
size_categories:
- 1M<n<10M
---
# LingGen
Official dataset release for the paper **LingGen: Scalable Multi-Attribute Linguistic Control via Power-Law Masking**.
This dataset contains the processed training and test data used for the LingGen experiments, with precomputed linguistic control vectors for each example.
## Dataset Summary
- `train`: 6,810,672 examples
- `test`: 2,000 examples
- `ling`: 40 released control attributes used by the public codebase
- `ling_all`: full 276-dimensional linguistic feature vector

Each example includes:
- `sentence`: target text
- `source`: source dataset identifier
- `ling`: released 40-attribute control vector
- `ling_all`: full feature vector before selecting the released subset
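
The field list above can be turned into a small sanity check. The following is a minimal sketch, assuming that `ling` and `ling_all` are stored as flat sequences of 40 and 276 numbers respectively (the card states the dimensions but not the storage layout). The helper `check_example` and the sample record are hypothetical and exist only for illustration.

```python
from typing import Any, Mapping

# Dimensions taken from the dataset summary on this card.
LING_DIM = 40
LING_ALL_DIM = 276

REQUIRED_FIELDS = ("sentence", "source", "ling", "ling_all")


def check_example(example: Mapping[str, Any]) -> None:
    """Raise ValueError if a record does not match the documented fields.

    Assumption: both vectors are flat sequences that support len().
    """
    for key in REQUIRED_FIELDS:
        if key not in example:
            raise ValueError(f"missing field: {key}")
    if len(example["ling"]) != LING_DIM:
        raise ValueError(f"ling has {len(example['ling'])} values, expected {LING_DIM}")
    if len(example["ling_all"]) != LING_ALL_DIM:
        raise ValueError(
            f"ling_all has {len(example['ling_all'])} values, expected {LING_ALL_DIM}"
        )


# Synthetic record for illustration only; none of these values come from the dataset.
fake_example = {
    "sentence": "The cat sat on the mat.",
    "source": "placeholder-source",
    "ling": [0.0] * LING_DIM,
    "ling_all": [0.0] * LING_ALL_DIM,
}
check_example(fake_example)  # passes silently when the record is well formed
```

With the real data, the same check could be applied to a loaded record, for example `check_example(dataset["test"][0])`, provided the layout assumption holds.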
## Source Data
The processed examples are derived from public datasets used in the paper, including:
- C4
- SMF
- QQP
- ANLI
- MRPC
- STS-B
- RTE

This release redistributes processed text and derived linguistic features for research use. Users should consult the original source datasets for their respective licenses and usage terms.
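
Because every example carries a `source` identifier, it is straightforward to see how many examples come from each source dataset before checking the matching license. The sketch below assumes only that each record is a mapping with a `source` key. The helper `count_sources` is hypothetical, and the identifier strings in the sample rows are placeholders, since the card does not list the actual values.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_sources(examples: Iterable[Mapping[str, Any]]) -> Counter:
    """Count how many records come from each source dataset."""
    return Counter(example["source"] for example in examples)


# Synthetic rows; the identifier strings are placeholders, not real dataset values.
rows = [{"source": "c4"}, {"source": "qqp"}, {"source": "c4"}]
counts = count_sources(rows)
print(counts.most_common())  # [('c4', 2), ('qqp', 1)]
```

On the loaded dataset, `count_sources(dataset["test"])` should work the same way, since iterating over a split yields one dictionary per example. For the 6.8M-example `train` split, counting over the `source` column alone is likely to be faster than iterating full records.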
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("mohdelgaar/LingGen")
```
To use the dataset directly with the released code repository, save it to disk first:
```python
from datasets import load_dataset
dataset = load_dataset("mohdelgaar/LingGen")
dataset.save_to_disk("data/ling_sentences")
```
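
Once saved, the copy on disk can be reloaded without contacting the Hub. This is a minimal sketch using `datasets.load_from_disk`, the counterpart of `save_to_disk`. The wrapper `load_saved` is a hypothetical helper, and the default path simply mirrors the one used above.

```python
from pathlib import Path


def load_saved(path: str = "data/ling_sentences"):
    """Reload the dataset previously written with save_to_disk.

    Raises FileNotFoundError early, with a clearer message, if the
    directory does not exist.
    """
    if not Path(path).is_dir():
        raise FileNotFoundError(
            f"{path!r} not found; run the save_to_disk step first"
        )
    # Imported here so the path check works even before `datasets` is installed.
    from datasets import load_from_disk

    return load_from_disk(path)
```

Calling `load_saved()` after the save step should return an object with the same `train` and `test` splits as the original `load_dataset` call.
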
Code repository: https://github.com/CLU-UML/LingGen
## Citation
```bibtex
@misc{elgaar2026linggen,
title={LingGen: Scalable Multi-Attribute Linguistic Control via Power-Law Masking},
author={Mohamed Elgaar and Hadi Amiri},
year={2026}
}
```
Provider:
mohdelgaar



