drewoodward/spanglish-sentences
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/drewoodward/spanglish-sentences
Resource description:
---
license: cc-by-4.0
language:
- es
- en
- multilingual
tags:
- code-switching
- spanglish
- bilingual
- spanish-english
- machine-translation
- synthetic
pretty_name: Spanglish Sentences
task_categories:
- translation
- text-generation
size_categories:
- 10K<n<100K
---
# Spanglish Sentences
A dataset of **10,576** Spanish–English code-switched ("Spanglish") sentences paired with English translations, intended for training and evaluating code-switch translation models.
## Data format
Each line of `spanglish_sentences.jsonl` is a JSON object with two fields:

| field | description |
|---|---|
| `sentence` | A Spanglish utterance (mixed Spanish / English, or monolingual in either language). |
| `english_translation` | The English translation. When the source is already English, it is reproduced unchanged. |

Example:
```json
{"sentence": "él podía escoger o una inyección o unas pastillas", "english_translation": "He could either pick an injection or some pills."}
{"sentence": "yeah como un asesino porque ya ellos tenían su comunidad ahí", "english_translation": "Yeah like a killer because they already had their community there."}
```
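The format above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper name `parse_pairs` is hypothetical, and the in-memory `SAMPLE` simply reuses the two example lines shown above so the snippet runs without the file.

```python
import json
from typing import Iterable, Iterator, Tuple

# Hypothetical helper (not part of the dataset): parses JSONL lines into pairs.
def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sentence, english_translation) tuples from JSONL lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "sentence" not in record or "english_translation" not in record:
            raise ValueError(f"line {line_number}: expected both documented fields")
        yield record["sentence"], record["english_translation"]


# The two example records from this card, used as stand-in input.
SAMPLE = [
    '{"sentence": "él podía escoger o una inyección o unas pastillas", '
    '"english_translation": "He could either pick an injection or some pills."}',
    '{"sentence": "yeah como un asesino porque ya ellos tenían su comunidad ahí", '
    '"english_translation": "Yeah like a killer because they already had their community there."}',
]

for sentence, translation in parse_pairs(SAMPLE):
    print(f"{sentence!r} -> {translation!r}")
```

To read the real file, pass an open file handle instead of `SAMPLE`, for example `open("spanglish_sentences.jsonl", encoding="utf-8")`, assuming the file has been downloaded to the working directory.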
## Provenance
Sentences and their English translations were **generated by a large language model**. This is a synthetic dataset; none of the content corresponds to real speakers or recordings.
## Intended use
Training and evaluating **code-switch translation** systems (Spanglish → English).
## Limitations
- **Synthetic**: linguistic patterns may not faithfully reflect natural Spanglish usage in any specific community (Miami, Caribbean, Mexican-American, Chicano, etc.). Evaluate against a human-produced test set before drawing conclusions about real-world performance.
- **Translations are LLM-generated** and have not been human-verified. Expect noise, including cases where the "translation" simply copies the source.
- **Punctuation, capitalization, and orthography are inconsistent** (some sentences lack punctuation entirely, some mix casing).
- **Trivial pairs**: many lines are short fillers (`"yeah"`, `"you know"`) where source and translation are identical. Filter these out if your task requires non-trivial translation pairs.
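One possible way to filter out the identical pairs mentioned above is sketched here. The card does not say how "identical" should be decided. This sketch assumes a comparison that ignores case, punctuation, and extra whitespace, because casing and punctuation are inconsistent in the data. The function names `normalize`, `is_trivial`, and `drop_trivial` are hypothetical.

```python
import string
import unicodedata
from typing import Iterable, List, Tuple

# Assumption: ASCII punctuation plus Spanish inverted marks are ignored when comparing.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + "¿¡")


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for comparison only."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = text.translate(_STRIP_TABLE)
    return " ".join(text.split())


def is_trivial(sentence: str, translation: str) -> bool:
    """True when source and translation match after normalization."""
    return normalize(sentence) == normalize(translation)


def drop_trivial(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs whose translation differs from the source."""
    return [(s, t) for s, t in pairs if not is_trivial(s, t)]


sample = [
    ("yeah", "Yeah."),
    ("you know", "you know"),
    ("él podía escoger o una inyección o unas pastillas",
     "He could either pick an injection or some pills."),
]
print(drop_trivial(sample))
```

Under this rule, English-only sentences whose translation reproduces the source unchanged are dropped as well. Compare the raw strings instead if only exact copies should be removed.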
## License
Released under the Creative Commons Attribution 4.0 license (CC BY 4.0).
Provider:
drewoodward



