drewoodward/spanglish-sentences

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/drewoodward/spanglish-sentences
Description:
---
license: cc-by-4.0
language:
- es
- en
- multilingual
tags:
- code-switching
- spanglish
- bilingual
- spanish-english
- machine-translation
- synthetic
pretty_name: Spanglish Sentences
task_categories:
- translation
- text-generation
size_categories:
- 10K<n<100K
---

# Spanglish Sentences

A dataset of **10,576** Spanish–English code-switched ("Spanglish") sentences paired with English translations, intended for training and evaluating code-switch translation models.

## Data format

Each line of `spanglish_sentences.jsonl` is a JSON object with two fields:

| field | description |
|---|---|
| `sentence` | A Spanglish utterance (mixed Spanish / English, or monolingual in either language). |
| `english_translation` | The English translation. When the source is already English, it is reproduced unchanged. |

Example:

```json
{"sentence": "él podía escoger o una inyección o unas pastillas", "english_translation": "He could either pick an injection or some pills."}
{"sentence": "yeah como un asesino porque ya ellos tenían su comunidad ahí", "english_translation": "Yeah like a killer because they already had their community there."}
```

## Provenance

Sentences and their English translations were **generated by a large language model**. This is a synthetic dataset; none of the content corresponds to real speakers or recordings.

## Intended use

Training and evaluating **code-switch translation** systems (Spanglish → English).

## Limitations

- **Synthetic**: linguistic patterns may not faithfully reflect natural Spanglish usage in any specific community (Miami, Caribbean, Mexican-American, Chicano, etc.). Evaluate against a human-produced test set before drawing conclusions about real-world performance.
- **Translation quality is LLM-generated** and has not been human-verified. Expect noise, including cases where the "translation" simply copies the source.
- **Punctuation, capitalization, and orthography are inconsistent** (some sentences lack punctuation entirely, some mix casing).
- Many lines are short fillers (`"yeah"`, `"you know"`) where source and translation are identical; filter these out if your task requires non-trivial translation pairs.

## License

Released under the Creative Commons Attribution 4.0 license (CC BY 4.0).
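The advice in the Limitations section to filter out pairs whose translation merely copies the source can be sketched as follows. This is a minimal illustration, not part of the dataset's own tooling: the helper names (`iter_pairs`, `is_trivial`, `nontrivial_pairs`) and the normalization rule (ignore case, surrounding whitespace, and edge punctuation) are assumptions, and the second and third sample records are hypothetical lines built from the fillers the card mentions.

```python
import json
from typing import Iterable, Iterator, List

# Punctuation stripped from both ends before comparing (an assumed rule).
_EDGE_PUNCT = ".,!?¡¿\"'"


def iter_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def _normalize(text: str) -> str:
    # Compare case-insensitively, ignoring edge whitespace and punctuation,
    # because the card warns that casing and punctuation are inconsistent.
    return text.strip().strip(_EDGE_PUNCT).casefold()


def is_trivial(record: dict) -> bool:
    """True when the translation is effectively a copy of the source."""
    return _normalize(record["sentence"]) == _normalize(record["english_translation"])


def nontrivial_pairs(lines: Iterable[str]) -> List[dict]:
    """Keep only records whose translation differs from the source."""
    return [r for r in iter_pairs(lines) if not is_trivial(r)]


# First record is from the dataset card; the other two are hypothetical fillers.
sample = [
    '{"sentence": "él podía escoger o una inyección o unas pastillas", '
    '"english_translation": "He could either pick an injection or some pills."}',
    '{"sentence": "yeah", "english_translation": "Yeah."}',
    '{"sentence": "you know", "english_translation": "you know"}',
]
kept = nontrivial_pairs(sample)
print(len(kept))
```

To run it on the real file, pass an open handle instead of the sample list, for example `nontrivial_pairs(open("spanglish_sentences.jsonl", encoding="utf-8"))`. Note that this rule also drops monolingual English sentences, since the card states their translations are reproduced unchanged; loosen `is_trivial` if those pairs matter for your task.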
Provider:
drewoodward