SAMPLE LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | ...
Source: Databricks | Indexed: 2025-10-09
Download link:
https://marketplace.databricks.com/details/2a2274f1-c6f1-47b8-a9af-ef2122f30120/Oxford-Languages_SAMPLE-LATAM-Data-Suite-1.8M+-Sentences-Natural-Language-Processing-(NLP)-Data-TTS-
Resource overview:
LATAM Data Suite provides high-quality datasets in Spanish, Portuguese, and American English. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
- Monolingual and Bilingual Dictionary Data
Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
- Sentences
Curated examples of real-world usage with contextual annotations.
- Synonyms & Antonyms
Lexical relations to support semantic search, paraphrasing, and language understanding.
- Audio Data
Native speaker recordings for TTS and pronunciation modeling.
- Word Lists
Frequency-ranked and thematically grouped lists.
Learn more about the datasets included in the data suite:
1. Portuguese Monolingual Dictionary Data
2. Portuguese Bilingual Dictionary Data
3. Spanish Monolingual Dictionary Data
4. Spanish Bilingual Dictionary Data
5. Spanish Sentences Data
6. Spanish Synonyms and Antonyms Data
7. Spanish Audio Data
8. Spanish Word List Data
9. American English Monolingual Dictionary Data
10. American English Synonyms and Antonyms Data
11. American English Pronunciations with Audio
Key Features (approximate numbers):
1. Portuguese Monolingual Dictionary Data
Our Portuguese monolingual dictionary data covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.
- Words: 143,600
- Senses: 285,500
- Example sentences: 69,300
- Format: XML format
- Delivery: Email (link-based file sharing)
2. Portuguese Bilingual Dictionary Data
The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is reviewed and updated annually by our in-house team of language experts. It offers comprehensive coverage of the language, with a substantial volume of high-quality translations spanning both European and Latin American Portuguese varieties.
- Translations: 300,000
- Senses: 158,000
- Example translations: 117,800
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually
3. Spanish Monolingual Dictionary Data
Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
- Words: 73,000
- Senses: 123,000
- Example sentences: 104,000
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually
4. Spanish Bilingual Dictionary Data
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, with a large volume of high-quality translations.
- Translations: 221,300
- Senses: 103,500
- Example sentences: 74,500
- Example translations: 83,800
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually
5. Spanish Sentences Data
Spanish sentences retrieved from a corpus, totalling approximately 20 million words, are well suited to NLP model training. The sentences cover a wide range of Spanish-speaking countries, and each is tagged with its country or dialect.
- Sentences volume: 1,840,000
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
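Because the sentences are tagged by country or dialect, a common first step is to group them by that tag. The following is a minimal sketch only: the field names `text` and `region`, the region codes, and the placeholder records are assumptions made for illustration and do not reflect the documented schema of the delivered JSON.

```python
import json
from collections import defaultdict

# Placeholder records for illustration only. The field names "text" and
# "region" are assumptions, not the documented schema of the dataset.
SAMPLE_JSON = """
[
  {"text": "Placeholder sentence one.", "region": "MX"},
  {"text": "Placeholder sentence two.", "region": "AR"},
  {"text": "Placeholder sentence three.", "region": "MX"}
]
"""


def group_by_region(records):
    """Group sentence texts by their (assumed) region tag."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["region"]].append(record["text"])
    return dict(grouped)


records = json.loads(SAMPLE_JSON)
by_region = group_by_region(records)

# Print how many sentences carry each region tag.
for region, sentences in sorted(by_region.items()):
    print(region, len(sentences))
```

With the real data, the field names would need to be adjusted to match the delivered files.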
6. Spanish Synonyms and Antonyms Data
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
- Synonyms: 127,700
- Antonyms: 9,500
- Format: XML format
- Delivery: Email (link-based file sharing)
- Update frequency: annually
7. Spanish Audio Data (word-level)
Curated word-level audio data for Spanish, covering all varieties of world Spanish and providing rich dialectal diversity.
- Audio files: 20,900
- Format: XLSX (for index), MP3 and WAV (audio files)
8. Spanish Word List Data
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
- Wordforms: 450,000
- Format: CSV and TXT formats
- Delivery: Email (link-based file sharing)
9. American English Monolingual Dictionary Data
Our American English Monolingual Dictionary Data is the foremost authority on American English. It includes detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.
- Words: 140,000
- Senses: 222,000
- Example sentences: 140,000
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually
10. American English Synonyms and Antonyms Data
The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
- Synonyms: 600,000
- Antonyms: 22,000
- Format: XML and JSON formats
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually
11. American English Pronunciations with Audio (word-level)
This dataset provides IPA transcriptions and mapped audio files for words in contemporary American English, with a focus on US speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.
- Transcriptions (IPA): 250,000
- Audio files: 180,000
- Format: XLSX (for transcriptions), MP3 and WAV (audio files)
- Update frequency: annually
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our datasets, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
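To get a quick sense of the CSV preview's structure, one option is to list its columns and count its rows. This is a rough sketch under stated assumptions: the column names `headword`, `pos`, and `definition` and the placeholder row are hypothetical stand-ins, and the actual sample's columns may differ.

```python
import csv
import io


def summarize_csv(handle):
    """Return the column names and row count of a CSV file-like object."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return {"columns": reader.fieldnames, "row_count": len(rows)}


# Placeholder content for illustration only. The column names are
# assumptions, not the actual layout of the sample file.
PLACEHOLDER_CSV = "headword,pos,definition\nexample,noun,placeholder definition\n"

summary = summarize_csv(io.StringIO(PLACEHOLDER_CSV))
print(summary["columns"])
print(summary["row_count"])
```

For a downloaded file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the sample was saved.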
Provider:
Oxford Languages



