
SAMPLE American English Language Datasets | 150+ Years of Research | Dictionary Data | Audio ...

Source: Databricks (listed 2025-11-22)
Download link:
https://marketplace.databricks.com/details/02df38b8-0b11-4ce5-b346-2188286a680a/Oxford-Languages_SAMPLE-American-English-Language-Datasets-150+-Years-of-Research-Dictionary-Data-Audio-
Description:
Derived from over 150 years of lexical research, these comprehensive text and audio datasets focus on American English and provide linguistically annotated data. They are ideal for NLP applications, LLM training and/or fine-tuning, and educational and game apps.

One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance.

The following American English datasets are available for license:

1. American English Monolingual Dictionary Data
2. American English Synonyms and Antonyms Data
3. American English Pronunciations with Audio

Key Features (approximate numbers):

1. American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

- Headwords: 140,000
- Senses: 222,000
- Sentence examples: 140,000
- Format: XML and JSON
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually

2. American English Synonyms and Antonyms Data

The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

- Synonyms: 600,000
- Antonyms: 22,000
- Format: XML and JSON
- Delivery: Email (link-based file sharing) and REST API
- Update frequency: annually

3. American English Pronunciations with Audio (word-level)

This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration, which suits teams building TTS systems, ASR models, and pronunciation engines.

- Transcriptions (IPA): 250,000
- Audio files: 180,000
- Format: XLSX (transcriptions), MP3 and WAV (audio files)
- Update frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD). If you have a specific use case in mind that isn't listed here, we'd be happy to explore it with you. Don't hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements, with tiered pricing for API-delivered data. Whether you're integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs. Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

Please note that some datasets may have rights restrictions. Contact us for more information.

About the sample:

To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON format for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.

Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.
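
Because the preview sample is supplied in CSV and/or JSON and its schema is not documented on this page, a first pass can simply list which fields occur and how often. The following is a minimal Python sketch of that idea, not an official loader: the helper names (`summarize_records`, `load_json_records`, `load_csv_records`) and the placeholder field names `field_a` and `field_b` are invented for illustration and are not taken from the actual sample.

```python
import csv
import io
import json
from collections import Counter


def summarize_records(records):
    """Count how often each top-level field appears across a list of dict records."""
    field_counts = Counter()
    for record in records:
        field_counts.update(record.keys())
    return dict(field_counts)


def load_json_records(text):
    """Parse JSON text into a list of dicts (a single object becomes a one-item list)."""
    data = json.loads(text)
    if isinstance(data, dict):
        return [data]
    return [item for item in data if isinstance(item, dict)]


def load_csv_records(text):
    """Parse CSV text that has a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Placeholder data only: these field names are hypothetical, not the real schema.
demo_json = '[{"field_a": "x", "field_b": 1}, {"field_a": "y"}]'
demo_csv = "field_a,field_b\nx,1\ny,2\n"

print(summarize_records(load_json_records(demo_json)))
print(summarize_records(load_csv_records(demo_csv)))
```

To apply it to the downloaded sample, read the file's text (for example with `pathlib.Path(...).read_text(encoding="utf-8")`) and pass it to the matching loader. The field counts indicate which attributes are always present and which are optional before any parsing logic is written against them.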

Provider:
Oxford Languages