American English Spoken Lexicon

Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-28.
Download link:
https://catalog.ldc.upenn.edu/LDC99L23
Resource description:

Introduction

This lexicon contains pronunciations, captured in individual audio files, for 53,602 of the most common words in English.

Data

50,892 words were chosen from LDC's CALLHOME American English Lexicon on the basis of their frequency in the data used to create the 1994 CSR Language Model Text Corpus ("CSR-III Text Corpus," LDC95T6). The sources for the language model include the Wall Street Journal (1987-1994), Associated Press (1989-1991), and San Jose Mercury News (1991), all taken from the three CD-ROM volumes of TIPSTER (LDC93T3A). To extend coverage to common words that happen not to occur in the LDC corpora sampled, an additional 2,922 words (i.e., compounds, companies, places, languages, and numerals) were added from other sources.

Each word was read by the speaker in a quiet recording studio, using a Sennheiser HMD 410 microphone and a Sony DAT recorder. The recordings were downsampled to 16 kHz for storage on disk, and the individual lexical utterances were segmented into separate waveform files with a consistent margin of silence on both sides of each word.

The CD-ROMs were created using the ISO-9660 Level 2 data format, along with Rock Ridge extensions. All common computer operating systems should be able to read the full-length file names. The corpus has since been converted to a web download.

Updates

There are no updates at this time.
Created: 2024-01-31