ILE: Italian LExicon

catalogue.elra.info · Updated 2005-05-02 · Indexed 2025-03-22
Download link:
https://catalogue.elra.info/en-us/repository/browse/ELRA-S0059/
Resource description:
ILE is a 588,000-entry Italian lexicon transcribed in SAMPA notation. It was generated, mainly for speech recognition purposes, by a morphological analyzer handling more than 100,000 morphemes, each of them transcribed and manually checked. Each stem was combined with all its possible suffixes to form valid words. Verbal forms do not include clitics.

The morpho-lexicon was obtained by processing an Italian dictionary and adding by hand all possible inflections. This base lexicon was then enriched with names and neologisms found among the 65,000 most frequent words of the newspaper "Il Sole 24 Ore". The most frequent Italian proper names and surnames (from the telephone directory), geographical names, acronyms, company names, and commonly used foreign words were also added to the lexicon.

All words are transcribed using SAMPA units for the Italian language. When a word has multiple pronunciations, one row is provided for each different transcription (about 601,000 different transcriptions in total for the 588,000-word lexicon). Stressed vowels are marked with the ASCII double-quote character ("). Foreign words are also transcribed using only SAMPA units for the Italian language, which leads to some awkward but effective transcriptions, at least for speech recognition purposes.

Some samples of ILE follow.

ANCORA "a n k o r a
ANCORA a n k "o r a
CESSARE tS e ss "a r e
CESSEREBBERO tS e ss e r "E bb e r o
CITTA' tS i tt "a
AIDS "a i d s
AIDS a i d i "E ss e
BABY-SITTER b E b i s "i tt e r
BABY-SITTER b e i b i s "i tt e r
BLUE-JEANS b l u dZ "i n s
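
The row format described above (an orthographic word followed by space-separated SAMPA units, with one row per pronunciation and a leading " on the stressed vowel) could be read with something like the following minimal sketch. The one-entry-per-line file layout and the names `parse_lexicon` and `stressed_index` are assumptions made for illustration; they are not taken from the ELRA distribution or its documentation.

```python
from collections import defaultdict

# Sample rows copied from the description; the one-row-per-line layout
# is an assumption about how the distributed file is organised.
SAMPLE = '''\
ANCORA "a n k o r a
ANCORA a n k "o r a
CESSARE tS e ss "a r e
CITTA' tS i tt "a
AIDS "a i d s
AIDS a i d i "E ss e
'''


def parse_lexicon(text):
    """Map each word to a list of transcriptions (each a list of SAMPA units)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        word, units = parts[0], parts[1:]
        lexicon[word].append(units)  # repeated words accumulate pronunciations
    return dict(lexicon)


def stressed_index(units):
    """Return the position of the unit carrying the stress mark ("), or None."""
    for i, unit in enumerate(units):
        if unit.startswith('"'):
            return i
    return None


lexicon = parse_lexicon(SAMPLE)
for word, transcriptions in lexicon.items():
    for units in transcriptions:
        print(word, units, stressed_index(units))
```

Keeping a list of transcriptions per word preserves homographs such as ANCORA, whose two rows differ only in the position of the stressed vowel.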

Provider:
ELRA Catalogue of Language Resources