ECI Multilingual Text
DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC94T5
Description:
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages, totaling roughly 92 million (lexical) words. The corpora are marked up in TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine sub-corpora. All the alphabetic corpora (there is also some Japanese and Chinese) are encoded in the ISO 8859 family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple systems.
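Since the description names both the markup (SGML) and the character sets (ISO 8859-1, -5 and -7), here is a minimal Python sketch of recovering plain text from a corpus file. It is an illustration under stated assumptions, not the corpus's own tooling: the `CODECS` mapping, the `plain_text` helper and the sample strings are hypothetical, and the regular expression is only a rough tag stripper that does not resolve SGML entities.

```python
import re

# Assumed mapping from script to the ISO 8859 part named in the description.
# Which subcorpus uses which part is an assumption, not taken from the corpus docs.
CODECS = {
    "latin": "iso8859_1",     # Western European languages
    "cyrillic": "iso8859_5",  # e.g. Russian, Bulgarian
    "greek": "iso8859_7",     # Greek
}

# Rough pattern for SGML tags; entities such as &eacute; are left untouched.
TAG_RE = re.compile(r"<[^>]+>")


def plain_text(raw: bytes, script: str) -> str:
    """Decode 8-bit corpus bytes and drop SGML tags, keeping the source text."""
    text = raw.decode(CODECS[script])
    return TAG_RE.sub("", text)


# Hypothetical sample: a Greek word wrapped in a paragraph tag.
sample = "<p>Καλημέρα</p>".encode("iso8859_7")
print(plain_text(sample, "greek"))
```

Decoding before stripping matters here: the tags are ASCII in all three character sets, but the text between them is only meaningful once the correct 8-bit codec has been applied.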
The amount of material per language varies from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language: (Subcorpus #) Kwords, ...; Total
German: (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999; Total 35918
French: (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667; Total 10986
Spanish: (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (32) 1667, (59) 50, (71) 21; Total 8580
English: (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667; Total 7510
Dutch: (03) 5500, (02) 600, (47) 24, (71) 21; Total 6145
Czech: (44) 4726; Total 4726
Italian: (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21; Total 4014
Chinese: (78) 2895; Total 2895
Greek: (10) 2515, (47) 24, (59) 50, (71) 21; Total 2610
Norwegian: (41) 2226; Total 2226
Swedish: (37) 1718; Total 1718
Serb/Croat/Slov: (24) 700, (56) 289; Total 989
Tibetan: (76) 834; Total 834
Portuguese: (60) 675, (47) 24, (71) 21; Total 720
Malay: (80) 563; Total 563
Russian: (73) 364; Total 364
Japanese: (57) 203; Total 203
Turkish: (20) 173, (20A) 110; Total 283
Albanian: (82) 205; Total 205
Gaelic: (55) 141; Total 141
Estonian: (39) 100; Total 100
Uzbek: (81) 88; Total 88
Latin: (74) 75; Total 75
Danish: (47) 24, (71) 21; Total 45
Lithuanian: (89) 20; Total 20
Bulgarian: (84) 5; Total 5
Total: 91969
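Each language's total in the table above appears to be the sum of its subcorpus figures. A short Python sketch of that check for three rows copied from the table follows. The `rows` structure and the `mismatches` helper are hypothetical names introduced for illustration.

```python
# Subcorpus Kwords and stated totals for three rows, copied from the table above.
rows = {
    "German": ([34291, 191, 20, 187, 59, 76, 24, 50, 21, 999], 35918),
    "Dutch": ([5500, 600, 24, 21], 6145),
    "Turkish": ([173, 110], 283),
}


def mismatches(table):
    """Return the languages whose subcorpus figures do not sum to the stated total."""
    return [lang for lang, (parts, total) in table.items() if sum(parts) != total]


print(mismatches(rows))  # prints [] for these three rows
```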
Portions © 1994 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



