
CoRoLa Frequency Lists

Indexed from NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/7091534
Description:
The Reference Corpus for Contemporary Romanian Language (CoRoLa) was constructed as a priority project of the Romanian Academy. It contains both written texts and oral recordings. Its aim is to cover the major functional language styles (legal, scientific, journalistic, imaginative, memoirs, administrative) in four domains (arts and culture, nature, society, science) and 71 sub-domains, while taking intellectual property rights (IPR) into account. With over 1 billion word tokens (written and spoken), CoRoLa is one of the largest fully IPR-cleared reference corpora in the world. https://corola.racai.ro

This dataset contains multiple frequency lists extracted from CoRoLa: 12 word-based frequency lists and 12 lemma-based frequency lists. The lists were constructed only from tokens containing letters (tokens with numbers or special symbols were excluded). Lemmatization was performed automatically at corpus level using the TTL tool.

The following files are available.

Word-based frequency lists:
corola_word_freq_all: all tokens, as they appear in the corpus
corola_word_freq_all_nodiacritics: all tokens, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_all_lowercase: all tokens, lowercased
corola_word_freq_all_lowercase_nodiacritics: all tokens, lowercased and with diacritics removed
corola_word_freq_gte5: tokens appearing at least 5 times in the corpus
corola_word_freq_gte5_nodiacritics: tokens appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_gte5_lowercase: tokens appearing at least 5 times in the corpus, lowercased
corola_word_freq_gte5_lowercase_nodiacritics: tokens appearing at least 5 times in the corpus, lowercased and with diacritics removed
corola_word_freq_gte10: tokens appearing at least 10 times in the corpus
corola_word_freq_gte10_nodiacritics: tokens appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_gte10_lowercase: tokens appearing at least 10 times in the corpus, lowercased
corola_word_freq_gte10_lowercase_nodiacritics: tokens appearing at least 10 times in the corpus, lowercased and with diacritics removed

Lemma-based frequency lists:
corola_lemma_freq_all: all lemmas, as they appear in the corpus
corola_lemma_freq_all_nodiacritics: all lemmas, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_all_lowercase: all lemmas, lowercased
corola_lemma_freq_all_lowercase_nodiacritics: all lemmas, lowercased and with diacritics removed
corola_lemma_freq_gte5: lemmas appearing at least 5 times in the corpus
corola_lemma_freq_gte5_nodiacritics: lemmas appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_gte5_lowercase: lemmas appearing at least 5 times in the corpus, lowercased
corola_lemma_freq_gte5_lowercase_nodiacritics: lemmas appearing at least 5 times in the corpus, lowercased and with diacritics removed
corola_lemma_freq_gte10: lemmas appearing at least 10 times in the corpus
corola_lemma_freq_gte10_nodiacritics: lemmas appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_gte10_lowercase: lemmas appearing at least 10 times in the corpus, lowercased
corola_lemma_freq_gte10_lowercase_nodiacritics: lemmas appearing at least 10 times in the corpus, lowercased and with diacritics removed

Number of entries in each of the released files:

File                                            # Entries
corola_lemma_freq_all_lowercase_nodiacritics    1,375,725
corola_lemma_freq_all_lowercase                 1,457,518
corola_lemma_freq_all_nodiacritics              1,562,523
corola_lemma_freq_all                           1,635,250
corola_lemma_freq_gte10_lowercase_nodiacritics  227,590
corola_lemma_freq_gte10_lowercase               235,234
corola_lemma_freq_gte10_nodiacritics            242,325
corola_lemma_freq_gte10                         248,593
corola_lemma_freq_gte5_lowercase_nodiacritics   351,596
corola_lemma_freq_gte5_lowercase                365,463
corola_lemma_freq_gte5_nodiacritics             380,751
corola_lemma_freq_gte5                          392,053
corola_word_freq_all_lowercase_nodiacritics     1,685,410
corola_word_freq_all_lowercase                  1,813,746
corola_word_freq_all_nodiacritics               2,112,107
corola_word_freq_all                            2,260,992
corola_word_freq_gte10_lowercase_nodiacritics   358,577
corola_word_freq_gte10_lowercase                381,715
corola_word_freq_gte10_nodiacritics             447,538
corola_word_freq_gte10                          473,087
corola_word_freq_gte5_lowercase_nodiacritics    517,630
corola_word_freq_gte5_lowercase                 553,031
corola_word_freq_gte5_nodiacritics              650,971
corola_word_freq_gte5                           690,676
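The four normalization variants and the frequency thresholds described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the released files. The function names `strip_diacritics` and `build_variants` are hypothetical, `str.isalpha()` only approximates the "tokens containing letters" filter, and applying the threshold after normalization is an assumption, since the description does not say at which stage the counts were cut. Lemmatization with the TTL tool is not reproduced here.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their ASCII base letters (ș -> s, ă -> a)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_variants(tokens, min_count=1):
    """Return four frequency tables, one per normalization variant.

    Assumption: only purely alphabetic tokens are counted, and the
    min_count threshold is applied after normalization.
    """
    kept = [t for t in tokens if t.isalpha()]
    transforms = {
        "as_is": lambda t: t,
        "nodiacritics": strip_diacritics,
        "lowercase": str.lower,
        "lowercase_nodiacritics": lambda t: strip_diacritics(t.lower()),
    }
    variants = {}
    for name, transform in transforms.items():
        counts = Counter(transform(t) for t in kept)
        variants[name] = {w: c for w, c in counts.items() if c >= min_count}
    return variants


# Toy input, invented for illustration; it is not taken from CoRoLa.
sample = ["Și", "și", "si", "țară", "Țară", "tara", "2022", "e-mail"]
tables = build_variants(sample)
for name, table in tables.items():
    print(name, len(table), table)
```

On the toy input, the number of distinct entries shrinks as more normalization is applied, which is the same ordering seen in the entry counts of the released files (as-is largest, lowercased without diacritics smallest).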
Created:
2022-09-20