CoRoLa Frequency Lists
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7091534
Description:
The Reference Corpus for Contemporary Romanian Language (CoRoLa) was constructed as a priority project of the Romanian Academy. It contains both written texts and oral recordings. It aims to cover the major functional language styles (legal, scientific, journalistic, imaginative, memoirs, administrative) in four domains (arts and culture, nature, society, science) and 71 sub-domains, while taking into account intellectual property rights (IPR). With over 1 billion word tokens (written and spoken), CoRoLa is one of the largest fully IPR-cleared reference corpora in the world. https://corola.racai.ro
This dataset contains multiple frequency lists extracted from CoRoLa. There are 12 word-based frequency lists and 12 lemma-based frequency lists. These were constructed only from tokens containing letters (tokens with numbers or special symbols were excluded). Lemmatization was performed automatically at corpus level using the TTL tool. The following files are available:
corola_word_freq_all: frequency list for all tokens, as they appear in the corpus
corola_word_freq_all_nodiacritics: frequency list for all tokens, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_all_lowercase: frequency list for all tokens, lowercased
corola_word_freq_all_lowercase_nodiacritics: frequency list for all tokens, lowercased and with diacritics removed
corola_word_freq_gte5: frequency list for tokens appearing at least 5 times in the corpus
corola_word_freq_gte5_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_gte5_lowercase: frequency list for tokens appearing at least 5 times in the corpus, lowercased
corola_word_freq_gte5_lowercase_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, lowercased and with diacritics removed
corola_word_freq_gte10: frequency list for tokens appearing at least 10 times in the corpus
corola_word_freq_gte10_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_word_freq_gte10_lowercase: frequency list for tokens appearing at least 10 times in the corpus, lowercased
corola_word_freq_gte10_lowercase_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, lowercased and with diacritics removed
corola_lemma_freq_all: frequency list for all lemmas, as they appear in the corpus
corola_lemma_freq_all_nodiacritics: frequency list for all lemmas, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_all_lowercase: frequency list for all lemmas, lowercased
corola_lemma_freq_all_lowercase_nodiacritics: frequency list for all lemmas, lowercased and with diacritics removed
corola_lemma_freq_gte5: frequency list for lemmas appearing at least 5 times in the corpus
corola_lemma_freq_gte5_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_gte5_lowercase: frequency list for lemmas appearing at least 5 times in the corpus, lowercased
corola_lemma_freq_gte5_lowercase_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, lowercased and with diacritics removed
corola_lemma_freq_gte10: frequency list for lemmas appearing at least 10 times in the corpus
corola_lemma_freq_gte10_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
corola_lemma_freq_gte10_lowercase: frequency list for lemmas appearing at least 10 times in the corpus, lowercased
corola_lemma_freq_gte10_lowercase_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, lowercased and with diacritics removed
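The relationship between the variants listed above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the released files: the function names (`strip_diacritics`, `build_frequency_list`) and the sample tokens are hypothetical, `str.isalpha()` is only an approximation of the "tokens containing letters" rule, and applying the frequency threshold after normalization is an assumption, since the description does not say in which order the steps were performed.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their base ASCII letters (ș -> s, ă -> a)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_frequency_list(tokens, lowercase=False, nodiacritics=False, min_count=1):
    """Count letter-only tokens, optionally normalized, keeping counts >= min_count."""
    counts = Counter()
    for token in tokens:
        # Assumption: "tokens containing letters" is approximated by isalpha(),
        # which drops tokens with digits, hyphens, or other symbols.
        if not token.isalpha():
            continue
        if lowercase:
            token = token.lower()
        if nodiacritics:
            token = strip_diacritics(token)
        counts[token] += 1
    # Assumption: the threshold is applied after normalization.
    return {form: n for form, n in counts.items() if n >= min_count}


# Hypothetical sample, not taken from CoRoLa.
sample = ["Și", "și", "si", "țară", "Țară", "2022", "e-mail", "țară"]

print(build_frequency_list(sample))                                     # "all" variant
print(build_frequency_list(sample, lowercase=True, nodiacritics=True))  # "lowercase_nodiacritics"
print(build_frequency_list(sample, lowercase=True, min_count=2))        # thresholded, like "gte5"/"gte10"
```

Each normalization merges forms that were previously distinct, which is why the normalized lists contain fewer entries than the unnormalized ones.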
Number of entries in each of the released files:

File                                            # Entries
corola_lemma_freq_all_lowercase_nodiacritics    1,375,725
corola_lemma_freq_all_lowercase                 1,457,518
corola_lemma_freq_all_nodiacritics              1,562,523
corola_lemma_freq_all                           1,635,250
corola_lemma_freq_gte10_lowercase_nodiacritics    227,590
corola_lemma_freq_gte10_lowercase                 235,234
corola_lemma_freq_gte10_nodiacritics              242,325
corola_lemma_freq_gte10                           248,593
corola_lemma_freq_gte5_lowercase_nodiacritics     351,596
corola_lemma_freq_gte5_lowercase                  365,463
corola_lemma_freq_gte5_nodiacritics               380,751
corola_lemma_freq_gte5                            392,053
corola_word_freq_all_lowercase_nodiacritics     1,685,410
corola_word_freq_all_lowercase                  1,813,746
corola_word_freq_all_nodiacritics               2,112,107
corola_word_freq_all                            2,260,992
corola_word_freq_gte10_lowercase_nodiacritics     358,577
corola_word_freq_gte10_lowercase                  381,715
corola_word_freq_gte10_nodiacritics               447,538
corola_word_freq_gte10                            473,087
corola_word_freq_gte5_lowercase_nodiacritics      517,630
corola_word_freq_gte5_lowercase                   553,031
corola_word_freq_gte5_nodiacritics                650,971
corola_word_freq_gte5                             690,676
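
As a rough illustration of how much each normalization shrinks the lists, the following Python sketch computes the relative reduction in entries for the unthresholded ("all") files. The entry counts are copied from the figures above; the `entries` dictionary and the `reduction` helper are hypothetical names introduced only for this example.

```python
# Entry counts for the unthresholded lists, copied from the released figures.
entries = {
    "corola_word_freq_all": 2_260_992,
    "corola_word_freq_all_nodiacritics": 2_112_107,
    "corola_word_freq_all_lowercase": 1_813_746,
    "corola_word_freq_all_lowercase_nodiacritics": 1_685_410,
    "corola_lemma_freq_all": 1_635_250,
    "corola_lemma_freq_all_nodiacritics": 1_562_523,
    "corola_lemma_freq_all_lowercase": 1_457_518,
    "corola_lemma_freq_all_lowercase_nodiacritics": 1_375_725,
}


def reduction(base: str, variant: str) -> float:
    """Fraction of entries lost when going from the base list to a variant."""
    return 1 - entries[variant] / entries[base]


for kind in ("word", "lemma"):
    base = f"corola_{kind}_freq_all"
    for suffix in ("nodiacritics", "lowercase", "lowercase_nodiacritics"):
        variant = f"{base}_{suffix}"
        print(f"{variant}: {reduction(base, variant):.1%} fewer entries than {base}")
```

On these figures, lowercasing removes a larger share of entries than diacritic removal does, for both the word lists and the lemma lists.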
Created:
2022-09-20



