
RaivisDejus/latvian-text

Hugging Face | Updated 2023-04-01 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/RaivisDejus/latvian-text
Resource description:
---
annotations_creators:
- found
language:
- lv
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: Latvian text dataset
size_categories:
- 10K<n<100K
source_datasets:
- extended|tilde_model
- extended|wikipedia
- extended|europarl_bilingual
tags:
- lv
- latvian
task_categories:
- automatic-speech-recognition
task_ids: []
---

# Latvian text dataset

A data set of Latvian-language texts, intended for use in AI tool development, such as speech recognition or spellcheckers.

## Data sources used

* Latvian Wikisource articles - https://wikisource.org/wiki/Category:Latvian
* Literary works of Rainis - https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/41
* Latvian Wikipedia articles - https://huggingface.co/datasets/joelito/EU_Wikipedias
* European Parliament Proceedings Parallel Corpus - https://huggingface.co/datasets/europarl_bilingual
* Tilde MODEL Corpus - Multilingual Open Data for European Languages - https://huggingface.co/datasets/tilde_model

To get the Wikipedia dataset (197 MB), run:

```
python tools/wikipedia/GetWikipedia.py
```

To get the Europarl dataset (1.7 GB), run:

```
python tools/europarl/GetEuroparl.py
```

To get the Tilde dataset (834 MB), run:

```
python tools/europarl/GetTilde.py
```

To combine all datasets, run:

```
sh combine-all.sh
```

To clean out some junk, run:

```
sh clean.sh
```

You may also want to remove duplicate lines. To do so, run:

```
sort lv.txt | uniq > lv-uniq.txt
```

## Notes

Possible future sources

* Parliament proceedings transcripts - https://www.saeima.lv/lv/transcripts
* Discussions of Latvian Wikipedia pages - https://lv.wikipedia.org/wiki/Special:AllPages
* Out-of-copyright books from the LNB collection - https://data.gov.lv/dati/lv/dataset/gramatu-digitala-kolekcija

Data sets not used

* Web scrapes, as they tend to yield data from comments with improper spelling, like "atrashanaas vieta" instead of "atrašanās vieta"
* Open Subtitles, as they contain data with improper spelling, like "atrashanaas vieta" instead of "atrašanās vieta"

Possible issues:

* The data sets contain foreign-script characters, such as "蠻子", or Cyrillic, for example "Рига"
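
The foreign-script issue and the duplicate-line step can be illustrated with a small Python sketch. This is not what `clean.sh` does (its contents are not shown here); the function names `has_non_latin_letters` and `clean_lines` are hypothetical, and the script check is a rough heuristic based on Unicode character names. Unlike `sort | uniq`, it keeps lines in their original order.

```python
import unicodedata


def has_non_latin_letters(line: str) -> bool:
    """Return True if any alphabetic character is not a Latin-script letter.

    Heuristic (an assumption, not the dataset author's rule): a letter counts
    as Latin when its Unicode name starts with "LATIN", which covers Latvian
    letters such as "š" and "ā" but not CJK or Cyrillic.
    """
    for ch in line:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def clean_lines(lines):
    """Yield non-empty, Latin-only lines, skipping duplicates (first one wins)."""
    seen = set()
    for line in lines:
        text = line.rstrip("\n")
        if not text or has_non_latin_letters(text) or text in seen:
            continue
        seen.add(text)
        yield text


if __name__ == "__main__":
    sample = ["atrašanās vieta\n", "Рига\n", "atrašanās vieta\n", "Rīga 2023\n"]
    for kept in clean_lines(sample):
        print(kept)
```

Dropping whole lines is the bluntest option; depending on the use case it may be preferable to strip only the offending characters, or to keep such lines for a separate review pass.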
Provider:
RaivisDejus
Summary of original information

Latvian text dataset

Basic information

  • Language: Latvian (lv)
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • Dataset size: 10K<n<100K
  • Pretty name: Latvian text dataset

Data sources

  • Latvian Wikisource articles: https://wikisource.org/wiki/Category:Latvian
  • Literary works of Rainis: https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/41
  • Latvian Wikipedia articles: https://huggingface.co/datasets/joelito/EU_Wikipedias
  • European Parliament Proceedings Parallel Corpus: https://huggingface.co/datasets/europarl_bilingual
  • Tilde MODEL Corpus: https://huggingface.co/datasets/tilde_model

Task categories

  • Automatic speech recognition