
a3xrfgb/amharic-sentences-corpus

Source: Hugging Face. Updated 2026-02-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/a3xrfgb/amharic-sentences-corpus
Resource description:
---
license: apache-2.0
---

![zz7eqhpgTF5HXxf5ybO_9](https://cdn-uploads.huggingface.co/production/uploads/6946bd29d36538f56ad17d29/cf5d6LungPELKN8hhKstW.jpeg)

# Amharic Sentences Corpus V1.0

# Source: [Telegram](https://et.tgstat.com)

This corpus of 1.6 million Amharic sentences reflects Amharic usage as of December 20, 2025, and is designed for anyone interested in:

- Training Amharic-based LLMs
- Fine-tuning NLP models
- Building search, summarization, or generative systems in Amharic

The dataset is heavily cleaned and normalized, but like any serious LLM dataset, it still needs proper tokenization for pre-training. I recommend using an Amharic-specific tokenizer such as: https://pypi.org/project/amharic-tokenizer/0.2.6

Other useful Amharic corpora are available from [@rasyosef](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) and [@Addis AI](https://huggingface.co/datasets/addisai/wikipedia-amharic).

# How did I create this text corpus?

- I vibe coded a simple yet powerful script that creates sentences from a .json file that I downloaded from Telegram channels. If you want to create your own text corpus, feel free to use my script: https://github.com/a3xrfgb/HuggingFace_dataset_creator

This project is fully open-source and community-driven. If you're building in NLP, AI research, or language technology, this is for you. Use it. Improve it. Build on top of it.
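To illustrate the idea of turning a Telegram export into sentences, here is a minimal sketch. It is not the script from the linked repository. It assumes the JSON layout commonly produced by Telegram Desktop exports (a top-level `messages` list whose `text` field is either a string or a list of strings and entity objects), and the function names, the `min_chars` threshold, and the sample data are all hypothetical.

```python
import json
import re

# Sentence terminators: Ethiopic full stop (።), Ethiopic question mark (፧), ? and !
_SENTENCE = re.compile(r"[^።፧?!]+[።፧?!]*")
# Any character from the Unicode Ethiopic block (U+1200 to U+137F)
_ETHIOPIC = re.compile(r"[\u1200-\u137F]")


def flatten_text(text):
    """Join a message `text` field into one string (assumed export layout)."""
    if isinstance(text, str):
        return text
    if not isinstance(text, list):
        return ""
    parts = []
    for piece in text:
        if isinstance(piece, str):
            parts.append(piece)
        elif isinstance(piece, dict):
            parts.append(piece.get("text", ""))
    return "".join(parts)


def split_sentences(text):
    """Collapse whitespace, then cut after each run of sentence terminators."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def sentences_from_export(export, min_chars=5):
    """Return unique sentences containing Ethiopic script, in first-seen order."""
    seen = set()
    sentences = []
    for message in export.get("messages", []):
        for sentence in split_sentences(flatten_text(message.get("text", ""))):
            if len(sentence) < min_chars:        # hypothetical length filter
                continue
            if not _ETHIOPIC.search(sentence):   # drop URLs and non-Amharic fragments
                continue
            if sentence not in seen:             # exact-match deduplication
                seen.add(sentence)
                sentences.append(sentence)
    return sentences


# Made-up sample standing in for an exported result file.
SAMPLE_EXPORT = """
{"messages": [
  {"id": 1, "type": "message", "text": "ሰላም ለሁላችሁ። ዛሬ አየሩ በጣም ጥሩ ነው።"},
  {"id": 2, "type": "message",
   "text": ["እንዴት ናችሁ? ", {"type": "link", "text": "https://example.com"}]},
  {"id": 3, "type": "message", "text": "ሰላም ለሁላችሁ። ዛሬ አየሩ በጣም ጥሩ ነው።"}
]}
"""

for line in sentences_from_export(json.loads(SAMPLE_EXPORT)):
    print(line)
```

Exact-match deduplication and a script-based filter are deliberately simple choices here. A real pipeline would likely also normalize characters and handle abbreviations before splitting.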
## Topics

ethiopia / ethiopian / ethiopiandataset / ethiopianvision / ethiopianimages / ethiopianphotography / ethiopianvisuals / ethiopianculture / ethiopianart / ethiopiandigitalculture / ethiopianmachinelearning / ethiopiancomputervision / ethiopiangenerativeai / ethiopiandiffusion / ethiopianstreetphotography / ethiopianportraits / ethiopianlifestyle / ethiopianurbanculture / ethiopiancreativephotography / ethiopianvisualarchive / ethiopianmodernculture / ethiopiandigitalart / abyssinia / abyssiniandataset / abyssinianvision / abyssinianai / abyssinianimages / abyssinianvisuals / abyssinianculture / abyssinianphotography / abyssinianmosaic / abyssinianarchive / habesha / habeshaai / habeshavision / habeshaculture / habeshavisuals / sheger / shegervision / addisababa / addisvisuals / addisphotography / africanai / africancomputervision / eastafricanai / africanvisualdataset / amharic / ኢትዮጵያ / አማርኛ / ሀበሻ
Provider:
a3xrfgb