
Stanley03/swaweb

Hugging Face | Updated 2026-04-04 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Stanley03/swaweb
Resource description:
---
language: sw
license: cc-by-4.0
tags:
- kiswahili
- swahili
- low-resource
- web-corpus
- nlp
- east-africa
dataset_info:
- config_name: default
  features:
  - name: source
    dtype: string
  - name: title
    dtype: string
  - name: url
    dtype: string
  - name: content
    dtype: string
  - name: language
    dtype: string
  - name: text
    dtype: string
  - name: domain
    dtype: string
  - name: id
    dtype: float64
  - name: version
    dtype: string
  - name: date_added
    dtype: string
  splits:
  - name: train
    num_bytes: 325835
    num_examples: 242
  download_size: 169897
  dataset_size: 325835
- config_name: swahili_news_corpus
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': uchumi
          '1': kitaifa
          '2': michezo
          '3': kimataifa
          '4': burudani
          '5': afya
  splits:
  - name: train
    num_bytes: 49517843
    num_examples: 22207
  download_size: 28693820
  dataset_size: 49517843
- config_name: swahili_sentiments_1point5_million
  features:
  - name: Swahili
    dtype: string
  - name: sentiment
    dtype: string
  splits:
  - name: train
    num_bytes: 119308210
    num_examples: 1500000
  download_size: 72255273
  dataset_size: 119308210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: swahili_news_corpus
  data_files:
  - split: train
    path: swahili_news_corpus/train-*
- config_name: swahili_sentiments_1point5_million
  data_files:
  - split: train
    path: swahili_sentiments_1point5_million/train-*
---

# SWAWEB: The Swahili Web Corpus (v0.2)

## Dataset Description

SWAWEB (Swahili Web) is a foundational, open-source corpus of Kiswahili text data collected from diverse East African web sources. Inspired by the need for high-quality, contextually-relevant data for low-resource languages, SWAWEB aims to power the next generation of Kiswahili-centric Natural Language Processing (NLP) models.

The current version, **v0.2**, has been significantly expanded to **40 documents**, containing over **250,000 characters** (approximately **60,000 tokens**).
This version introduces greater linguistic diversity by incorporating new high-quality sources.

### Why SWAWEB?

Kiswahili is a lingua franca spoken by over 100 million people across East and Central Africa, yet it remains significantly underrepresented in global NLP resources. SWAWEB addresses this gap by providing:

1. **Contextual Diversity:** Data is sourced from both formal news media and informal community forums, capturing a wide spectrum of linguistic styles, from formal journalistic prose to colloquial, everyday language.
2. **Local Relevance:** By focusing on sources popular within Tanzania and Kenya, the corpus is rich in local names, places, cultural references, and political discourse, which is essential for building truly useful local AI applications.
3. **Ethical Foundation:** The collection process prioritizes ethical scraping practices, including rate-limiting and a focus on publicly accessible, non-personal data.

## Dataset Structure

The web corpus described here is the `default` configuration. The repository metadata also lists two further configurations, `swahili_news_corpus` and `swahili_sentiments_1point5_million`, which this card does not describe. The `default` configuration has the following features:

| Feature | Data Type | Description |
| :--- | :--- | :--- |
| `source` | `string` | The origin of the data (e.g., "JamiiForums", "Mwananchi"). |
| `title` | `string` | The title of the article or forum thread. |
| `url` | `string` | The original URL of the source document. |
| `content` | `string` | The cleaned, extracted text content. |
| `language` | `string` | The ISO 639-1 code for the language (`sw`). |

The metadata additionally declares `text`, `domain`, `version`, and `date_added` (all `string`) and `id` (`float64`) for this configuration; these fields are not documented in this card.

### Data Splits

The prototype contains a single `train` split.

| Split | Number of Documents | Estimated Tokens |
| :--- | :--- | :--- |
| `train` | 40 | ~60,000 |

## Data Sources and Collection

The v0.2 dataset was collected from four primary sources, balancing formal, informal, and educational Kiswahili:

| Source | Type | Contribution (Documents) | Linguistic Style |
| :--- | :--- | :--- | :--- |
| **JamiiForums** | Community Forum | 15 | Colloquial, diverse, high-engagement |
| **Mwananchi** | News Media | 10 | Formal, journalistic, high-quality |
| **BBC Swahili** | International News | 5 | Formal, high-standard, global context |
| **Wikipedia** | Educational | 10 | Informational, structured, encyclopedic |

### Collection Methodology

The data was collected via a custom Python-based web scraping pipeline using `requests` and `BeautifulSoup`. The process adhered to a strict **politeness policy**, including:

- **Rate Limiting:** A delay of 0.5 seconds was implemented between requests to minimize server load.
- **Filtering:** HTML tags, navigation elements, subscription prompts, and other common web noise were removed to isolate clean, continuous text. The cleaning logic has been significantly improved in v0.2.

## Ethical Considerations and Licensing

### Licensing

The SWAWEB corpus is released under the **Creative Commons Attribution 4.0 International License (CC BY 4.0)**. Users are required to give appropriate credit, provide a link to the license, and indicate if changes were made.

### Data Privacy and Filtering

- **Public Data Only:** All data was collected from publicly accessible web pages.
- **No Personally Identifiable Information (PII):** The scraping process was designed to avoid the collection of usernames, email addresses, or other PII. For forum data, only the main post content was extracted, and any potentially sensitive information was filtered during the cleaning phase.

## Future Work (Scaling SWAWEB)

The long-term goal is to scale SWAWEB to a multi-million document corpus.
Future versions will include:

- **Source Expansion:** Integration of other high-value sources like *Taifa Leo* (Kenya), *BBC Swahili*, and *Swahili Wikipedia*.
- **Advanced Cleaning:** Implementation of language identification models to filter out code-switching and non-Kiswahili content, and use of advanced deduplication techniques (e.g., MinHash).
- **Parquet Conversion:** Converting the dataset to the Parquet format for optimized storage and loading performance on the Hugging Face platform.

## How to Load SWAWEB (Example)

Once hosted on Hugging Face, the dataset can be loaded with the `datasets` library:

```python
from datasets import load_dataset

# Replace 'your-username/swaweb' with the actual repository ID
# (this listing names the repository Stanley03/swaweb)
dataset = load_dataset("your-username/swaweb", split="train")

# Print the cleaned text of the first document
print(dataset[0]['content'])
```

## Citation

Please cite this work if you use the SWAWEB corpus in your research or application:

```bibtex
@misc{swaweb_corpus,
  author = {Manus AI and User},
  title = {SWAWEB: A Foundational Kiswahili Web Corpus},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/your-username/swaweb}
}
```
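
The politeness policy described under Collection Methodology (a 0.5-second delay between requests, followed by removal of web noise) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the original used `requests` and `BeautifulSoup`, while this sketch swaps in the standard library's `html.parser` so it runs without third-party packages. The names `NOISE_TAGS`, `TextExtractor`, `clean_html`, and `crawl`, the particular set of noise tags, and the injected `fetch` callable are all assumptions made for illustration.

```python
import time
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List

# Tags whose contents are treated as web noise (an assumption for this sketch).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside NOISE_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a noise element.
        if self._noise_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    """Strip tags and noise elements, returning continuous text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, str]]:
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # rate limiting between consecutive requests
        records.append({"url": url, "content": clean_html(fetch(url)), "language": "sw"})
    return records
```

In a real run, `fetch` might be something like `lambda url: requests.get(url, timeout=10).text`. Passing it in as a parameter keeps the delay logic testable without network access.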
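
The MinHash deduplication mentioned under Future Work could look roughly like the sketch below. It is an illustration under stated assumptions rather than a planned implementation: the function names, the 5-word shingle size, the 64 hash functions, the 0.8 threshold, and the use of seeded SHA-1 as the hash family are all choices made here for the example. It compares all pairs directly, which is fine for a few dozen documents. A multi-million document corpus would normally add locality-sensitive hashing on top of the signatures.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """One member of the hash family: SHA-1 over 'seed:shingle', truncated to 64 bits."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64, k: int = 5) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty document")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def near_duplicates(
    texts: List[str], threshold: float = 0.8, num_hashes: int = 64
) -> List[Tuple[int, int]]:
    """Return index pairs whose estimated similarity reaches the threshold."""
    sigs = [minhash_signature(t, num_hashes) for t in texts]
    pairs = []
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Calling `near_duplicates(dataset["content"])` on the loaded `train` split would then list candidate duplicate pairs for manual review, assuming no document is empty.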
Provider:
Stanley03