Hausa-English Code-Switched Dataset
Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-04-16.
Download link:
https://data.mendeley.com/datasets/3xjyjsf4sb
Description:
Hausa-English Code-Switched Dataset
Overview
The Hausa-English Code-Switched Dataset contains comments collected from Facebook, Instagram, YouTube, and Twitter. These comments exhibit code-switching between Hausa and English, providing a rich resource for linguistic research, natural language processing (NLP), and machine translation.
Features
Platform Support: Includes comments from Facebook, Instagram, YouTube, and Twitter.
Multilingual Data: Captures code-switching between Hausa and English, reflecting real-world multilingual usage.
Customizable: The collection approach can be adapted to other language combinations and specific data collection needs.
Data Collection Process
The dataset was collected using a custom scraper designed to gather code-switched comments from social media platforms. Here’s a brief overview of the process:
Platform Integration: Configured to work with Facebook, Instagram, YouTube, and Twitter APIs.
Multilingual Data Capture: Identified comments with code-switching between Hausa and English.
Configuration: Set up API keys and platform-specific settings.
Execution: Ran the scraper on each platform, collecting and aggregating comments.
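
The description does not say how the "Multilingual Data Capture" step decides that a comment is code-switched. One simple way to approximate it is a lexicon heuristic: keep a comment only if it contains at least one known Hausa word and at least one known English word. The sketch below assumes that approach. The word lists `HAUSA_WORDS` and `ENGLISH_WORDS` and the function `is_code_switched` are hypothetical illustrations, not the scraper's actual logic.

```python
import re

# Hypothetical, deliberately tiny lexicons for illustration only.
# A real filter would need far larger word lists or a trained language identifier.
HAUSA_WORDS = {
    "ina", "son", "wannan", "yau", "ne", "zamu", "je",
    "gidan", "abinci", "kai", "na", "gode",
}
ENGLISH_WORDS = {
    "song", "it's", "really", "great", "can't", "wait", "video",
    "is", "so", "funny", "for", "sharing", "this", "very", "informative",
}

# Lowercase words, keeping apostrophes so contractions stay whole.
TOKEN_RE = re.compile(r"[a-z']+")


def is_code_switched(comment: str) -> bool:
    """Return True if the comment has at least one word from each lexicon."""
    tokens = set(TOKEN_RE.findall(comment.lower()))
    return bool(tokens & HAUSA_WORDS) and bool(tokens & ENGLISH_WORDS)


if __name__ == "__main__":
    print(is_code_switched("Ina son wannan song, it's really great!"))
    print(is_code_switched("This video is so funny!"))
```

A word-list check like this is cheap but brittle: it misses spelling variants and words that exist in both languages, so it is best treated as a first-pass filter before manual review.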
Applications
The dataset supports various research and application domains:
Linguistic Analysis: Study code-switching patterns between Hausa and English.
NLP: Train and evaluate models for tasks like language identification and part-of-speech tagging.
Machine Translation: Use the comments and their English translations as parallel data for training translation systems.
Sociolinguistic Studies: Explore social and cultural factors influencing code-switching on social media.
Dataset Structure
The dataset is organized into a CSV file with the following columns:
Platform: The social media platform (Facebook, Instagram, YouTube, Twitter).
Date: The date the comment was posted (YYYY-MM-DD in the example entries).
Time: The time the comment was posted (HH:MM:SS in the example entries).
User ID: A unique identifier for the user.
Comment: The code-switched comment containing Hausa and English text.
English Translation: The correct English translation of the code-switched comment.
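
As a rough illustration of working with this layout, the sketch below reads rows with Python's standard `csv` module and combines the Date and Time columns into one timestamp. It assumes the file has a header row whose names match the column names listed above and that comment fields are quoted, since comments contain commas. The `CommentRow` class, the `read_rows` function, and the inline `SAMPLE_CSV` string are hypothetical and are not taken from the dataset files.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CommentRow:
    platform: str
    posted_at: datetime
    user_id: str
    comment: str
    english_translation: str


def read_rows(csv_file) -> list[CommentRow]:
    """Parse rows from an open CSV file object (header names are assumed)."""
    rows = []
    for record in csv.DictReader(csv_file):
        # Combine the separate Date and Time columns into one timestamp.
        posted_at = datetime.strptime(
            f"{record['Date']} {record['Time']}", "%Y-%m-%d %H:%M:%S"
        )
        rows.append(
            CommentRow(
                platform=record["Platform"],
                posted_at=posted_at,
                user_id=record["User ID"],
                comment=record["Comment"],
                english_translation=record["English Translation"],
            )
        )
    return rows


# Inline sample standing in for the real file; one row from the example entries.
SAMPLE_CSV = '''Platform,Date,Time,User ID,Comment,English Translation
Facebook,2023-06-15,14:23:45,user123,"Ina son wannan song, it's really great!","I love this song, it's really great!"
'''

rows = read_rows(io.StringIO(SAMPLE_CSV))
print(rows[0].platform, rows[0].posted_at.isoformat())
```

To read the actual file, pass a file object opened with `newline=""` and an explicit encoding such as UTF-8 in place of the `io.StringIO` wrapper.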
Example Entries
| Platform | Date | Time | User ID | Comment | English Translation |
|-----------|------------|----------|---------|----------------------------------------------|------------------------------------------------|
| Facebook | 2023-06-15 | 14:23:45 | user123 | Ina son wannan song, it's really great! | I love this song, it's really great! |
| Twitter | 2023-06-15 | 14:23:45 | user124 | Yau ne zamu je gidan abinci, can't wait! | Today we are going to the restaurant, can't wait! |
| Instagram | 2023-06-15 | 14:23:45 | user125 | Kai, wannan video is so funny! | Wow, this video is so funny! |
| YouTube | 2023-06-15 | 14:23:45 | user126 | Na gode for sharing this, very informative! | Thank you for sharing this, very informative! |
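
To show how entries like these could feed the machine translation use case, the sketch below turns the four example entries into line-aligned source and target text, a layout many translation toolkits accept. This is only an illustration: the `EXAMPLES` list copies the example entries above, and the helper names `to_parallel_pairs` and `write_parallel` are hypothetical.

```python
import io
from collections import Counter

# (platform, code-switched comment, English translation) from the example entries.
EXAMPLES = [
    ("Facebook", "Ina son wannan song, it's really great!",
     "I love this song, it's really great!"),
    ("Twitter", "Yau ne zamu je gidan abinci, can't wait!",
     "Today we are going to the restaurant, can't wait!"),
    ("Instagram", "Kai, wannan video is so funny!",
     "Wow, this video is so funny!"),
    ("YouTube", "Na gode for sharing this, very informative!",
     "Thank you for sharing this, very informative!"),
]


def to_parallel_pairs(examples):
    """Drop the platform and keep (source, target) sentence pairs."""
    return [(comment, translation) for _, comment, translation in examples]


def write_parallel(pairs, src_out, tgt_out):
    """Write one sentence per line so line i of both outputs is aligned."""
    for source, target in pairs:
        src_out.write(source + "\n")
        tgt_out.write(target + "\n")


pairs = to_parallel_pairs(EXAMPLES)
platform_counts = Counter(platform for platform, _, _ in EXAMPLES)

# In-memory buffers stand in for real output files here.
src_buffer, tgt_buffer = io.StringIO(), io.StringIO()
write_parallel(pairs, src_buffer, tgt_buffer)
print(src_buffer.getvalue().splitlines()[0])
print(tgt_buffer.getvalue().splitlines()[0])
```

Keeping the platform counts alongside the pairs makes it easy to check whether any one platform dominates a training split.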
Conclusion
The Hausa-English Code-Switched Dataset is a valuable resource for researchers and practitioners in linguistics, NLP, and machine translation. It provides real-world examples of code-switching, supporting the development of robust models and tools for handling multilingual text in diverse contexts. Explore the dataset and contribute to its ongoing development and application.
Provider: Mendeley Data
Created: 2024-07-19



