The Global English Accent Conversational NLP Dataset is a collection of validated conversational English speech recordings from native and non-native speakers in 33 countries across six regions. It is designed for training Natural Language Processing (NLP) models, conversational AI, and Automatic Speech Recognition (ASR) systems, and for linguistic research, with a focus on regional accent variation.
Regions and countries covered, with primary spoken languages:
Africa:
South Africa (English, Zulu, Afrikaans, Xhosa)
Nigeria (English, Yoruba, Igbo, Hausa)
Kenya (English, Swahili)
Ghana (English, Twi, Ewe, Ga)
Uganda (English, Luganda)
Ethiopia (English, Amharic, Oromo)
Latin America & the Caribbean:
Mexico (Spanish, English as a second language)
Guatemala (Spanish, K'iche', English)
El Salvador (Spanish, English)
Costa Rica (Spanish, English in Caribbean regions)
Colombia (Spanish, English in urban centers)
Dominican Republic (Spanish, English in tourist zones)
Brazil (Portuguese, English in urban areas)
Argentina (Spanish, English among educated speakers)
Southeast Asia & South Asia:
Philippines (Filipino, English)
Vietnam (Vietnamese, English)
Malaysia (Malay, English, Mandarin)
Indonesia (Indonesian, Javanese, English)
Singapore (English, Mandarin, Malay, Tamil)
India (Hindi, English, Bengali, Tamil)
Pakistan (Urdu, English, Punjabi)
Europe:
United Kingdom (English)
Ireland (English, Irish)
Germany (German, English)
France (French, English)
Spain (Spanish, Catalan, English)
Italy (Italian, English)
Portugal (Portuguese, English)
Oceania:
Australia (English)
New Zealand (English, Māori)
Fiji (English, Fijian)
North America:
United States (English, Spanish)
Canada (English, French)
Dataset Attributes:
- Conversational English with natural accent variation
- Global coverage with a balanced mix of male and female speakers
- Rich speaker metadata: age, gender, country, city
- Average audio length of ~30 minutes per participant
- All samples manually validated for accuracy
- Structured format suitable for machine learning and AI applications
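As a rough illustration of how the speaker metadata listed above might be used, the following minimal Python sketch models one record and checks gender balance per country. The listing does not document a schema, so the class name `SpeakerRecord`, the fields `speaker_id` and `audio_minutes`, and the sample rows are all assumptions made for illustration; only age, gender, country, and city are named in the attributes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerRecord:
    # Field names are illustrative assumptions; the listing only states that
    # age, gender, country, and city are provided.
    speaker_id: str       # hypothetical identifier, not named in the listing
    age: int
    gender: str
    country: str
    city: str
    audio_minutes: float  # hypothetical per-speaker duration field


def gender_counts_by_country(records):
    """Count speakers per (country, gender) pair to inspect balance."""
    return Counter((r.country, r.gender) for r in records)


# Made-up rows for illustration only; these are not taken from the dataset.
sample = [
    SpeakerRecord("spk_001", 34, "female", "Kenya", "Nairobi", 31.5),
    SpeakerRecord("spk_002", 27, "male", "Kenya", "Mombasa", 28.0),
    SpeakerRecord("spk_003", 45, "female", "Ireland", "Cork", 30.5),
]

counts = gender_counts_by_country(sample)
mean_minutes = sum(r.audio_minutes for r in sample) / len(sample)
print(counts[("Kenya", "female")], counts[("Kenya", "male")], mean_minutes)
```

A check like this can help confirm the stated male/female balance and the ~30-minute average for whichever subset of countries is actually used.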
Best suited for:
- NLP model training and evaluation
- Multi-accent English ASR system development
- Voice assistant and chatbot design
- Accent recognition research
- Voice synthesis and TTS modeling
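For training and evaluation use cases such as these, it is usually worth keeping all recordings from one participant in the same split, so that a model is tested on unseen voices. The sketch below shows one possible way to do that with a deterministic hash of a speaker identifier. The function `assign_split`, the `speaker_id` values, and the clip names are hypothetical and not part of the dataset's documented format.

```python
import hashlib


def assign_split(speaker_id, test_fraction=0.2):
    """Deterministically map a speaker to 'train' or 'test'.

    Hashing the speaker ID keeps every clip from one speaker in the same split.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical (speaker_id, clip_name) pairs; not real dataset entries.
clips = [("spk_001", "a.wav"), ("spk_001", "b.wav"), ("spk_002", "c.wav")]
splits = {clip: assign_split(spk) for spk, clip in clips}
```

Because the assignment depends only on the speaker identifier, rerunning it on a larger or reordered file list leaves existing speakers in the same split.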
The dataset combines broad regional accent coverage with manually validated audio, serving AI developers, researchers, and enterprises that build voice-based applications.