Kiuyha/surabaya-ner-dataset
Source: Hugging Face. Updated 2025-12-06; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Kiuyha/surabaya-ner-dataset
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: start
      dtype: int64
  splits:
  - name: train
    num_bytes: 1935486.6313549015
    num_examples: 6577
  - name: validation
    num_bytes: 241899.04378496716
    num_examples: 822
  - name: test
    num_bytes: 242193.32486013137
    num_examples: 823
  download_size: 1397404
  dataset_size: 2419579
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- text-classification
language:
- id
size_categories:
- 1K<n<10K
---
# Surabaya Opinion & Complaint NER Dataset
## Dataset Description
This dataset contains labeled Named Entity Recognition (NER) data focusing on public opinions, complaints, and social issues in Surabaya, Indonesia. The data was scraped from social media platforms (Nitter/X and Reddit) and manually labeled for entities relevant to city administration and public sentiment analysis.
## Dataset Details
- **Language:** Indonesian (id)
- **License:** Apache-2.0
- **Task:** Token Classification (NER)
- **Tags:** ner, surabaya, social-media, complaints
- **Size:** 1K < n < 10K samples
## Dataset Statistics
The dataset is split into three parts:
| Split | Count |
|-------|-------|
| Train | 6,577 |
| Validation | 822 |
| Test | 823 |
| **Total** | **8,222** |
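As a quick sanity check on the table above, the following sketch recomputes the total and the share of each split. It uses only the counts listed in the table; the variable names are illustrative.

```python
# Split sizes copied from the statistics table.
split_counts = {"train": 6577, "validation": 822, "test": 823}

total = sum(split_counts.values())
proportions = {name: count / total for name, count in split_counts.items()}

for name, share in proportions.items():
    print(f"{name}: {split_counts[name]} examples ({share:.1%})")
```

The counts sum to 8,222 and correspond to roughly an 80/10/10 split.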
## Data Collection
The data was collected using specific keyword queries targeting common urban issues in Surabaya, such as traffic, flooding, public services, and crime. The source text includes informal Indonesian, Suroboyoan slang, and mixed-language text common on social media.
### Scraping Configuration
The following keywords and logic were used to gather the raw text from Nitter (X) and Reddit:
#### Nitter (X) Queries
1. **General Complaints:**
```
(Surabaya OR Suroboyo) (keluhan OR lapor OR aduan OR masalah OR parah OR buruk OR mengecewakan OR sulit OR lambat OR tidak beres) lang:id -filter:retweets
```
2. **Infrastructure & Traffic:**
```
(Surabaya OR Suroboyo) (macet OR jalanan rusak OR parkir liar OR angkot OR bemo OR "suroboyo bus" OR "traffic light" OR lampu merah OR trotoar) lang:id -filter:retweets
```
3. **Utilities (Water/Power):**
```
(Surabaya OR Suroboyo) (PLN OR listrik padam OR mati lampu OR PDAM OR air mati OR air keruh OR tagihan bengkak) lang:id -filter:retweets
```
4. **Flooding & Waste:**
```
(Surabaya OR Suroboyo) (banjir OR genangan OR sampah OR "bau tidak sedap" OR got mampet OR sungai kotor OR tumpukan sampah) lang:id -filter:retweets
```
5. **Public Services:**
```
(@sapawargasby OR @banggasurabaya OR pemkot sby OR kelurahan OR kecamatan) (layanan OR pengurusan OR e-ktp OR kk OR izin OR respon lambat) lang:id -filter:retweets
```
6. **Safety & Crime:**
```
(Surabaya OR Suroboyo) (aman OR tidak aman OR begal OR curanmor OR maling OR tawuran OR kejahatan OR gangster) lang:id -filter:retweets
```
7. **Positive Feedback:**
```
(Surabaya OR Suroboyo) (terima kasih OR keren OR mantap OR bagus OR apresiasi OR cepat OR solutif OR membantu) (@pemkotsby OR @sapawargasby OR layanan) lang:id -filter:retweets
```
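Most of the queries above share one shape: a group of location terms, one or more groups of topic terms, a language filter, and a retweet exclusion. The sketch below shows how such a string could be assembled. It is not the scraper used to build the dataset; `build_query` and its parameters are hypothetical names introduced for illustration.

```python
def build_query(location_terms, topic_terms, extra_groups=(), lang="id", exclude_retweets=True):
    """Assemble a search string in the same shape as the queries listed above.

    Hypothetical helper: it only joins strings and does not call any API.
    """
    def or_group(terms):
        # Terms are joined as given; quote multi-word phrases yourself if needed.
        return "(" + " OR ".join(terms) + ")"

    parts = [or_group(location_terms), or_group(topic_terms)]
    parts.extend(or_group(group) for group in extra_groups)
    parts.append(f"lang:{lang}")
    if exclude_retweets:
        parts.append("-filter:retweets")
    return " ".join(parts)


query = build_query(["Surabaya", "Suroboyo"], ["banjir", "genangan", "sampah"])
print(query)
```

Keeping the term lists separate from the string assembly makes it easier to add or adjust a topic group without retyping the filters.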
#### Reddit Queries
1. `surabaya traffic`
2. `surabaya flood`
3. `surabaya criminal`
4. `suroboyo`
## Usage
You can load this dataset directly using the Hugging Face `datasets` library:
```python
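from datasets import load_dataset

# Loads the train, validation, and test splits.
dataset = load_dataset("Kiuyha/surabaya-ner-dataset")

# Each record has a `text` string and a `label` list of {start, end, label} spans.
print(dataset['train'][0])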
```
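According to the dataset metadata, each record holds a `text` string and a `label` list whose items carry `start`, `end`, and `label` fields. The sketch below recovers the surface text of each annotated span. It assumes that `start` and `end` are character offsets into `text` with an exclusive `end`, which the card does not state. The example record and the label names `ISSUE` and `LOC` are made up for illustration, since the card does not document the label set.

```python
def extract_entities(record):
    """Return (surface_text, label) pairs for every span in one record.

    Assumption: start/end are character offsets, end exclusive.
    """
    text = record["text"]
    return [(text[span["start"]:span["end"]], span["label"]) for span in record["label"]]


# Hypothetical record: the text and label names are not taken from the dataset.
example = {
    "text": "Banjir di Jalan Mayjen Sungkono",
    "label": [
        {"start": 0, "end": 6, "label": "ISSUE"},
        {"start": 10, "end": 31, "label": "LOC"},
    ],
}

entities = extract_entities(example)
print(entities)
```

If the printed spans look cut off or shifted on real records, the offset convention probably differs from the assumption here and the slice should be adjusted.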
## Use Cases
- Named Entity Recognition model training for Indonesian social media text
- Public sentiment analysis for city administration
- Urban issue detection and classification
- Social media monitoring for local government
- Dialect-aware NLP research (Suroboyoan slang)
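For the NER training use case, span annotations usually need to be converted to per-token tags. The sketch below produces BIO tags using plain whitespace tokenization. It assumes character offsets with an exclusive `end`, and the example text and label names are hypothetical. A real pipeline would more likely align spans to a model tokenizer's offsets.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level spans to BIO tags over whitespace tokens.

    Assumption: each span has character offsets `start`/`end` (end exclusive).
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if their character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return [tok for tok, _, _ in tokens], tags


# Hypothetical example: the text and label names are not taken from the dataset.
words, bio_tags = spans_to_bio(
    "Banjir di Jalan Mayjen Sungkono",
    [
        {"start": 0, "end": 6, "label": "ISSUE"},
        {"start": 10, "end": 31, "label": "LOC"},
    ],
)
print(list(zip(words, bio_tags)))
```

Whitespace tokenization keeps the example short, but it attaches punctuation to neighbouring words, which matters for noisy social media text.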
## Citation
If you use this dataset in your research, please cite it appropriately and acknowledge the source.
## Contact
For questions or issues regarding this dataset, please open an issue on the dataset repository page.
Provider:
Kiuyha



