
Kiuyha/surabaya-ner-dataset

Source: Hugging Face. Updated 2025-12-06; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Kiuyha/surabaya-ner-dataset
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: start
      dtype: int64
  splits:
  - name: train
    num_bytes: 1935486.6313549015
    num_examples: 6577
  - name: validation
    num_bytes: 241899.04378496716
    num_examples: 822
  - name: test
    num_bytes: 242193.32486013137
    num_examples: 823
  download_size: 1397404
  dataset_size: 2419579
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- text-classification
language:
- id
size_categories:
- 1K<n<10K
---

# Surabaya Opinion & Complaint NER Dataset

## Dataset Description

This dataset contains labeled Named Entity Recognition (NER) data focusing on public opinions, complaints, and social issues in Surabaya, Indonesia. The data was scraped from social media platforms (Nitter/X and Reddit) and manually labeled for entities relevant to city administration and public sentiment analysis.

## Dataset Details

- **Language:** Indonesian (id)
- **License:** Apache-2.0
- **Task:** Token Classification (NER)
- **Tags:** ner, surabaya, social-media, complaints
- **Size:** 1K < n < 10K samples

## Dataset Statistics

The dataset is split into three parts:

| Split | Count |
|-------|-------|
| Train | 6,577 |
| Validation | 822 |
| Test | 823 |
| **Total** | **8,222** |

## Data Collection

The data was collected using specific keyword queries targeting common urban issues in Surabaya, such as traffic, flooding, public services, and crime. The source text includes informal Indonesian, Suroboyoan slang, and mixed-language text common on social media.

### Scraping Configuration

The following keywords and logic were used to gather the raw text from Nitter (X) and Reddit:

#### Nitter (X) Queries

1. **General Complaints:**
   ```
   (Surabaya OR Suroboyo) (keluhan OR lapor OR aduan OR masalah OR parah OR buruk OR mengecewakan OR sulit OR lambat OR tidak beres) lang:id -filter:retweets
   ```
2. **Infrastructure & Traffic:**
   ```
   (Surabaya OR Suroboyo) (macet OR jalanan rusak OR parkir liar OR angkot OR bemo OR "suroboyo bus" OR "traffic light" OR lampu merah OR trotoar) lang:id -filter:retweets
   ```
3. **Utilities (Water/Power):**
   ```
   (Surabaya OR Suroboyo) (PLN OR listrik padam OR mati lampu OR PDAM OR air mati OR air keruh OR tagihan bengkak) lang:id -filter:retweets
   ```
4. **Flooding & Waste:**
   ```
   (Surabaya OR Suroboyo) (banjir OR genangan OR sampah OR "bau tidak sedap" OR got mampet OR sungai kotor OR tumpukan sampah) lang:id -filter:retweets
   ```
5. **Public Services:**
   ```
   (@sapawargasby OR @banggasurabaya OR pemkot sby OR kelurahan OR kecamatan) (layanan OR pengurusan OR e-ktp OR kk OR izin OR respon lambat) lang:id -filter:retweets
   ```
6. **Safety & Crime:**
   ```
   (Surabaya OR Suroboyo) (aman OR tidak aman OR begal OR curanmor OR maling OR tawuran OR kejahatan OR gangster) lang:id -filter:retweets
   ```
7. **Positive Feedback:**
   ```
   (Surabaya OR Suroboyo) (terima kasih OR keren OR mantap OR bagus OR apresiasi OR cepat OR solutif OR membantu) (@pemkotsby OR @sapawargasby OR layanan) lang:id -filter:retweets
   ```

#### Reddit Queries

1. `surabaya traffic`
2. `surabaya flood`
3. `surabaya criminal`
4. `suroboyo`

## Usage

You can load this dataset directly using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("Kiuyha/surabaya-ner-dataset")
print(dataset['train'][0])
```

## Use Cases

- Named Entity Recognition model training for Indonesian social media text
- Public sentiment analysis for city administration
- Urban issue detection and classification
- Social media monitoring for local government
- Dialect-aware NLP research (Suroboyoan slang)

## Citation

If you use this dataset in your research, please cite it appropriately and acknowledge the source.

## Contact

For questions or issues regarding this dataset, please open an issue on the dataset repository page.
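
## Example: Converting Span Labels to BIO Tags

The metadata describes each record as a `text` string plus a `label` list whose items carry `start`, `end`, and `label` fields. Token classification models usually expect one tag per token, so a conversion step is often needed. The sketch below shows one possible way to do it. It rests on assumptions the card does not confirm: that `start` and `end` are character offsets into `text`, that `end` is exclusive, and that whitespace tokenization is adequate. The helper name `spans_to_bio`, the sample sentence, and the label name `LOC` are all made up for illustration and are not taken from the dataset.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-offset spans to BIO tags over whitespace tokens.

    Assumption: each span is a dict with `start`, `end` (exclusive) and
    `label`, mirroring the feature names in the dataset metadata.
    """
    # Record every whitespace-delimited token with its character range.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    for span in sorted(spans, key=lambda s: s["start"]):
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if their character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False

    return [tok for tok, _, _ in tokens], tags


# Invented record for illustration only; "LOC" is a hypothetical label name.
example = {
    "text": "Banjir di Jalan Mayjen Sungkono parah",
    "label": [{"start": 10, "end": 31, "label": "LOC"}],
}

tokens, tags = spans_to_bio(example["text"], example["label"])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Before relying on this, inspect a few real records to check the offset convention and the actual label set. If you train with a subword tokenizer, you would typically align spans using the tokenizer's own offset mapping rather than whitespace tokens.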
Provider:
Kiuyha