
TurkuNLP/hplt-social-media-registers

Hugging Face | Updated 2026-03-24 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/TurkuNLP/hplt-social-media-registers
Resource description:
---
language:
- en
- fi
- sv
license: cc-by-4.0
task_categories:
- text-classification
tags:
- register
- social-media
- HPLT
- web-crawl
- discourse-analysis
---

# Dataset Card: hplt-social-media-registers

This dataset contains 621,357 English, Finnish, and Swedish social media documents extracted from the [HPLT 2.0](https://hplt-project.org/datasets/v2.0) web corpus, enriched with web register labels and thematic subregister cluster labels. It was produced as part of Fin-CLARIAH deliverable D3.3.3 ("Machine Learning-Based Enrichment of Social Media") and is intended to support corpus linguistics research and downstream NLP on social media language varieties.

## Dataset Details

### Dataset Description

The dataset covers web documents with social media register characteristics from the [CORE web register taxonomy](https://turkunlp.org/register-annotation-docs/), including Interactive Discussion (ID), Narrative Blog (NA-nb), and Opinion Blog (NA-nb-OP) labels, as well as various hybrid combinations of these (e.g. `ds-IP-NA-nb`, `HI-NA-nb-re`). Documents were automatically classified using [TurkuNLP/web-register-classification-multilingual-bge](https://huggingface.co/TurkuNLP/web-register-classification-multilingual-bge) and then clustered into thematic subregisters (e.g., sports, dining, culture) using HDBSCAN on BGE-M3 embeddings. Cluster labels were assigned manually by the dataset creators.

- **Curated by:** Erik Henriksson, Tuomas Lundberg, Veronika Laippala (TurkuNLP, University of Turku)
- **Funded by:** Fin-CLARIAH (Research Council of Finland, grant no. 358720)
- **Language(s):** English (`en`), Finnish (`fi`), Swedish (`sv`)
- **License:** CC BY 4.0
- **Source corpus:** [HPLT 2.0](https://hplt-project.org/datasets/v2.0) (CC0)

### Dataset Sources

- **Repository:** https://github.com/TurkuNLP/social-media-enrichment-fin-clariah
- **Paper:** Henriksson et al.
(2024), [Automatic register identification for the open web using multilingual deep learning](https://arxiv.org/abs/2406.19892), arXiv:2406.19892
- **Demo:** [Google Colab tutorial](https://colab.research.google.com/drive/1VV_I0VNbIJbt_dbhB-WH5uZM92rF5o-8)
- **Deliverable overview:** https://www.kielipankki.fi/organization/fin-clariah/deliverables/fin-clariah2024_d3-3-3/

## Uses

### Direct Use

- Corpus linguistics research on social media language variation (register, topic, style)
- Training or evaluating downstream classifiers for social media subregister detection
- Qualitative and quantitative analysis of thematic content in web discourse
- Use as a labeled reference corpus for Finnish, Swedish, and English social media text

### Out-of-Scope Use

- This dataset covers only the social media subset of web registers; it is not representative of the full web register distribution
- The register and cluster labels are automatically assigned (ML-predicted), not manually verified at document level; do not treat them as gold-standard annotations
- Re-identification of individuals from the text is not an intended or appropriate use

## Dataset Structure

### Fields

| Field | Type | Description |
|---|---|---|
| `text` | string | The document text |
| `register` | string | CORE web register label (hierarchical, e.g. `NA-nb-OP`) |
| `embedding` | float64[] | BGE-M3 document embedding used for clustering (1024 dimensions) |
| `language` | string | ISO 639-1 language code (`en`, `fi`, `sv`) |
| `cluster_label` | string | Thematic subregister label (e.g. `sports`, `dining`); empty string if not thematically labeled |

### Register and Cluster Label Taxonomy

Documents belong to one of the following register × language combinations, with the named cluster labels shown. An empty `cluster_label` means the document belongs to a cluster that was not given a thematic name.

| Language | Register | Named cluster labels |
|---|---|---|
| `en` | `ID` | sports |
| `en` | `NA-nb` | comments |
| `en` | `NA-nb-OP` | culture, dining, lifestyle |
| `fi` | `ID` | sports |
| `fi` | `ID-NA` | comments |
| `fi` | `NA-nb` | comments |
| `fi` | `NA-nb-OP` | culture, consumption |
| `fi` | `NA-nb-OP-rv` | books, dining, beverages, cosmetics |
| `sv` | `ds-IP-NA-nb` | travel, contests |
| `sv` | `HI-NA-nb-re` | crafts |
| `sv` | `ID` | sports, help |
| `sv` | `ID-NA-nb` | comments |
| `sv` | `IN-NA-nb` | organizations |
| `sv` | `NA-nb` | comments |
| `sv` | `NA-nb-OP` | finance |
| `sv` | `NA-nb-OP-rv` | lifestyle, culture |
| `sv` | `NA-ob-OP` | sports |

See [the CORE register taxonomy documentation](https://turkunlp.org/register-annotation-docs/) for the meaning of the hierarchical register labels.

### Size

| Language | Documents | With cluster label |
|---|---|---|
| English | 91,633 | 4,559 (5.0%) |
| Finnish | 261,688 | 32,372 (12.4%) |
| Swedish | 268,036 | 30,919 (11.5%) |
| **Total** | **621,357** | **67,850 (10.9%)** |

## Dataset Creation

### Curation Rationale

The dataset was created to provide a large, reusable labeled social media corpus, with a particular focus on Finnish and Swedish (languages underrepresented in social media NLP resources), alongside English for cross-linguistic comparison. The labels enable both coarse-grained register analysis (which broad type of social media?) and fine-grained thematic analysis (what is the text about?). The dataset is a key output of Fin-CLARIAH D3.3.3 and is intended for use in corpus linguistic research and NLP model development.

### Source Data

#### Data Collection and Processing

Source documents were drawn from the [HPLT 2.0](https://hplt-project.org/datasets/v2.0) web crawl corpus for English, Finnish, and Swedish. The processing pipeline was:

1. **Register classification**: All documents were classified using [TurkuNLP/web-register-classification-multilingual-bge](https://huggingface.co/TurkuNLP/web-register-classification-multilingual-bge), a multilingual BGE-M3-based web register classifier covering 25 CORE taxonomy labels.
2. **Social media filtering**: Documents predicted to belong to social media registers (Interactive Discussion, Narrative Blog, Opinion Blog, and their hybrids) were retained.
3. **Embedding**: Retained documents were embedded using BGE-M3 to produce dense vector representations.
4. **Clustering**: HDBSCAN clustering was applied to the embeddings within each register group to identify thematic subclusters.
5. **Cluster labeling**: Clusters were inspected manually and assigned human-readable thematic labels (e.g., `sports`, `dining`). Clusters that were too mixed or too small to label meaningfully were left with an empty `cluster_label`.

#### Who are the source data producers?

The source texts are web pages collected by the HPLT project via CommonCrawl. The original authors are the writers of those web pages. No demographic information about the source authors is available.

### Annotations

#### Annotation process

Thematic cluster labels were assigned by the dataset curators (Erik Henriksson and Veronika Laippala) by inspecting cluster contents and selecting descriptive labels. Register labels were assigned automatically by the BGE-M3 classifier without manual verification.

#### Who are the annotators?

- **Cluster label assignment:** Erik Henriksson and Veronika Laippala (TurkuNLP)
- **Register labels:** Automatically predicted by [TurkuNLP/web-register-classification-multilingual-bge](https://huggingface.co/TurkuNLP/web-register-classification-multilingual-bge)
- **Pipeline implementation:** Tuomas Lundberg and Erik Henriksson (TurkuNLP)

#### Personal and Sensitive Information

The dataset is derived from publicly available web pages and may contain personal names, contact details, or other personally identifiable information consistent with the broader HPLT 2.0 corpus. No deliberate anonymization was applied beyond what is present in the HPLT 2.0 source.

## Bias, Risks, and Limitations

- **Automated labeling:** Register labels are ML predictions, not manual annotations, and will contain some errors. Users should account for classifier noise, especially for documents with low-confidence predictions or hybrid register profiles.
- **Language imbalance:** Finnish (261,688) and Swedish (268,036) are roughly balanced, while English (91,633) is substantially smaller. Models trained on this data may generalize unevenly across languages.
- **English cluster coverage:** Only 5.0% of English documents have a non-empty `cluster_label`, compared to ~12% for Finnish and Swedish. Thematic subregister analysis is therefore much more limited for English.
- **Register imbalance:** Narrative Blog (`NA-nb`) accounts for ~79% of all documents (490,039 of 621,357). The remaining registers (Interactive Discussion, Opinion Blog, and hybrid types) are represented at much smaller scale.
- **Cluster label skew:** Among labeled documents, `comments` is by far the most frequent cluster label (47,961 of 67,850, ~71%). Other thematic labels such as `travel`, `contests`, and `finance` have fewer than 300 examples each.
- **Web crawl quality:** Source texts vary in quality, including near-duplicate content, boilerplate text, and encoding artefacts typical of web crawl corpora.
- **Incomplete cluster coverage:** Only 10.9% of documents have a thematic cluster label. Unlabeled clusters (`cluster_label = ""`) exist where the thematic content was too heterogeneous or sparse to characterize.
- **Register taxonomy scope:** The CORE taxonomy's application to Finnish and Swedish may not capture all culturally specific text varieties.

### Recommendations

Users should be aware that register labels are automatically predicted and not manually verified at document level. For high-stakes annotation tasks, downstream validation against manually labeled samples is recommended. The empty-string cluster label should be handled explicitly in any code that processes the `cluster_label` field.

## Citation

If you use this dataset, please cite:

**BibTeX:**

```bibtex
@misc{henriksson2024automaticregisteridentification,
  title={Automatic register identification for the open web using multilingual deep learning},
  author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellstr{\"o}m and Veronika Laippala},
  year={2024},
  eprint={2406.19892},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.19892}
}
```

**APA:**

Henriksson, E., Myntti, A., Eskelinen, A., Erten-Johansson, S., Hellström, S., & Laippala, V. (2024). *Automatic register identification for the open web using multilingual deep learning*. arXiv:2406.19892.

## Dataset Card Authors

Erik Henriksson, Tuomas Lundberg, Veronika Laippala (TurkuNLP, University of Turku)

## Dataset Card Contacts

- Erik Henriksson: [erikhenriksson](https://huggingface.co/erikhenriksson) (Hugging Face) | TurkuNLP, University of Turku
- Tuomas Lundberg: [tuomaslundberg](https://huggingface.co/tuomaslundberg) (Hugging Face) | TurkuNLP, University of Turku
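
The recommendation to handle the empty-string `cluster_label` explicitly could look roughly like the sketch below. It assumes records are plain dictionaries carrying the `language` and `cluster_label` fields described in the card. The helper names `normalize_cluster_label` and `labeled_share_by_language` and the toy rows are hypothetical illustrations, not part of the dataset or its repository. Treating whitespace-only labels as unlabeled is also an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional, Tuple


def normalize_cluster_label(raw: Optional[str]) -> Optional[str]:
    """Map the empty-string sentinel to None so "unlabeled" is explicit.

    Whitespace-only values are also treated as unlabeled (an assumption).
    """
    if raw is None:
        return None
    label = raw.strip()
    return label or None


def labeled_share_by_language(
    records: Iterable[Mapping[str, str]],
) -> Dict[str, Tuple[int, int, float]]:
    """Return {language: (labeled_count, total_count, labeled_fraction)}."""
    totals: Counter = Counter()
    labeled: Counter = Counter()
    for rec in records:
        lang = rec["language"]
        totals[lang] += 1
        if normalize_cluster_label(rec.get("cluster_label")) is not None:
            labeled[lang] += 1
    return {
        lang: (labeled[lang], total, labeled[lang] / total)
        for lang, total in totals.items()
    }


# Toy rows that mimic the card's schema; they are made up, not real data.
sample = [
    {"text": "...", "register": "ID", "language": "en", "cluster_label": "sports"},
    {"text": "...", "register": "NA-nb", "language": "en", "cluster_label": ""},
    {"text": "...", "register": "NA-nb-OP", "language": "fi", "cluster_label": "culture"},
    {"text": "...", "register": "NA-nb", "language": "sv", "cluster_label": ""},
]

shares = labeled_share_by_language(sample)
for lang, (n_labeled, n_total, fraction) in sorted(shares.items()):
    print(f"{lang}: {n_labeled}/{n_total} labeled ({fraction:.1%})")
```

On the full dataset, the per-language fractions should land near the figures in the Size table (about 5% for English, about 12% for Finnish and Swedish). That makes this a quick check that empty labels are being treated as missing rather than as a category of their own.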
Provider:
TurkuNLP