register_oscar
ModelScope community. Updated 2025-12-05; indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/TurkuNLP/register_oscar
Resource description:
# Dataset Card for register_oscar
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
### Dataset Summary
The Register Oscar dataset is a multilingual dataset containing documents in languages from the Oscar dataset, each tagged with register information.
Documents are labeled with eight main-level registers:
* Narrative (NA)
* Informational Description (IN)
* Opinion (OP)
* Interactive Discussion (ID)
* How-to/Instruction (HI)
* Informational Persuasion (IP)
* Lyrical (LY)
* Spoken (SP)
For further description of the labels, see Douglas Biber and Jesse Egbert. 2018. Register Variation Online.
Code used to tag Register Oscar can be found at https://github.com/TurkuNLP/register-labeling
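The register codes listed above can be kept in a small lookup table when working with the labels. The following is a minimal sketch, not part of the dataset's own tooling: `REGISTER_NAMES` and `describe_labels` are hypothetical names introduced here, and passing unknown codes through unchanged is an assumption made in case a file contains codes beyond the eight main-level ones.

```python
# Main-level register codes and names, copied from the list above.
REGISTER_NAMES = {
    "NA": "Narrative",
    "IN": "Informational Description",
    "OP": "Opinion",
    "ID": "Interactive Discussion",
    "HI": "How-to/Instruction",
    "IP": "Informational Persuasion",
    "LY": "Lyrical",
    "SP": "Spoken",
}


def describe_labels(labels):
    """Map register codes to readable names.

    Codes missing from REGISTER_NAMES are returned unchanged (an assumption,
    so that unexpected codes do not raise an error).
    """
    return [REGISTER_NAMES.get(code, code) for code in labels]


print(describe_labels(["NA", "OP"]))  # ['Narrative', 'Opinion']
```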
### Languages
The dataset currently contains the following languages: Arabic, Bengali, Catalan, English, Spanish, Basque, French, Hindi, Indonesian, Portuguese, Swahili, Urdu, Vietnamese and Chinese.
For further information on the languages and data, see https://huggingface.co/datasets/oscar
## Dataset Structure
### Data Instances
```
{"id": "0", "labels": ["NA"], "text": "Zarif: Iran inajua mpango wa Saudia wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran\n"}
```
### Data Fields
* id: unique id of the document (from the Oscar dataset)
* labels: the list of labels assigned to the text
* text: the original text of the document (as appears in the Oscar dataset)
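The three fields can be read with the Python standard library alone. The sketch below assumes the data is stored as JSON Lines (one object per line, shaped like the instance shown in Data Instances); the on-disk layout is not stated in this card, so treat that as an assumption. `RegisterDoc`, `parse_records` and `with_register` are hypothetical helper names, and the only record used is the sample from this card.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class RegisterDoc:
    id: str            # unique id of the document (from the Oscar dataset)
    labels: List[str]  # register labels assigned to the text
    text: str          # original text of the document


def parse_records(lines: Iterable[str]) -> Iterator[RegisterDoc]:
    """Parse JSON Lines input (assumed format), skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        yield RegisterDoc(id=obj["id"], labels=list(obj["labels"]), text=obj["text"])


def with_register(docs: Iterable[RegisterDoc], code: str) -> List[RegisterDoc]:
    """Keep only the documents whose label list contains the given code."""
    return [doc for doc in docs if code in doc.labels]


# The sample instance from this card, written as one JSON line.
sample = [
    '{"id": "0", "labels": ["NA"], "text": "Zarif: Iran inajua mpango wa Saudia '
    'wa kufanya mauaji ya kigaidi dhidi ya maafisa wa ngazi za juu wa Iran\\n"}',
]

docs = list(parse_records(sample))
narrative = with_register(docs, "NA")
print(len(narrative), narrative[0].id)
```

Because `labels` is a list, a document may carry more than one register, which is why the filter tests membership and not equality.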
### Citing
```
@inproceedings{laippala-etal-2022-towards,
title = "Towards better structured and less noisy Web data: Oscar with Register annotations",
author = {Laippala, Veronika and
Salmela, Anna and
R{\"o}nnqvist, Samuel and
Aji, Alham Fikri and
Chang, Li-Hsin and
Dhifallah, Asma and
Goulart, Larissa and
Kortelainen, Henna and
P{\`a}mies, Marc and
Prina Dutra, Deise and
Skantsi, Valtteri and
Sutawika, Lintang and
Pyysalo, Sampo},
booktitle = "Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.wnut-1.23",
pages = "215--221",
abstract = {Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification -whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by R{\"o}nnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remains of boilerplate and other elements not belonging to the main text of the Web page. The register labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register{\_}oscar.},
}
```
Provider:
maas
Created:
2025-08-08



