mlenjoyneer/RuTextSegWiki
Source: Hugging Face. Updated 2023-09-03; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/mlenjoyneer/RuTextSegWiki
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- ru
size_categories:
- 10K<n<100K
license:
- unknown
multilinguality:
- monolingual
source_datasets:
- original
---
# Dataset Card
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
  - [Usage](#usage)
  - [Other datasets](#other-datasets)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
### Dataset Summary
A dataset for automatic text segmentation of Russian Wikipedia articles. The text corpus is based on the May 2023 Wikipedia dump. The markup was generated automatically using two methods: taking texts that already come divided into paragraphs, and randomly joining parts of different texts.
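To make the labeling idea concrete, here is a minimal sketch of how 0/1 boundary labels could be derived once a text is available as a list of segments, whether those segments are existing paragraphs or parts taken from different texts. This is an illustration only, not the authors' generation pipeline: the function name `labels_from_segments` is hypothetical, and labeling the very first sentence as 1 is an assumption.

```python
from typing import List, Tuple


def labels_from_segments(segments: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten segments into sentences and mark each segment start with 1."""
    sentences: List[str] = []
    labels: List[int] = []
    for segment in segments:
        for position, sentence in enumerate(segment):
            sentences.append(sentence)
            # Assumption: the first sentence of every segment, including the
            # very first one in the sample, is labeled 1.
            labels.append(1 if position == 0 else 0)
    return sentences, labels


# Toy illustration of joining parts of two different texts (made-up data).
part_from_text_a = ["A1.", "A2."]
part_from_text_b = ["B1.", "B2.", "B3."]
joined_sentences, joined_labels = labels_from_segments(
    [part_from_text_a, part_from_text_b]
)
```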
### Supported Tasks and Leaderboards
The dataset is designed for the text segmentation task.
### Languages
The dataset is in Russian.
### Usage
```python
from datasets import load_dataset
dataset = load_dataset('mlenjoyneer/RuTextSegWiki')
```
### Other datasets
mlenjoyneer/RuTextSegNews: a similar dataset based on news corpora.
## Dataset Structure
### Data Instances
Each instance contains a list of strings holding the sentences of the text, a list of ints holding the labels (1 marks the start of a new topic, 0 marks a continuation of the previous topic), and a string naming the sample generation method (`base` or `random_joining`).
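The label scheme above can be turned back into topical segments with a short helper. The sketch below works on plain Python lists and deliberately does not assume the dataset's column names, which this card does not state; `group_into_segments` is a hypothetical helper, not part of any library.

```python
from typing import List


def group_into_segments(sentences: List[str], labels: List[int]) -> List[List[str]]:
    """Group sentences into topical segments using 0/1 boundary labels."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    segments: List[List[str]] = []
    for sentence, label in zip(sentences, labels):
        # Label 1 opens a new segment. The first sentence always opens one,
        # even if it happens to be labeled 0.
        if label == 1 or not segments:
            segments.append([sentence])
        else:
            segments[-1].append(sentence)
    return segments


# Toy data (not taken from the dataset).
example_sentences = ["A1.", "A2.", "B1.", "B2.", "B3."]
example_labels = [1, 0, 1, 0, 0]
example_segments = group_into_segments(example_sentences, example_labels)
```

For a loaded instance, pass its sentence list and label list to the helper after checking the actual column names, for example via `dataset['train'].column_names`.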
### Data Splits
| Dataset Split | Number of Instances in Split |
|:---------|:---------|
| Train | 20000 |
| Test | 4000 |
## Additional Information
### Licensing Information
In progress
### Citation Information
```bibtex
In progress
```
Provider: mlenjoyneer



