tugstugi/eduge
Hugging Face | Updated 2024-01-18 | Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/tugstugi/eduge
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- mn
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: Eduge
dataset_info:
  features:
  - name: news
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': урлаг соёл
          '1': эдийн засаг
          '2': эрүүл мэнд
          '3': хууль
          '4': улс төр
          '5': спорт
          '6': технологи
          '7': боловсрол
          '8': байгал орчин
  splits:
  - name: train
    num_bytes: 255275842
    num_examples: 60528
  - name: test
    num_bytes: 64451731
    num_examples: 15133
  download_size: 320395067
  dataset_size: 319727573
---
# Dataset Card for Eduge
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** http://eduge.mn/
- **Repository:** https://github.com/tugstugi/mongolian-nlp
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
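The Eduge news classification dataset was provided by Bolorsoft LLC and is used to train the production news classifier of Eduge.mn.
It contains about 75K news articles (75,661 across the train and test splits) in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education), and байгал орчин (environment).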
### Supported Tasks and Leaderboards
- `text-classification`: The dataset can be used for a 9-class topic classification task, in which each news article is assigned exactly one category.
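
The card does not name an evaluation metric for this task. As a minimal sketch, assuming accuracy and macro-averaged F1 are acceptable choices for a 9-class problem, predictions over integer label ids could be scored as follows. The function names `accuracy` and `macro_f1` are illustrative and not part of the dataset or any library.

```python
from collections import Counter

NUM_CLASSES = 9  # the card lists 9 categories, with ids 0..8


def accuracy(y_true, y_pred):
    """Fraction of articles whose predicted label id equals the gold label id."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty list of predictions")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred, num_classes=NUM_CLASSES):
    """Unweighted mean of per-class F1.

    A class that never appears in y_true or y_pred contributes 0.0
    (an assumption of this sketch, other conventions exist).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the gold label but missed
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes
```

Macro averaging weights every category equally, which may matter if the categories are not balanced; the card does not state the class distribution.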
### Languages
The text in the dataset is in Mongolian (`mn`).
## Dataset Structure
### Data Instances
For the `default` configuration:
```
{
'label': 0, # 'урлаг соёл'
'news': 'Шударга өрсөлдөөн, хэрэглэгчийн төлөө газар 2013 оны дөрөвдүгээр сараас эхлэн Монгол киноны ашиг орлогын мэдээллийг олон нийтэд хүргэж байгаа. Ингэснээр Монголын кино үйлдвэрлэгчид улсад ашиг орлогоо шударгаар төлөх, мөн чанартай уран бүтээлийн тоо өсөх боломж бүрдэж байгаа юм.',
}
```
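
A record like the one above could be fetched and inspected roughly as follows. This is a sketch under two assumptions: that the Hugging Face `datasets` library is installed, and that the Hub id is `tugstugi/eduge`, as in the page title. `load_eduge` and `preview` are hypothetical helper names introduced for illustration.

```python
def load_eduge(split="train"):
    """Load one split from the Hugging Face Hub (needs `datasets` and network access)."""
    # Imported lazily so that `preview` below works without the library installed.
    from datasets import load_dataset

    # Assumption: the Hub id matches the page title, "tugstugi/eduge".
    return load_dataset("tugstugi/eduge", split=split)


def preview(example, width=60):
    """Return a one-line summary of a record with `news` and `label` keys."""
    text = example["news"]
    snippet = text if len(text) <= width else text[:width].rstrip() + "..."
    return f"label={example['label']} news={snippet!r}"
```

With network access, `print(preview(load_eduge("test")[0]))` would show the first test record in truncated form.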
### Data Fields
- `news`: a complete news article on a specific topic as a string
- `label`: the single class of the topic, among these values: "урлаг соёл" (0), "эдийн засаг" (1), "эрүүл мэнд" (2), "хууль" (3), "улс төр" (4), "спорт" (5), "технологи" (6), "боловсрол" (7), "байгал орчин" (8).
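
The id-to-name mapping listed above can be written out as a small lookup table, for instance to turn model outputs back into category names. The names `LABEL_NAMES`, `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative; the strings are copied from the card's metadata as-is.

```python
# Category names in label-id order, exactly as spelled in the card's metadata.
LABEL_NAMES = [
    "урлаг соёл",    # 0: arts and culture
    "эдийн засаг",   # 1: economy
    "эрүүл мэнд",    # 2: health
    "хууль",         # 3: law
    "улс төр",       # 4: politics
    "спорт",         # 5: sports
    "технологи",     # 6: technology
    "боловсрол",     # 7: education
    "байгал орчин",  # 8: environment
]

ID2LABEL = dict(enumerate(LABEL_NAMES))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(label_id):
    """Map an integer label id to its category name, rejecting unknown ids."""
    try:
        return ID2LABEL[label_id]
    except KeyError:
        raise ValueError(f"unknown label id: {label_id!r}") from None
```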
### Data Splits
The set of complete articles is split into a training set (60,528 examples) and a test set (15,133 examples), roughly an 80/20 split.
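
The split proportions can be checked from the `num_examples` values in the card's metadata. The snippet below is a small sketch of that arithmetic; `SPLITS` and `split_fractions` are illustrative names.

```python
# Example counts taken from the `splits` entries in the card's metadata.
SPLITS = {"train": 60528, "test": 15133}


def split_fractions(splits):
    """Return the total example count and each split's share of it."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


total, fractions = split_fractions(SPLITS)
for name, share in fractions.items():
    print(f"{name}: {SPLITS[name]} examples ({share:.1%})")
```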
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
The articles come from Eduge.mn, which combines news from shuud.mn, ikon.mn, olloo.mn, news.gogo.mn, montsame.mn, zaluu.com, sonin.mn, medee.mn, and bloombergtv.mn.
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[Needs More Information]
### Citation Information
No citation available for this dataset.
### Contributions
Thanks to [@enod](https://github.com/enod) for adding this dataset.
Provider:
tugstugi