
tugstugi/eduge

Hugging Face, updated 2024-01-18, indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/tugstugi/eduge
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- mn
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: Eduge
dataset_info:
  features:
  - name: news
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': урлаг соёл
          '1': эдийн засаг
          '2': эрүүл мэнд
          '3': хууль
          '4': улс төр
          '5': спорт
          '6': технологи
          '7': боловсрол
          '8': байгал орчин
  splits:
  - name: train
    num_bytes: 255275842
    num_examples: 60528
  - name: test
    num_bytes: 64451731
    num_examples: 15133
  download_size: 320395067
  dataset_size: 319727573
---

# Dataset Card for Eduge

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://eduge.mn/
- **Repository:** https://github.com/tugstugi/mongolian-nlp
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

### Dataset Summary

The Eduge news classification dataset is provided by Bolorsoft LLC and is used to train the Eduge.mn production news classifier. It contains 75K news articles in 9 categories: урлаг соёл, эдийн засаг, эрүүл мэнд, хууль, улс төр, спорт, технологи, боловсрол and байгал орчин.

### Supported Tasks and Leaderboards

- `text-classification`: We can transform the above into a 9-class classification task.

### Languages

The text in the dataset is in Mongolian.

## Dataset Structure

### Data Instances

For the `default` configuration:

```
{
 'label': 0, # 'урлаг соёл'
 'news': 'Шударга өрсөлдөөн, хэрэглэгчийн төлөө газар 2013 оны дөрөвдүгээр сараас эхлэн Монгол киноны ашиг орлогын мэдээллийг олон нийтэд хүргэж байгаа. Ингэснээр Монголын кино үйлдвэрлэгчид улсад ашиг орлогоо шударгаар төлөх, мөн  чанартай уран бүтээлийн тоо өсөх боломж бүрдэж байгаа юм.',
}
```

### Data Fields

- `news`: a complete news article on a specific topic, as a string
- `label`: the single class of the topic, among these values: "урлаг соёл" (0), "эдийн засаг" (1), "эрүүл мэнд" (2), "хууль" (3), "улс төр" (4), "спорт" (5), "технологи" (6), "боловсрол" (7), "байгал орчин" (8).

### Data Splits

The set of complete articles is split into a training set and a test set.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

Eduge.mn, which combines news from shuud.mn, ikon.mn, olloo.mn, news.gogo.mn, montsame.mn, zaluu.com, sonin.mn, medee.mn, bloombergtv.mn.

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

No citation available for this dataset.

### Contributions

Thanks to [@enod](https://github.com/enod) for adding this dataset.
Provider:
tugstugi
Summary of original information

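Dataset Card for Eduge

Dataset Description

Dataset Summary

The Eduge news classification dataset is provided by Bolorsoft LLC and is used to train the Eduge.mn production news classifier. It contains 75K news articles in 9 categories: arts and culture, economy, health, law, politics, sports, technology, education, and environment.

Supported Tasks and Leaderboards

  • text-classification: the data can be framed as a 9-class classification task.

Languages

The text in the dataset is in Mongolian.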

Dataset Structure

Data Instances

For the default configuration:

{
  'label': 0,  # arts and culture
  'news': 'Шударга өрсөлдөөн, хэрэглэгчийн төлөө газар 2013 оны дөрөвдүгээр сараас эхлэн Монгол киноны ашиг орлогын мэдээллийг олон нийтэд хүргэж байгаа. Ингэснээр Монголын кино үйлдвэрлэгчид улсад ашиг орлогоо шударгаар төлөх, мөн  чанартай уран бүтээлийн тоо өсөх боломж бүрдэж байгаа юм.',
}

Data Fields

  • news: a complete news article on a specific topic, as a string.
  • label: the single class of the topic, one of: "arts and culture" (0), "economy" (1), "health" (2), "law" (3), "politics" (4), "sports" (5), "technology" (6), "education" (7), "environment" (8).
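
The integer-to-name mapping of the label field can be written down as a small lookup table. The following is a minimal sketch in plain Python that uses the original Mongolian class names from the dataset card; the names `LABEL_NAMES`, `id2label`, `label2id`, and `decode_label` are illustrative assumptions and are not part of the dataset or of any library.

```python
# Class names in label-id order, copied from the dataset card.
LABEL_NAMES = [
    "урлаг соёл",    # 0: arts and culture
    "эдийн засаг",   # 1: economy
    "эрүүл мэнд",    # 2: health
    "хууль",         # 3: law
    "улс төр",       # 4: politics
    "спорт",         # 5: sports
    "технологи",     # 6: technology
    "боловсрол",     # 7: education
    "байгал орчин",  # 8: environment
]

# Hypothetical helper tables (assumed names, not shipped with the dataset).
id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}


def decode_label(record: dict) -> str:
    """Return the class name for a record shaped like {'news': str, 'label': int}."""
    return id2label[record["label"]]


# A record with the same shape as the data instance; article text omitted here.
example = {"label": 0, "news": ""}
print(decode_label(example))
```

A table like this is mainly useful when inspecting predictions, since a classifier trained on the data would typically output the integer ids rather than the names.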

Data Splits

The set of complete articles is split into a training set and a test set.
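
The card's metadata lists 60528 training examples and 15133 test examples. As a quick sanity check, the sketch below recomputes the total and the share of each split from those two numbers; the variable names are assumptions chosen for illustration.

```python
# Split sizes as listed in the dataset card metadata (num_examples).
SPLIT_SIZES = {"train": 60528, "test": 15133}

total = sum(SPLIT_SIZES.values())
fractions = {name: n / total for name, n in SPLIT_SIZES.items()}

for name, n in SPLIT_SIZES.items():
    print(f"{name}: {n} examples ({fractions[name]:.1%})")
print(f"total: {total} examples")
```

The total of 75661 matches the "75K" figure in the summary, and the split works out to roughly 80% training and 20% test.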

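Dataset Creation

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

Eduge.mn, which combines news from shuud.mn, ikon.mn, olloo.mn, news.gogo.mn, montsame.mn, zaluu.com, sonin.mn, medee.mn, bloombergtv.mn.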

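Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]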

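Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]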

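Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

No citation available for this dataset.

Contributions

Thanks to @enod for adding this dataset.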
