cahya/id_panl_bppt

Name: cahya/id_panl_bppt
Creator: cahya
Published: 2024-01-18 11:06:12
License: No description available

Hugging Face: updated 2024-01-18, indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/cahya/id_panl_bppt


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
- id
license:
- unknown
multilinguality:
- translation
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: IdPanlBppt
dataset_info:
  features:
  - name: id
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - id
  - name: topic
    dtype:
      class_label:
        names:
          '0': Economy
          '1': International
          '2': Science
          '3': Sport
  config_name: id_panl_bppt
  splits:
  - name: train
    num_bytes: 7455924
    num_examples: 24021
  download_size: 2366973
  dataset_size: 7455924
---

# Dataset Card for IdPanlBppt

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [PANL BPPT](http://digilib.bppt.go.id/sampul/p92-budiono.pdf)
- **Repository:** [PANL BPPT Repository](https://github.com/cahya-wirawan/indonesian-language-models/raw/master/data/BPPTIndToEngCorpusHalfM.zip)
- **Paper:** [Resource Report: Building Parallel Text Corpora for Multi-Domain Translation System](http://digilib.bppt.go.id/sampul/p92-budiono.pdf)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Parallel text corpora for a multi-domain translation system, created by BPPT (Indonesian Agency for the Assessment and Application of Technology) for the PAN Localization Project (A Regional Initiative to Develop Local Language Computing Capacity in Asia). The dataset contains around 24K sentences divided into four topics (Economic, International, Science and Technology, and Sport).

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

English (`en`) and Indonesian (`id`)

## Dataset Structure

[More Information Needed]

### Data Instances

An example of the dataset:

```
{
  'id': '0',
  'topic': 0,
  'translation': {
    'en': 'Minister of Finance Sri Mulyani Indrawati said that a sharp correction of the composite inde x by up to 4 pct in Wedenesday?s trading was a mere temporary effect of regional factors like decline in plantation commodity prices and the financial crisis in Thailand.',
    'id': 'Menteri Keuangan Sri Mulyani mengatakan koreksi tajam pada Indeks Harga Saham Gabungan IHSG hingga sekitar 4 persen dalam perdagangan Rabu 10/1 hanya efek sesaat dari faktor-faktor regional seperti penurunan harga komoditi perkebunan dan krisis finansial di Thailand.'
  }
}
```

### Data Fields

- `id`: id of the sample
- `translation`: the parallel English-Indonesian sentence pair
- `topic`: the topic of the sentence. It is one of the following:
  - Economic
  - International
  - Science and Technology
  - Sport

### Data Splits

The dataset is described as split into train, validation and test sets, but the metadata above lists only a `train` split (24,021 examples).

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@inproceedings{id_panl_bppt,
  author = {PAN Localization - BPPT},
  title = {Parallel Text Corpora, English Indonesian},
  year = {2009},
  url = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
}
```

### Contributions

Thanks to [@cahya-wirawan](https://github.com/cahya-wirawan) for adding this dataset.
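
As an illustration of the record layout described under Data Instances and Data Fields, the following minimal Python sketch decodes one record into a sentence pair plus a topic name. The label order is taken from the `class_label` names in the card metadata. The helper `decode_record`, the `SentencePair` type, and the shortened sample record are hypothetical and not part of the dataset's own tooling.

```python
from typing import NamedTuple

# Topic names in class_label order, as listed in the card metadata.
TOPIC_NAMES = ["Economy", "International", "Science", "Sport"]


class SentencePair(NamedTuple):
    """One decoded record (hypothetical container for this sketch)."""
    sample_id: str
    english: str
    indonesian: str
    topic: str


def decode_record(record: dict) -> SentencePair:
    """Turn one raw record into a SentencePair with a readable topic name."""
    translation = record["translation"]
    return SentencePair(
        sample_id=record["id"],
        english=translation["en"],
        indonesian=translation["id"],
        topic=TOPIC_NAMES[record["topic"]],
    )


# Shortened stand-in for the example record; the sentences are truncated here.
sample = {
    "id": "0",
    "topic": 0,
    "translation": {
        "en": "Minister of Finance Sri Mulyani Indrawati said that ...",
        "id": "Menteri Keuangan Sri Mulyani mengatakan ...",
    },
}

pair = decode_record(sample)
print(pair.topic, "|", pair.indonesian, "->", pair.english)
```

Records of this shape would typically be obtained with the Hugging Face `datasets` library, for example `load_dataset("cahya/id_panl_bppt", split="train")`. That call needs network access and is not exercised in the sketch.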

Provided by:

cahya

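Summary of original information

Dataset overview

Basic dataset information

Name: IdPanlBppt
Languages: English (en), Indonesian (id)
License: unknown
Multilinguality: translation
Size: 10K<n<100K
Source datasets: original
Task categories: translation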

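Dataset structure

Features:
- id: string
- translation: translation feature containing English and Indonesian
- topic: class label with four topics: Economy, International, Science, Sport
Config name: id_panl_bppt
Splits:
- train: 24021 examples, 7455924 bytes
- Download size: 2366973 bytes
- Total dataset size: 7455924 bytes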
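
To put the size figures above in perspective, here is a small Python sketch that derives the average record size and the ratio of download size to dataset size. The three constants are copied from the split metadata; the variable names are chosen for this illustration only.

```python
# Figures copied from the split metadata above.
NUM_EXAMPLES = 24021
DATASET_SIZE_BYTES = 7455924
DOWNLOAD_SIZE_BYTES = 2366973

# Average size of one example (both sentences plus id and topic).
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size / dataset size: {download_ratio:.2f}")
```

Each example averages roughly 310 bytes, and the download is a little under a third of the dataset size.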

Data instances

Example:

{ id: 0, topic: 0, translation: { en: ..., id: ... } }

Data fields

id: sample ID
translation: parallel sentence pair in English and Indonesian
topic: topic of the sentence, one of Economy, International, Science, Sport
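
The three fields above can be checked mechanically before a record is fed into a translation pipeline. The Python sketch below is one possible validator under the assumption that `topic` is stored as an integer class index from 0 to 3, as in the card metadata. The function `schema_problems` and both sample records are hypothetical and written for this illustration.

```python
VALID_TOPIC_IDS = range(4)  # 0..3: Economy, International, Science, Sport
LANGUAGES = ("en", "id")


def schema_problems(record: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")
    translation = record.get("translation")
    if not isinstance(translation, dict):
        problems.append("translation must be a mapping")
    else:
        # Both language sides must be present for a usable sentence pair.
        for lang in LANGUAGES:
            if not isinstance(translation.get(lang), str):
                problems.append(f"translation.{lang} must be a string")
    topic = record.get("topic")
    if not isinstance(topic, int) or topic not in VALID_TOPIC_IDS:
        problems.append("topic must be an integer from 0 to 3")
    return problems


# Hypothetical records: one well-formed, one with three violations.
good = {"id": "0", "topic": 0, "translation": {"en": "Hello.", "id": "Halo."}}
bad = {"id": 7, "topic": 9, "translation": {"en": "Hello."}}

print(schema_problems(good))
print(schema_problems(bad))
```

Returning a list of messages instead of raising on the first failure makes it easier to report every issue in a record at once.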

Data splits

The dataset is split into training, validation, and test sets.
