cahya/id_panl_bppt
Hugging Face, updated 2024-01-18, indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/cahya/id_panl_bppt
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
- id
license:
- unknown
multilinguality:
- translation
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
pretty_name: IdPanlBppt
dataset_info:
  features:
  - name: id
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - id
  - name: topic
    dtype:
      class_label:
        names:
          '0': Economy
          '1': International
          '2': Science
          '3': Sport
  config_name: id_panl_bppt
  splits:
  - name: train
    num_bytes: 7455924
    num_examples: 24021
  download_size: 2366973
  dataset_size: 7455924
---
# Dataset Card for IdPanlBppt
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [PANL BPPT](http://digilib.bppt.go.id/sampul/p92-budiono.pdf)
- **Repository:** [PANL BPPT Repository](https://github.com/cahya-wirawan/indonesian-language-models/raw/master/data/BPPTIndToEngCorpusHalfM.zip)
- **Paper:** [Resource Report: Building Parallel Text Corpora for Multi-Domain Translation System](http://digilib.bppt.go.id/sampul/p92-budiono.pdf)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
A parallel text corpus for a multi-domain translation system, created by BPPT (the Indonesian Agency for the Assessment and
Application of Technology) for the PAN Localization Project (A Regional Initiative to Develop Local Language Computing
Capacity in Asia). The dataset contains around 24K parallel sentences divided into 4 topics (Economy, International,
Science and Technology, and Sport).
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Indonesian (`id`) and English (`en`)
## Dataset Structure
[More Information Needed]
### Data Instances
An example of the dataset:
```
{
  'id': '0',
  'topic': 0,
  'translation':
  {
    'en': 'Minister of Finance Sri Mulyani Indrawati said that a sharp correction of the composite
    inde x by up to 4 pct in Wedenesday?s trading was a mere temporary effect of regional factors like
    decline in plantation commodity prices and the financial crisis in Thailand.',
    'id': 'Menteri Keuangan Sri Mulyani mengatakan koreksi tajam pada Indeks Harga Saham Gabungan
    IHSG hingga sekitar 4 persen dalam perdagangan Rabu 10/1 hanya efek sesaat dari faktor-faktor regional
    seperti penurunan harga komoditi perkebunan dan krisis finansial di Thailand.'
  }
}
```
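A record like the one above could be fetched with the Hugging Face `datasets` library. The following is a minimal sketch, not a confirmed recipe: it assumes the repository id `cahya/id_panl_bppt` (taken from the download link), that `datasets` is installed, and that the repository can be loaded by the installed version of the library. The helper name `load_train_split` is hypothetical.

```python
# Assumption: the Hub repository id matches the path in the download link.
REPO_ID = "cahya/id_panl_bppt"


def load_train_split():
    """Download and return the train split (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


# Example use (commented out because it downloads data):
# train = load_train_split()
# print(train[0]["translation"]["en"])
```

Keeping the download inside a function means nothing is fetched until the caller asks for it.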
### Data Fields
- `id`: the ID of the sample
- `translation`: the parallel sentence pair in English (`en`) and Indonesian (`id`)
- `topic`: the topic of the sentence, stored as a class label. It is one of the following:
  - `0`: Economy
  - `1`: International
  - `2`: Science (Science and Technology)
  - `3`: Sport
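Because `topic` is stored as an integer class label, it usually has to be mapped back to a name before display. The sketch below shows one way to flatten a record into plain strings. The label names are the ones listed under `class_label` in the metadata of this card; the helper `describe` and the placeholder sentences are hypothetical and not part of the dataset.

```python
# Topic names as listed under class_label in the card's metadata.
TOPIC_NAMES = {0: "Economy", 1: "International", 2: "Science", 3: "Sport"}


def describe(record: dict) -> dict:
    """Flatten one record into plain strings (hypothetical helper)."""
    pair = record["translation"]
    return {
        "id": record["id"],
        "topic": TOPIC_NAMES[record["topic"]],
        "en": pair["en"],
        "id_text": pair["id"],  # Indonesian side; renamed to avoid clashing with the sample id
    }


# Shortened stand-in for a real record; the sentences are placeholders.
sample = {
    "id": "0",
    "topic": 0,
    "translation": {"en": "English sentence", "id": "Kalimat bahasa Indonesia"},
}
flat = describe(sample)
print(flat["topic"])
```

Note that the key `id` appears twice in a record with different meanings: at the top level it is the sample ID, and inside `translation` it is the Indonesian language code.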
### Data Splits
The metadata lists a single `train` split with 24,021 examples; no separate validation or test split is listed.
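The metadata at the top of this card lists only a `train` split of 24,021 examples, so held-out sets may need to be carved out locally. The following is a minimal sketch of a reproducible index-based split in plain Python. The function name `split_indices`, the 10% validation and test fractions, and the seed are all assumptions chosen for illustration, not values given by the dataset authors.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle range(n) reproducibly and cut it into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG keeps the global state untouched
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


# 24021 is the number of train examples reported in the metadata.
train_idx, val_idx, test_idx = split_indices(24021)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting on indices rather than on the records themselves keeps the sketch independent of how the data is loaded; the same index lists can then be used to select rows from any container.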
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@inproceedings{id_panl_bppt,
author = {PAN Localization - BPPT},
title = {Parallel Text Corpora, English Indonesian},
year = {2009},
url = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
}
```
### Contributions
Thanks to [@cahya-wirawan](https://github.com/cahya-wirawan) for adding this dataset.
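Provider:
cahya
Summary of original information
Dataset overview
Basic dataset information
- Name: IdPanlBppt
- Languages: English (en), Indonesian (id)
- License: unknown
- Multilinguality: translation
- Size: 10K<n<100K
- Source datasets: original
- Task category: translation
Dataset structure
- Features:
  - id: string
  - translation: translation feature containing English and Indonesian
  - topic: class label with four topics (Economy, International, Science, Sport)
- Config name: id_panl_bppt
- Data splits:
  - train: 24021 examples, 7455924 bytes
- Download size: 2366973 bytes
- Total dataset size: 7455924 bytes
Data instances
- Example:
  { id: 0, topic: 0, translation: { en: ..., id: ... } }
Data fields
- id: sample ID
- translation: parallel sentence pair in English and Indonesian
- topic: sentence topic, one of Economy, International, Science, Sport
Data splits
- Only a train split (24021 examples) is listed; no validation or test split appears in the metadata.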



