PetraAI/PetraAI
Source: Hugging Face. Updated 2023-09-14; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/PetraAI/PetraAI
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- token-classification
- table-question-answering
- question-answering
- zero-shot-classification
- translation
- summarization
- conversational
- feature-extraction
- text-generation
- text2text-generation
- fill-mask
- sentence-similarity
- text-to-speech
- automatic-speech-recognition
- audio-to-audio
- audio-classification
- voice-activity-detection
- depth-estimation
- image-classification
- object-detection
- image-segmentation
- text-to-image
- image-to-text
- image-to-image
- unconditional-image-generation
- video-classification
- reinforcement-learning
- robotics
- tabular-classification
- tabular-regression
- tabular-to-text
- table-to-text
- multiple-choice
- text-retrieval
- time-series-forecasting
- text-to-video
- visual-question-answering
- zero-shot-image-classification
- graph-ml
language:
- ar
- en
tags:
- chemistry
- biology
- finance
- legal
- music
- art
- code
- climate
- medical
pretty_name: PETRA
size_categories:
- 1M<n<10M
---
# PETRA
## Overview
PETRA is a multilingual dataset for training and evaluating AI systems on a diverse range of tasks across multiple modalities. It contains data in Arabic and English for tasks including translation, summarization, question answering, and more.
## Dataset Structure
- Data is separated by language into `/ar` and `/en` directories
- Within each language directory, data is separated by task into subdirectories
- Tasks include:
  - Translation
  - Summarization
  - Conversational
  - Feature extraction
  - Zero-shot classification
  - Text generation
  - Fill mask
  - Sentence similarity
  - Text-to-speech
  - Automatic speech recognition
  - Text classification
  - Token classification
  - Table question answering
  - Question answering
  - Text2text generation
  - Audio-to-audio
  - Audio classification
  - Voice activity detection
  - Depth estimation
  - Image classification
  - Object detection
  - Image segmentation
  - Text-to-image
  - Image-to-text
  - Image-to-image
  - Unconditional image generation
  - Reinforcement learning
  - Video classification
  - Robotics
  - Tabular classification
  - Tabular regression
  - Table-to-text
  - Multiple choice
  - Text retrieval
  - Tabular-to-text
  - Text-to-video
  - Time series forecasting
  - Visual question answering
  - Zero-shot image classification
  - Graph ML
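The layout described above (one directory per language, one subdirectory per task) can be sketched as a small path helper. This is only an illustration: the card does not document the actual subdirectory names, so the hyphenated task slugs, the `task_dir` function, and the `root` parameter are assumptions, with the slugs borrowed from the `task_categories` metadata.

```python
from pathlib import PurePosixPath

# Language codes come from the card. The task directory names are NOT
# documented there; hyphenated slugs are an assumption for illustration.
LANGUAGES = ("ar", "en")


def task_dir(language: str, task: str, root: str = "PetraAI") -> PurePosixPath:
    """Return the assumed directory for one language/task pair."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language {language!r}; expected one of {LANGUAGES}")
    # Normalize a display name such as "Question answering" to a slug.
    slug = task.strip().lower().replace(" ", "-")
    return PurePosixPath(root) / language / slug


print(task_dir("ar", "Question answering"))
print(task_dir("en", "translation"))
```

If the real repository uses different directory names, only the slug normalization line needs to change.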
## Dataset Tags
- code
- art
- chemistry
- biology
- finance
- legal
- music
- climate
- medical
## Dataset Size
1M < n < 10M samples
## Licenses
Apache 2.0
## Citation
If you use this dataset, please cite it as:
@article{PetraAI2022PetraAI,
title={PetraAI: A Massive Multilingual Dataset for Machine Learning},
author={First Last and First Last},
journal={arXiv},
year={2022},
url={https://huggingface.co/datasets/PetraAI/PetraAI}
}
## Contact
For any questions, please reach out to shadilytn@gmail.com.
# Dataset Cards
## What are Dataset Cards?
Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset’s main page. To inform users about how to responsibly use the data, it’s a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.
You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the `README.md` file.
## Dataset card metadata
The Hub renders a dataset repository's `README.md` as its dataset card. To control how the Hub displays the card, create a YAML section at the top of the README that defines the metadata. Start with a line of three dashes (`---`), list the relevant metadata, and close the section with another `---` line, as in the metadata block at the top of this card.
The metadata that you add to the dataset card enables certain interactions on the Hub. For example:
- Allow users to filter and discover datasets at https://huggingface.co/datasets.
- If you choose a license using one of the license keywords the Hub recognizes (for example, `apache-2.0`), the license will be displayed on the dataset page.
When creating a `README.md` file in a dataset repository on the Hub, use the Metadata UI to fill in the main metadata.
For the full list of metadata fields, see the detailed dataset card metadata specification.
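The `---`-delimited metadata section described above can be separated from the card body with a few lines of Python. The following is a minimal sketch, not the Hub's own parser: the `split_front_matter` helper is hypothetical, and it returns the metadata as raw YAML text rather than parsing it.

```python
def split_front_matter(readme_text: str) -> tuple[str, str]:
    """Split README text into (metadata_yaml, body).

    Returns an empty metadata string when the text does not start with a
    `---` line or when the closing `---` line is missing.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", readme_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            metadata = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return metadata, body
    return "", readme_text


# A shortened card using fields that appear in the PETRA metadata.
card = """---
license: apache-2.0
language:
- ar
- en
pretty_name: PETRA
---
# PETRA
"""

metadata, body = split_front_matter(card)
print(metadata)
print(body)
```

To turn the metadata text into a dictionary, it could then be passed to a YAML library, for example PyYAML's `yaml.safe_load`.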
### Dataset card creation guide
For a step-by-step guide on creating a dataset card, check out the Create a dataset card guide.
Reading through existing dataset cards, such as the ELI5 dataset card, is a great way to familiarize yourself with the common conventions.
### Linking a Paper
If the dataset card includes a link to a paper on arXiv, the Hub will extract the arXiv ID and include it in the dataset tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you:
- Visit the Paper page
- Filter for other models on the Hub that cite the same paper.
Read more about paper pages at https://huggingface.co/docs/hub/paper-pages.
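To make the `arxiv:<PAPER ID>` tag format concrete, here is a rough sketch of how an ID might be pulled out of a card's text. The Hub's actual extraction rules are not described here, so the regular expression, the `arxiv_tags` helper, and the placeholder ID `1234.56789` are all assumptions for illustration.

```python
import re

# Matches new-style arXiv identifiers such as 1234.56789, optionally followed
# by a version suffix like v2. Old-style IDs (e.g. hep-th/9901001) are not
# handled. This pattern is an assumption, not the Hub's documented behavior.
ARXIV_LINK = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def arxiv_tags(card_text: str) -> list[str]:
    """Return unique `arxiv:<PAPER ID>` tags in order of first appearance."""
    tags: list[str] = []
    for paper_id in ARXIV_LINK.findall(card_text):
        tag = f"arxiv:{paper_id}"
        if tag not in tags:
            tags.append(tag)
    return tags


# Placeholder links; the ID below does not refer to a real paper on purpose.
sample = (
    "See https://arxiv.org/abs/1234.56789 and "
    "https://arxiv.org/pdf/1234.56789v2 for details."
)
print(arxiv_tags(sample))
```

Both links in the sample point at the same placeholder ID, so the helper reports a single tag.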
Provider: PetraAI



