IVN-RIN/BioBERT_Italian

Name: IVN-RIN/BioBERT_Italian
Creator: IVN-RIN
Published: 2024-09-20 07:45:11
License: cc-by-sa-4.0

Source: Hugging Face. Updated 2024-09-20; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/IVN-RIN/BioBERT_Italian

Overview:

BioBERT-ITA is the Italian translation of the original BioBERT dataset, which consists of millions of abstracts from PubMed papers. Because no equivalent body of biomedical literature exists in Italian, the researchers machine-translated the PubMed abstracts to build an Italian biomedical corpus, which was then used to train the BioBIT model. Each record holds a single text string. The dataset has one split, train, with 17,203,146 examples totalling 27,319,024,484 bytes. Its task category is text generation, its language is Italian, its tags are medical and biology, and its size category is 1B<n<10B.

Provider:

IVN-RIN

Original information summary

Dataset overview

Basic information

Name: BioBERT-ITA
License: cc-by-sa-4.0

Dataset features

Features:
- text (dtype: string)

Dataset splits

Train split:
- num_examples: 17203146
- num_bytes: 27319024484

Dataset size

Download size: 14945984639 bytes
Dataset size: 27319024484 bytes
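
To put the raw byte counts above in perspective, the following is a minimal sketch that converts them into more readable figures. The constant and function names are illustrative and not part of the dataset; the numbers are copied from this card.

```python
# Figures copied from the dataset card.
NUM_EXAMPLES = 17_203_146
DATASET_BYTES = 27_319_024_484   # size of the prepared dataset
DOWNLOAD_BYTES = 14_945_984_639  # size of the files that are downloaded

GIB = 1024 ** 3  # bytes per gibibyte


def summarize_sizes():
    """Return size figures derived from the card's byte counts."""
    return {
        "dataset_gib": DATASET_BYTES / GIB,
        "download_gib": DOWNLOAD_BYTES / GIB,
        "bytes_per_example": DATASET_BYTES / NUM_EXAMPLES,
        "download_ratio": DOWNLOAD_BYTES / DATASET_BYTES,
    }


if __name__ == "__main__":
    for name, value in summarize_sizes().items():
        print(f"{name}: {value:.2f}")
```

Under these figures the dataset is roughly 25.4 GiB once prepared, the download is roughly 13.9 GiB (about 55% of the prepared size), and an average example is about 1,588 bytes.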

Configuration

Default configuration:
- data_files:
  - split: train
  - path: data/train-*
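
Given the single train split declared above, one way to inspect a few records without fetching the whole download is to stream them. This is a sketch, not the provider's documented procedure: it assumes the third-party `datasets` package is installed and that the Hugging Face hub is reachable, and the helper names `take_texts` and `preview_remote` are hypothetical.

```python
from itertools import islice

REPO_ID = "IVN-RIN/BioBERT_Italian"  # repository name as given on this card


def take_texts(records, n):
    """Return the `text` field of the first n records from any iterable."""
    return [record["text"] for record in islice(records, n)]


def preview_remote(n=3):
    """Stream the train split and return the first n texts.

    Assumption: `datasets` is installed and the hub is reachable.
    """
    from datasets import load_dataset  # lazy import of a third-party package

    # streaming=True yields records one by one instead of downloading all shards
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return take_texts(stream, n)
```

Calling `preview_remote(3)` would return up to three strings. `take_texts` works on any iterable of dictionaries with a `text` key, so it can be tried offline on a handful of hand-written records first.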

Task category

Text generation

Language

Italian

Size category

1B<n<10B

Statistics

Total tokens: 6.2 billion
Mean tokens per example: 359
Maximum tokens per example: 2132
Minimum tokens per example: 5
Standard deviation: 137
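
The statistics above can be cross-checked against the example count from the train split: the mean tokens per example multiplied by the number of examples should land near the reported total. The sketch below does that arithmetic; the names are illustrative, and the tokenizer behind the counts is not stated on this card.

```python
# Figures copied from the dataset card.
NUM_EXAMPLES = 17_203_146
MEAN_TOKENS_PER_EXAMPLE = 359
REPORTED_TOTAL_TOKENS = 6.2e9  # "6.2 billion", a rounded figure


def estimated_total_tokens():
    """Estimate the total token count as mean tokens times example count."""
    return NUM_EXAMPLES * MEAN_TOKENS_PER_EXAMPLE


def relative_gap():
    """Relative difference between the estimate and the reported total."""
    estimate = estimated_total_tokens()
    return abs(estimate - REPORTED_TOTAL_TOKENS) / REPORTED_TOTAL_TOKENS


if __name__ == "__main__":
    print(f"estimated total: {estimated_total_tokens():,}")
    print(f"relative gap: {relative_gap():.2%}")
```

The estimate comes to 6,175,929,414 tokens, within 1% of the reported 6.2 billion, which is consistent with both the mean and the total being rounded.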