scostiniano/storytelling_books_filipino
Source: Hugging Face | Updated: 2022-11-24 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/scostiniano/storytelling_books_filipino
Resource summary:
This dataset comprises 148 Filipino storybooks, totaling 5,005 sentences, 45,792 tokens, and 5,646 unique tokens. The NER model currently supports only Filipino and does not cover proper nouns, verbs, adjectives, or adverbs. Input data must be preprocessed; the preprocessing code will be uploaded to GitHub. Each data instance pairs a sequence of tokens with a sequence of labels of the same length. The dataset is split into a training set of 3,279 samples and a validation set of 1,244 samples.
Provider:
scostiniano
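Summary of original information
Filipino Storytelling Books dataset overview
Dataset description
The dataset contains 148 Filipino storybooks, totaling 5,005 sentences, 45,792 tokens, and 5,646 unique tokens. The NER model currently supports only Filipino and does not cover proper nouns, verbs, adjectives, or adverbs.
Language information
- Language: Filipino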
- BCP-47 code: listed as "Filipino" (the standard BCP-47 tag for Filipino is fil)
Dataset structure
Data instances
An example of the data structure, in JSON:
    [
      {
        "tokens": ["toot"],
        "tags": [3]
      },
      {
        "tokens": ["hindi", "ako", "yun", "boltu"],
        "tags": [3, 1, 3, 3]
      }
    ]
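The instance format above can be checked with a few lines of Python. This is a minimal sketch, not the dataset author's preprocessing code: the `summarize` helper is hypothetical, and it simply verifies that every instance has one tag per token and reports counts analogous to the corpus statistics (instances, tokens, unique tokens).

```python
import json

# The two example instances from the listing, as a JSON string.
raw = """[
  {"tokens": ["toot"], "tags": [3]},
  {"tokens": ["hindi", "ako", "yun", "boltu"], "tags": [3, 1, 3, 3]}
]"""


def summarize(instances):
    """Hypothetical helper: validate alignment and count tokens."""
    for i, inst in enumerate(instances):
        # Each token must have exactly one tag.
        if len(inst["tokens"]) != len(inst["tags"]):
            raise ValueError(f"instance {i}: tokens and tags differ in length")
    all_tokens = [tok for inst in instances for tok in inst["tokens"]]
    return {
        "instances": len(instances),
        "tokens": len(all_tokens),
        "unique_tokens": len(set(all_tokens)),
    }


instances = json.loads(raw)
stats = summarize(instances)
print(stats)
```

Run over the full dataset, the same counts could be compared against the figures reported in the description.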
Dataset fields
- tokens: a sequence of strings
- tags: a sequence of class labels with 7 classes: [Animals, Humans_Body, Natural_Environment, O, Objects, Transportation, Urban_Environment]
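Because the tags are stored as integers, reading an instance requires mapping each integer back to a class name. The sketch below assumes that a tag value is the zero-based index into the class list in the order given above; the listing does not state this explicitly, so treat the mapping (and the hypothetical `decode` helper) as an assumption to verify against the dataset's own metadata.

```python
# Class names in the order given in the field description.
# ASSUMPTION: a tag value is the zero-based index into this list.
LABELS = [
    "Animals",
    "Humans_Body",
    "Natural_Environment",
    "O",
    "Objects",
    "Transportation",
    "Urban_Environment",
]


def decode(tokens, tags, labels=LABELS):
    """Hypothetical helper: pair each token with its class name."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [(token, labels[tag]) for token, tag in zip(tokens, tags)]


# Second example instance from the listing.
decoded = decode(["hindi", "ako", "yun", "boltu"], [3, 1, 3, 3])
for token, label in decoded:
    print(f"{token}\t{label}")
```

Under that assumption, tag 3 corresponds to `O` (no entity) and tag 1 to `Humans_Body`.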
Dataset splits
- Training set: 3,279 samples
- Validation set: 1,244 samples
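One way to confirm the split sizes is to load the dataset with the Hugging Face `datasets` library and compare the row counts against the numbers above. This is a hedged sketch: the split names `train` and `validation` are assumptions, the helper names are hypothetical, and `load_and_check` needs network access and the `datasets` package, so it is defined but not called here.

```python
# Split sizes as reported in this listing.
# ASSUMPTION: the splits are named "train" and "validation".
EXPECTED_SIZES = {"train": 3279, "validation": 1244}


def check_split_sizes(actual_sizes, expected=EXPECTED_SIZES):
    """Return the names of splits whose size differs from the reported one."""
    return [
        name
        for name, size in expected.items()
        if actual_sizes.get(name) != size
    ]


def load_and_check(repo_id="scostiniano/storytelling_books_filipino"):
    """Download the dataset and list any splits that do not match."""
    # Imported lazily so check_split_sizes works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)  # DatasetDict keyed by split name
    actual = {name: split.num_rows for name, split in dataset.items()}
    return check_split_sizes(actual)


# Offline check using the reported numbers themselves.
mismatches = check_split_sizes({"train": 3279, "validation": 1244})
print(mismatches)
```

Calling `load_and_check()` returns an empty list when the downloaded splits match the reported sizes.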



