
sbunlp/hmblogs-v3

Source: Hugging Face. Updated 2024-02-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sbunlp/hmblogs-v3
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 45957987986
    num_examples: 16896817
  download_size: 21312867175
  dataset_size: 45957987986
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- fa
pretty_name: 'HmBlogs: A big general Persian corpus'
size_categories:
- 10M<n<100M
---

# HmBlogs: A big general Persian corpus

HmBlogs is a general Persian corpus collected from nearly 20 million blog posts over a period of 15 years, containing 6.8 billion tokens. This version is the **preprocessed version** of the dataset, prepared by the original authors and converted to a format that integrates with 🤗Datasets. To access the raw versions, visit the official link at http://nlplab.sbu.ac.ir/hmBlogs-v3 .

**Paper:** https://arxiv.org/abs/2111.02362 <br>
**Authors:** Hamzeh Motahari Khansari, Mehrnoush Shamsfard <br>
**Original Link:** http://nlplab.sbu.ac.ir/hmBlogs-v3/<br>

## Usage

This dataset can be used for masked/causal language modeling. You can load it as shown below:

```python
from datasets import load_dataset

# Load the whole dataset
dataset = load_dataset("sbunlp/hmblogs-v3", split="train")

# Load a portion by %
dataset = load_dataset("sbunlp/hmblogs-v3", split="train[:50%]")

# Load a custom shard
dataset = load_dataset("sbunlp/hmblogs-v3", data_files=["data/train-00000-of-00046.parquet", "data/train-00001-of-00046.parquet"])
```

# Citation

```cite
@article{DBLP:journals/corr/abs-2111-02362,
  author     = {Hamzeh Motahari Khansari and Mehrnoush Shamsfard},
  title      = {HmBlogs: {A} big general Persian corpus},
  journal    = {CoRR},
  volume     = {abs/2111.02362},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.02362},
  eprinttype = {arXiv},
  eprint     = {2111.02362},
  timestamp  = {Fri, 05 Nov 2021 15:25:54 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-02362.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
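Because the card lists a download size of about 21 GB, it may be convenient to inspect a few records before fetching everything. The following is a minimal sketch, assuming the `streaming=True` option of `load_dataset` is suitable for this repository; the helper names `stream_hmblogs` and `preview` are hypothetical and not part of the dataset card.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def stream_hmblogs() -> Iterator[Dict[str, Any]]:
    """Lazily iterate over the train split instead of downloading it in full."""
    # Imported lazily so that `preview` below works without the library installed.
    from datasets import load_dataset

    # streaming=True yields records on demand rather than materializing the split.
    return iter(load_dataset("sbunlp/hmblogs-v3", split="train", streaming=True))


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> List[str]:
    """Return the first `width` characters of the `text` field of the first `n` records."""
    return [record["text"][:width] for record in islice(records, n)]


# Example call (requires network access and the `datasets` package):
#   for snippet in preview(stream_hmblogs()):
#       print(snippet)
```

`preview` accepts any iterable of dictionaries with a `text` key, so it can be tried on a small in-memory list before pointing it at the streamed corpus.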
Provider:
sbunlp
Summary of original information

HmBlogs: A big general Persian corpus

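Dataset overview

  • Name: HmBlogs
  • Description: A general Persian corpus collected from nearly 20 million blog posts spanning 15 years and containing 6.8 billion tokens. This version was preprocessed by the original authors and converted to a format that integrates with 🤗Datasets.

Dataset information

  • Features:
    • text: string
  • Splits:
    • train: 16,896,817 examples, 45,957,987,986 bytes in total
  • Download size: 21,312,867,175 bytes
  • Dataset size: 45,957,987,986 bytes
  • Configs:
    • default: data files at data/train-*
  • Task category: text generation
  • Language: Persian
  • Pretty name: HmBlogs: A big general Persian corpus
  • Size category: 10M<n<100M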
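The raw byte counts above are easier to plan around once converted. The sketch below derives a few figures from the listed numbers only; the constant names are illustrative, and treating the download-to-dataset ratio as a compression ratio is an assumption about how the two sizes relate.

```python
# Figures copied from the dataset information listed above.
NUM_EXAMPLES = 16_896_817
DATASET_BYTES = 45_957_987_986
DOWNLOAD_BYTES = 21_312_867_175

GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


# Average size of one example once the dataset is prepared locally.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the prepared size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"dataset size:      {to_gib(DATASET_BYTES):.1f} GiB")
print(f"download size:     {to_gib(DOWNLOAD_BYTES):.1f} GiB")
print(f"avg bytes/example: {avg_bytes_per_example:.0f}")
print(f"download ratio:    {download_ratio:.2f}")
```

In round numbers this comes to roughly 42.8 GiB on disk, 19.8 GiB to download, and about 2.7 KB per example.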

Usage

  • The dataset can be loaded as follows:

    from datasets import load_dataset

    # Load the whole dataset
    dataset = load_dataset("sbunlp/hmblogs-v3", split="train")

    # Load a portion of the dataset by percentage
    dataset = load_dataset("sbunlp/hmblogs-v3", split="train[:50%]")

    # Load custom shards
    dataset = load_dataset("sbunlp/hmblogs-v3", data_files=["data/train-00000-of-00046.parquet", "data/train-00001-of-00046.parquet"])
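Writing shard paths by hand becomes tedious for more than a couple of files. The sketch below builds them from indices, assuming the train split consists of 46 Parquet shards named as in the example paths above (the count is inferred from the `-of-00046` suffix, not stated elsewhere). The helpers `shard_paths` and `load_shards` are hypothetical names.

```python
from typing import Iterable, List

# Assumption: shard count inferred from the "-of-00046" suffix in the example paths.
NUM_SHARDS = 46


def shard_paths(indices: Iterable[int], num_shards: int = NUM_SHARDS) -> List[str]:
    """Build Parquet shard paths such as data/train-00000-of-00046.parquet."""
    paths = []
    for i in indices:
        if not 0 <= i < num_shards:
            raise ValueError(f"shard index {i} is outside 0..{num_shards - 1}")
        paths.append(f"data/train-{i:05d}-of-{num_shards:05d}.parquet")
    return paths


def load_shards(indices: Iterable[int]):
    """Load only the selected shards (requires network access and `datasets`)."""
    # Imported lazily so that `shard_paths` can be used on its own.
    from datasets import load_dataset

    return load_dataset(
        "sbunlp/hmblogs-v3",
        data_files=shard_paths(indices),
        split="train",
    )
```

For instance, `shard_paths([0, 1])` reproduces the two paths used in the custom-shard example, and `load_shards(range(4))` would fetch only the first four shards.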

Citation

    @article{DBLP:journals/corr/abs-2111-02362,
      author     = {Hamzeh Motahari Khansari and Mehrnoush Shamsfard},
      title      = {HmBlogs: {A} big general Persian corpus},
      journal    = {CoRR},
      volume     = {abs/2111.02362},
      year       = {2021},
      url        = {https://arxiv.org/abs/2111.02362},
      eprinttype = {arXiv},
      eprint     = {2111.02362},
      timestamp  = {Fri, 05 Nov 2021 15:25:54 +0100},
      biburl     = {https://dblp.org/rec/journals/corr/abs-2111-02362.bib},
      bibsource  = {dblp computer science bibliography, https://dblp.org}
    }

The content above was collected and summarized by 遇见数据集.