JonaszPotoniec/dowcipy-polish-jokes-dataset

Name: JonaszPotoniec/dowcipy-polish-jokes-dataset
Creator: JonaszPotoniec
Published: 2024-02-15 21:54:24
License: MIT

Source: Hugging Face. Updated 2024-02-15; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/JonaszPotoniec/dowcipy-polish-jokes-dataset

Resource overview:

---
dataset_info:
  features:
  - name: joke
    dtype: string
  - name: upvotes
    dtype: int64
  - name: downvotes
    dtype: int64
  splits:
  - name: train
    num_bytes: 3074127
    num_examples: 9020
  download_size: 2061760
  dataset_size: 3074127
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- pl
pretty_name: Dowcipy jaja
tags:
- art
size_categories:
- 1K<n<10K
---

# Dataset consisting of Polish jokes

## Warning: Jokes were not curated, some may be offensive, stupid or simply not funny.

It's highly recommended to filter jokes before training, e.g., based on downvotes.

This dataset consists of all (9k) jokes dumped from [jeja.pl](https://dowcipy.jeja.pl/) on 2024-02-14. Jokes are submitted by the community. Besides _the funny_ text itself, I included upvotes and downvotes. You can use them for filtering. Default sorting is based on a combination of downvotes and upvotes.

If used for training LLMs, it's recommended to use a tokenizer that supports line breaks, as these are often important for readability of the jokes.

## Where to find me

- [GitHub](https://github.com/JonaszPotoniec)
- [LinkedIn](https://www.linkedin.com/in/jonasz-potoniec/)
- [E-mail](mailto:jonasz@potoniec.eu)
- [Telegram](https://t.me/JonaszPotoniec)

Provider:

JonaszPotoniec

Summary of original information

Dataset overview

Dataset information

Features:
- joke: string
- upvotes: int64
- downvotes: int64
Splits:
- train: 3074127 bytes, 9020 examples
Download size: 2061760 bytes
Dataset size: 3074127 bytes
Configs:
- default
  - Data files:
    - train: path data/train-*
License: MIT
Task categories:
- text-generation
Language:
- pl
Pretty name: Dowcipy jaja
Tags:
- art
Size category:
- 1K<n<10K
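
The figures in this metadata can be cross-checked with a little arithmetic. The sketch below is only an illustration: the constants are copied from the listing, while the helper name `in_size_category` is made up for this example. Note that the byte counts cover all three columns, so the average is not the length of the joke text alone.

```python
# Values copied from the dataset metadata listed above.
NUM_EXAMPLES = 9020
DATASET_SIZE_BYTES = 3_074_127   # size reported for the train split
DOWNLOAD_SIZE_BYTES = 2_061_760  # size of the files that are downloaded


def in_size_category(n: int) -> bool:
    """Return True if n falls inside the declared bucket 1K<n<10K."""
    return 1_000 < n < 10_000


# Average size of one record across the joke, upvotes and downvotes columns.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# How large the download is relative to the reported dataset size.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download / dataset size:   {download_ratio:.3f}")
print(f"inside 1K<n<10K bucket:    {in_size_category(NUM_EXAMPLES)}")
```

A record averages a few hundred bytes, which suggests most jokes are short, and the example count is consistent with the declared size category.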

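Dataset description

This dataset contains all of the roughly 9,000 Polish jokes (9,020 examples in the train split) collected from jeja.pl on 2024-02-14.
Jokes are submitted by the community; alongside the joke text, each record includes upvote and downvote counts, which can be used for filtering.
The default order is based on a combination of upvotes and downvotes.
When training a language model, use a tokenizer that preserves line breaks, since they are often important to a joke's readability.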
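
One way to check the line-break recommendation is a round-trip test: encode a multi-line string, decode it again, and compare the newline counts. The sketch below is a minimal illustration and not part of the dataset. The two toy tokenizers and the sample string are hypothetical stand-ins, chosen so that one of them loses line breaks and the other keeps them.

```python
from typing import Callable


def preserves_line_breaks(
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    text: str,
) -> bool:
    """Return True if encoding then decoding keeps every newline in text."""
    return decode(encode(text)).count("\n") == text.count("\n")


# Toy tokenizer 1 (hypothetical): splits on any whitespace, so newlines are lost.
def whitespace_encode(text: str) -> list:
    return text.split()


def whitespace_decode(tokens: list) -> str:
    return " ".join(tokens)


# Toy tokenizer 2 (hypothetical): one token per character, so newlines survive.
def char_encode(text: str) -> list:
    return list(text)


def char_decode(tokens: list) -> str:
    return "".join(tokens)


# Made-up two-line string used only as a test input, not taken from the dataset.
sample = "- Puk, puk!\n- Kto tam?"

print(preserves_line_breaks(whitespace_encode, whitespace_decode, sample))  # False
print(preserves_line_breaks(char_encode, char_decode, sample))              # True
```

The same function should work with a real tokenizer by passing in its own encode and decode callables, assuming they map a string to a token list and back.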

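Warning

The jokes were not curated and may be offensive, stupid, or simply not funny. Filtering before training, for example by downvote count, is strongly recommended.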
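
A filter along these lines could look like the sketch below. It is one possible approach, not the author's method: the thresholds `max_downvote_share` and `min_votes` are arbitrary assumptions to be tuned, and the rows are placeholders that only mimic the `joke`, `upvotes` and `downvotes` fields.

```python
from typing import List, TypedDict


class JokeRow(TypedDict):
    joke: str
    upvotes: int
    downvotes: int


def downvote_share(row: JokeRow) -> float:
    """Fraction of all votes that are downvotes; 0.0 when there are no votes."""
    total = row["upvotes"] + row["downvotes"]
    return row["downvotes"] / total if total else 0.0


def filter_jokes(
    rows: List[JokeRow],
    max_downvote_share: float = 0.3,  # assumed threshold, not from the dataset card
    min_votes: int = 5,               # assumed threshold, not from the dataset card
) -> List[JokeRow]:
    """Keep rows with enough votes and a low enough share of downvotes."""
    return [
        row
        for row in rows
        if row["upvotes"] + row["downvotes"] >= min_votes
        and downvote_share(row) <= max_downvote_share
    ]


# Placeholder rows with the dataset's three fields; the values are invented.
rows: List[JokeRow] = [
    {"joke": "placeholder A", "upvotes": 40, "downvotes": 5},
    {"joke": "placeholder B", "upvotes": 3, "downvotes": 20},
    {"joke": "placeholder C", "upvotes": 1, "downvotes": 0},
    {"joke": "placeholder D", "upvotes": 8, "downvotes": 2},
]

kept = filter_jokes(rows)
print([row["joke"] for row in kept])  # ['placeholder A', 'placeholder D']
```

Using the share of downvotes rather than the raw count avoids penalising jokes that simply received many votes, and the minimum-vote rule drops rows with too little feedback to judge.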
