JonaszPotoniec/dowcipy-polish-jokes-dataset
Source: Hugging Face. Updated 2024-02-15; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/JonaszPotoniec/dowcipy-polish-jokes-dataset
Resource description:
---
dataset_info:
  features:
  - name: joke
    dtype: string
  - name: upvotes
    dtype: int64
  - name: downvotes
    dtype: int64
  splits:
  - name: train
    num_bytes: 3074127
    num_examples: 9020
  download_size: 2061760
  dataset_size: 3074127
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- pl
pretty_name: Dowcipy jaja
tags:
- art
size_categories:
- 1K<n<10K
---
# Dataset consisting of Polish jokes
## Warning: Jokes were not curated; some may be offensive, stupid, or simply not funny. It's highly recommended to filter jokes before training, e.g., based on downvotes.
This dataset consists of all (9k) jokes dumped from [jeja.pl](https://dowcipy.jeja.pl/) on 2024-02-14. Jokes are submitted by the community. Besides _the funny_ text itself, I included upvotes and downvotes. You can use them for filtering.
Default sorting is based on a combination of downvotes and upvotes.
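
As a minimal sketch of the vote-based filtering suggested above: the thresholds (`MIN_VOTES`, `MIN_RATIO`) and the helper names are assumptions made for illustration, not values recommended by the dataset author. The column names `joke`, `upvotes`, and `downvotes` come from the dataset card. The optional loader assumes the Hugging Face `datasets` library is installed and is not needed for the pure-Python part.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical thresholds, chosen only for illustration.
MIN_VOTES = 5    # ignore jokes with too few votes to judge
MIN_RATIO = 0.6  # require at least 60% of votes to be upvotes


def vote_ratio(upvotes: int, downvotes: int) -> float:
    """Share of upvotes among all votes; 0.0 when there are no votes."""
    total = upvotes + downvotes
    return upvotes / total if total > 0 else 0.0


def keep_joke(row: Dict, min_votes: int = MIN_VOTES,
              min_ratio: float = MIN_RATIO) -> bool:
    """Decide whether a single record passes the vote-based filter."""
    total = row["upvotes"] + row["downvotes"]
    return (total >= min_votes
            and vote_ratio(row["upvotes"], row["downvotes"]) >= min_ratio)


def filter_jokes(rows: Iterable[Dict], min_votes: int = MIN_VOTES,
                 min_ratio: float = MIN_RATIO) -> Iterator[Dict]:
    """Yield only the records that pass keep_joke."""
    for row in rows:
        if keep_joke(row, min_votes, min_ratio):
            yield row


def load_filtered_train():
    """Optional: requires `pip install datasets` and network access."""
    from datasets import load_dataset

    ds = load_dataset("JonaszPotoniec/dowcipy-polish-jokes-dataset",
                      split="train")
    return ds.filter(keep_joke)
```

A ratio-based rule is only one option; a plain cap on `downvotes` or a score such as `upvotes - downvotes` may suit some uses better, so it is worth inspecting the vote distribution before fixing thresholds.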
If used for training LLMs, it's recommended to use a tokenizer that supports line breaks, as these are often important for readability of the jokes.
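
One way to check the line-break point is a round-trip test: encode a sample, decode it, and compare the number of newlines. The sketch below is tokenizer-agnostic; `preserves_line_breaks` and the two toy tokenizers are hypothetical helpers written for this illustration, and the sample text is made up rather than taken from the dataset.

```python
from typing import Callable, List, Sequence


def preserves_line_breaks(encode: Callable[[str], Sequence],
                          decode: Callable[[Sequence], str],
                          text: str) -> bool:
    """Return True if an encode/decode round trip keeps every newline."""
    return decode(encode(text)).count("\n") == text.count("\n")


# Toy tokenizer 1: splits on any whitespace, so newlines are lost.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


# Toy tokenizer 2: one token per character, so newlines survive.
def char_encode(text: str) -> List[str]:
    return list(text)


def char_decode(tokens: Sequence[str]) -> str:
    return "".join(tokens)


# Made-up two-line sample (not from the dataset).
SAMPLE = "- Puk, puk.\n- Kto tam?"

# With a Hugging Face tokenizer `tok` (assumption: transformers installed),
# the same check could be run as:
#   preserves_line_breaks(
#       lambda s: tok.encode(s, add_special_tokens=False),
#       lambda ids: tok.decode(ids),
#       SAMPLE,
#   )
```

Running the check on a handful of real multi-line jokes from the `joke` column gives a quick signal before committing to a tokenizer.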
## Where to find me
- [GitHub](https://github.com/JonaszPotoniec)
- [LinkedIn](https://www.linkedin.com/in/jonasz-potoniec/)
- [E-mail](mailto:jonasz@potoniec.eu)
- [Telegram](https://t.me/JonaszPotoniec)
Provider:
JonaszPotoniec
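Summary of original information
Dataset overview
Dataset information
- Features:
  - joke: string
  - upvotes: int64
  - downvotes: int64
- Splits:
  - train: 3074127 bytes, 9020 examples
- Download size: 2061760 bytes
- Dataset size: 3074127 bytes
- Configs:
  - default
    - Data files:
      - train: path data/train-*
- License: MIT
- Task categories:
  - text-generation
- Language:
  - pl
- Pretty name: Dowcipy jaja
- Tags:
  - art
- Size category:
  - 1K<n<10K
Dataset description
- The dataset contains all of the roughly 9,000 Polish jokes collected from jeja.pl on 2024-02-14.
- Jokes are submitted by the community. Besides the joke text itself, the dataset includes upvote and downvote counts, which can be used for filtering.
- Default sorting is based on a combination of upvotes and downvotes.
- If the dataset is used to train language models, a tokenizer that supports line breaks is recommended, because line breaks matter for the readability of the jokes.
Warning
- The jokes were not curated and may be offensive, stupid, or simply not funny. Filtering by downvotes before training is strongly recommended.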



