magyar-nlp-szine-java/reddit
Source: Hugging Face. Updated 2026-02-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/magyar-nlp-szine-java/reddit
Resource description:
---
license: cc-by-4.0
language:
- hu
---
# Reddit Dataset (Semantic Chunks)
Hungarian Reddit conversations dataset preprocessed with semantic chunking.
## Stats
| Metric | Value |
|---|---|
| **Rows** | 1,066,356 |
| **Tokens** | 42,313,152 |
| **Tokenizer** | `magyar-nlp-szine-java/exotic_modernbert_128k_tokenizer_modified` |
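
As a quick sanity check on the figures above, the average chunk length follows directly from the row and token totals. This is a minimal sketch that only redoes the arithmetic from the table; the variable names are illustrative and not part of the dataset.

```python
# Figures copied from the Stats table.
rows = 1_066_356
tokens = 42_313_152

# Mean chunk length in tokens, as counted by the tokenizer listed in the table.
mean_tokens_per_chunk = tokens / rows

print(f"{mean_tokens_per_chunk:.2f} tokens per chunk on average")  # prints 39.68 ...
```

The chunks are therefore short on average (roughly 40 tokens each), which is worth keeping in mind when packing them into longer training sequences.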
## Columns
- `text` - Chunked text content
- `token_count` - Token count per chunk
- `source_id` - Original source row index
- `chunk_id` - Unique chunk identifier
- `subreddit` - Source subreddit
- `type` - Submission or comment
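
Because each chunk carries the `source_id` of the row it came from, chunks can be regrouped per original post or comment. The sketch below assumes rows are available as plain dictionaries with the columns listed above; the sample rows, the `chunk_id` format, the subreddit name, and the `type` strings are hypothetical placeholders, and the helper name `group_chunks_by_source` is not part of the dataset.

```python
from collections import defaultdict


def group_chunks_by_source(rows):
    """Map each source_id to its chunk texts, in the order the rows arrive."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source_id"]].append(row["text"])
    return dict(grouped)


def tokens_per_source(rows):
    """Sum token_count over all chunks that share a source_id."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["source_id"]] += row["token_count"]
    return dict(totals)


# Hypothetical rows for illustration only; real values will differ.
sample_rows = [
    {"text": "first chunk", "token_count": 3, "source_id": 0,
     "chunk_id": "0-0", "subreddit": "example_subreddit", "type": "submission"},
    {"text": "second chunk", "token_count": 4, "source_id": 0,
     "chunk_id": "0-1", "subreddit": "example_subreddit", "type": "submission"},
    {"text": "a reply", "token_count": 2, "source_id": 1,
     "chunk_id": "1-0", "subreddit": "example_subreddit", "type": "comment"},
]

grouped = group_chunks_by_source(sample_rows)
totals = tokens_per_source(sample_rows)
print(grouped[0])  # chunk texts belonging to source row 0
print(totals)      # token totals keyed by source_id
```

Grouping preserves the incoming row order rather than sorting by `chunk_id`, since the card does not state how chunk identifiers are ordered.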
Provider:
magyar-nlp-szine-java



