alexliap/high-quality-gr-text
Hosted on Hugging Face. Updated 2026-02-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/alexliap/high-quality-gr-text
Description:
---
language:
- el
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
configs:
- config_name: finepdfs_el
  data_files: finepdfs_el/*.parquet
- config_name: fineweb_hq_el
  data_files: fineweb_hq_el/*.parquet
- config_name: finewiki_el
  data_files: finewiki_el/*.parquet
- config_name: wikipedia_el
  data_files: wikipedia_el/*.parquet
tags:
- llms
- pretraining
---
This dataset collects Greek-language text from four high-quality sources and is intended for LLM pretraining.
## Dataset Statistics
- **Total tokens:** ~21.1 billion (GPT-4 tokenizer)
- **Total records:** 5,032,854
### Token Distribution
- FineWeb2-HQ Greek: 14.6B tokens (68.9%)
- FinePDFs-Edu Greek: 5.1B tokens (24.0%)
- Wikipedia Greek: 752M tokens (3.6%)
- FineWiki Greek: 745M tokens (3.5%)
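As a quick sanity check, the shares above can be recomputed from the listed counts. This is a minimal sketch that uses only the rounded figures from this card; the variable and function names are illustrative. Because the inputs are rounded, the sum lands near 21.2B rather than the stated ~21.1B, and individual percentages may differ from the listed ones by about 0.1 point.

```python
# Token counts per source, in billions, copied from the list above (rounded).
tokens_b = {
    "FineWeb2-HQ Greek": 14.6,
    "FinePDFs-Edu Greek": 5.1,
    "Wikipedia Greek": 0.752,
    "FineWiki Greek": 0.745,
}

# Sum of the rounded per-source counts.
total_b = sum(tokens_b.values())


def share(name: str) -> float:
    """Percentage of the summed token count contributed by one source."""
    return 100.0 * tokens_b[name] / total_b


for name, count in tokens_b.items():
    print(f"{name}: {count:.3f}B tokens ({share(name):.1f}%)")
print(f"Sum of rounded counts: {total_b:.3f}B")
```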
## Dataset Structure
The dataset consists of 4 subsets, each representing a different data source:
### finepdfs_el
- **Files:** 1
- **Size:** 4.32 GB
- **Source:** FinePDFs-Edu - Educational PDF content in Greek
- **Repository:** [HuggingFaceFW/finepdfs-edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu)
### fineweb_hq_el
- **Files:** 2
- **Size:** 8.68 GB
- **Source:** FineWeb2-HQ - Filtered high-quality Greek web content
- **Repository:** [epfml/FineWeb2-HQ](https://huggingface.co/datasets/epfml/FineWeb2-HQ)
### finewiki_el
- **Files:** 1
- **Size:** 0.43 GB
- **Source:** FineWiki - High-quality Greek Wikipedia articles
- **Repository:** [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki)
### wikipedia_el
- **Files:** 1
- **Size:** 0.45 GB
- **Source:** Greek Wikipedia and Wikisource
- **Snapshots:** Wikipedia (20231101.el) + Wikisource (20231201.el)
- **Repository:** [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
## Schema
Each record in the dataset contains the following fields:
- **text** (string): The text content
- **url** (string): Source URL of the content
- **language** (string): Language code (always 'el' for Greek)
- **source** (string): Source dataset name (finewiki_el, fineweb_hq_el, finepdfs_el, or wikipedia_el)
- **token_count** (integer): Number of GPT-4 tokens in the text
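The field list above can be written down as a typed record with a small validity check. This is only a sketch: `Record`, `VALID_SOURCES`, and `is_valid` are hypothetical helper names that are not part of the dataset, and the sample record contains made-up values for illustration.

```python
from typing import TypedDict

# Source names listed in the schema above.
VALID_SOURCES = {"finewiki_el", "fineweb_hq_el", "finepdfs_el", "wikipedia_el"}


class Record(TypedDict):
    text: str
    url: str
    language: str
    source: str
    token_count: int


def is_valid(record: dict) -> bool:
    """Check that a record has the documented fields, types, and values."""
    expected_types = {
        "text": str,
        "url": str,
        "language": str,
        "source": str,
        "token_count": int,
    }
    for field, field_type in expected_types.items():
        if not isinstance(record.get(field), field_type):
            return False
    return (
        record["language"] == "el"
        and record["source"] in VALID_SOURCES
        and record["token_count"] >= 0
    )


# Made-up example record, for illustration only.
example: Record = {
    "text": "Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.",
    "url": "https://example.org/athens",
    "language": "el",
    "source": "finewiki_el",
    "token_count": 12,
}
print(is_valid(example))
```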
## Usage
Load a specific subset:
```python
from datasets import load_dataset
# Load the FineWiki Greek subset
wiki_ds = load_dataset("alexliap/high-quality-gr-text", "finewiki_el", split="train")
# Load the FineWeb2-HQ Greek subset (the largest one, 8.68 GB)
web_ds = load_dataset("alexliap/high-quality-gr-text", "fineweb_hq_el", split="train")
# Access the text and token count of the first FineWiki record
print(wiki_ds[0]["text"])
print(wiki_ds[0]["token_count"])
```
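Since the larger subsets run to several gigabytes, it may be convenient to stream records and stop after a fixed token budget instead of downloading a whole subset. The sketch below assumes the `streaming=True` option of `load_dataset` from the `datasets` library; `take_token_budget` and `stream_subset` are hypothetical helper names, not part of this dataset's code.

```python
from typing import Iterable, Iterator


def take_token_budget(records: Iterable[dict], budget: int) -> Iterator[dict]:
    """Yield records until adding the next one would exceed the token budget."""
    used = 0
    for record in records:
        used += record["token_count"]
        if used > budget:
            break
        yield record


def stream_subset(config_name: str):
    """Open one subset in streaming mode (requires network access)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "alexliap/high-quality-gr-text",
        config_name,
        split="train",
        streaming=True,
    )


# Example usage (downloads data on the fly):
# sample = list(take_token_budget(stream_subset("finewiki_el"), 1_000_000))
```

The budget is based on the `token_count` field, so it follows the GPT-4 tokenizer counts stored in the dataset rather than the tokenizer of whatever model is being trained.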
## License
Apache 2.0 (inherits from source datasets)
## Code Repository
The code used to build this dataset is available on GitHub:
[https://github.com/alexliap/high-quality-gr-text](https://github.com/alexliap/high-quality-gr-text)
## Citation
If you use this dataset, please cite the original sources:
- FineWiki: HuggingFaceFW/finewiki
- FineWeb2-HQ: epfml/FineWeb2-HQ
- FinePDFs-Edu: HuggingFaceFW/finepdfs-edu
- Wikipedia: Wikimedia Foundation
Provider:
alexliap



