ankushthakurr09/whiteswan_agentic_a1F
Source: Hugging Face, updated 2026-03-29, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ankushthakurr09/whiteswan_agentic_a1F
Resource description:
---
license: apache-2.0
task_categories:
- table-question-answering
language:
- en
tags:
- agentic_ai
- llms
- gemma3
- instruct_models
size_categories:
- 100K<n<1M
---
# Dataset Token Filter
This repository contains a Python script (`filter_dataset.py`) that filters JSONL datasets by the token count of each sample.
## Files
- `filter_dataset.py`: The script to process and filter the dataset.
- `whiteswan_agentic_129k_sft.jsonl`: The original dataset containing 129k entries.
- `whiteswan_filtered_under_4400.jsonl`: The output filtered dataset containing only the samples shorter than 4,400 tokens.
## How it Works
The script reads a `.jsonl` file line by line. It parses each line as JSON, recursively extracts the `content` field of every entry in the `messages` list, and counts the tokens with OpenAI's `tiktoken` library using the `cl100k_base` encoding (the encoding used by GPT-4).
If the total token count across all messages in a sample is **less than 4,400**, the entire original JSON object is written to the output `.jsonl` file.
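The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `filter_dataset.py`: the function names (`extract_text`, `count_sample_tokens`, `filter_lines`, `tiktoken_counter`) are hypothetical, and the assumption that `content` is either a string or a (possibly nested) list of strings is a guess at what "recursively" means here. The token counter is passed in as a parameter so the filtering logic can be tried out without `tiktoken` installed.

```python
import json
from typing import Callable, Iterable, Iterator

MAX_TOKENS = 4400  # threshold stated in this README


def extract_text(content) -> Iterator[str]:
    """Yield every string in a `content` value, descending into nested lists.

    Assumption: other shapes (None, dicts, numbers) are ignored.
    """
    if isinstance(content, str):
        yield content
    elif isinstance(content, list):
        for item in content:
            yield from extract_text(item)


def count_sample_tokens(sample: dict, count_tokens: Callable[[str], int]) -> int:
    """Sum token counts over the `content` of every message in one sample."""
    total = 0
    for message in sample.get("messages", []):
        for text in extract_text(message.get("content")):
            total += count_tokens(text)
    return total


def filter_lines(
    lines: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> Iterator[str]:
    """Yield the original JSON lines whose samples have fewer than max_tokens tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        if count_sample_tokens(sample, count_tokens) < max_tokens:
            yield line


def tiktoken_counter() -> Callable[[str], int]:
    """Build a counter backed by tiktoken's cl100k_base encoding (imported lazily)."""
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() makes text such as "<|endoftext|>" encode as ordinary text
    # instead of raising an error.
    return lambda text: len(encoding.encode(text, disallowed_special=()))
```

To reproduce the filtered file under these assumptions, one would open `whiteswan_agentic_129k_sft.jsonl` for reading, pass the file object and `tiktoken_counter()` to `filter_lines`, and write each yielded line plus a newline to `whiteswan_filtered_under_4400.jsonl`.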
## Setup Requirements
The only third-party dependency is `tiktoken`, which the script attempts to install automatically on first run.
If the automatic install fails, install it manually:
```bash
pip install tiktoken
```
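This README does not show how the script performs the automatic install. One common pattern, given here only as a sketch with a hypothetical helper name (`ensure_package`), is to attempt the import and fall back to invoking `pip` through the current interpreter:

```python
import importlib
import subprocess
import sys


def ensure_package(name: str):
    """Import `name`, installing it with pip first if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Use the running interpreter's pip so the package lands in the same environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        importlib.invalidate_caches()
        return importlib.import_module(name)
```

With this helper, `tiktoken = ensure_package("tiktoken")` would return the module whether or not it was already installed, provided `pip` is available and the machine has network access.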
## Running the Script
Simply execute the Python script:
```bash
python3 filter_dataset.py
```
The script reports its progress every 10,000 lines and, on completion, prints the number of samples processed and the number kept.
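The reporting behavior can be pictured with a small loop like the one below. It is a sketch only: the function name `run_with_progress`, the `keep` predicate, and the message wording are assumptions, and the real script's output format may differ.

```python
from typing import Callable, Iterable, Tuple


def run_with_progress(
    lines: Iterable,
    keep: Callable[[object], bool],
    report_every: int = 10_000,
    log: Callable[[str], None] = print,
) -> Tuple[int, int]:
    """Count processed and kept items, logging at a fixed interval and at the end."""
    processed = 0
    kept = 0
    for line in lines:
        processed += 1
        if keep(line):
            kept += 1
        if processed % report_every == 0:
            log(f"Processed {processed:,} lines, kept {kept:,}")
    log(f"Done. Processed {processed:,} samples, kept {kept:,}.")
    return processed, kept
```

Passing `log` as a parameter keeps the counting logic separate from where the messages go, so the same loop can print to the console or append to a list.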
Provider:
ankushthakurr09



