ankushthakurr09/whiteswan_agentic_a1F

Name: ankushthakurr09/whiteswan_agentic_a1F
Creator: ankushthakurr09
Published: 2026-03-29 12:33:16
License: no description available

Source: Hugging Face. Updated 2026-03-29. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ankushthakurr09/whiteswan_agentic_a1F


Description:

---
license: apache-2.0
task_categories:
- table-question-answering
language:
- en
tags:
- agentic_ai
- llms
- gemma3
- instruct_models
size_categories:
- 100K<n<1M
---

# Dataset Token Filter

This repository/folder contains a Python script (`filter_dataset.py`) for filtering JSONL datasets by their token context lengths.

## Files

- `filter_dataset.py`: The script that processes and filters the dataset.
- `whiteswan_agentic_129k_sft.jsonl`: The original dataset containing 129k entries.
- `whiteswan_filtered_under_4400.jsonl`: The filtered output dataset containing only the samples shorter than 4,400 tokens.

## How it Works

The script iterates through a `.jsonl` file line by line. It parses each line as JSON, recursively extracts the `content` field from the `messages` list, and counts the tokens using OpenAI's `tiktoken` library (with the `cl100k_base` encoding, the one used by GPT-4). If the total number of tokens across all messages in that sample is **less than 4,400**, the entire original JSON object is written to the output `.jsonl` file.

## Setup Requirements

The only library dependency is `tiktoken`, which the script attempts to install automatically on the first run. If you need to install it manually:

```bash
pip install tiktoken
```

## Running the Script

Execute the Python script:

```bash
python3 filter_dataset.py
```

The script reports its progress every 10,000 lines and, when finished, prints the final counts of processed and kept samples.
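
## Example Sketch

The filtering logic described under "How it Works" could look roughly like the sketch below. This is not the actual contents of `filter_dataset.py`; the function names (`iter_content`, `sample_tokens`, `keep_sample`, `filter_file`, `tiktoken_counter`) are hypothetical, and the sketch assumes each sample is a JSON object whose `messages` list holds string `content` values. The token counter is passed in as a parameter so the core logic can be exercised without `tiktoken` installed.

```python
import json
from typing import Any, Callable, Iterator, Tuple

MAX_TOKENS = 4400  # threshold stated in the README


def iter_content(node: Any) -> Iterator[str]:
    """Recursively yield every string stored under a `content` key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "content" and isinstance(value, str):
                yield value
            else:
                yield from iter_content(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_content(item)


def tiktoken_counter() -> Callable[[str], int]:
    """Build a token counter backed by tiktoken's cl100k_base encoding."""
    import tiktoken  # imported lazily so the rest runs without it

    enc = tiktoken.get_encoding("cl100k_base")
    # Assumption: treat special-token text (e.g. "<|endoftext|>") as plain
    # text instead of raising, since chat data may contain such strings.
    return lambda text: len(enc.encode(text, disallowed_special=()))


def sample_tokens(sample: dict, count: Callable[[str], int]) -> int:
    """Sum the token counts of all message contents in one sample."""
    return sum(count(text) for text in iter_content(sample.get("messages", [])))


def keep_sample(sample: dict, count: Callable[[str], int],
                max_tokens: int = MAX_TOKENS) -> bool:
    """Return True when the sample is strictly under the token limit."""
    return sample_tokens(sample, count) < max_tokens


def filter_file(src_path: str, dst_path: str, count: Callable[[str], int],
                max_tokens: int = MAX_TOKENS) -> Tuple[int, int]:
    """Copy qualifying lines from src_path to dst_path; return (processed, kept)."""
    processed = kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            processed += 1
            if keep_sample(json.loads(line), count, max_tokens):
                dst.write(line.rstrip("\n") + "\n")  # write the original object
                kept += 1
            if processed % 10_000 == 0:
                print(f"processed {processed}, kept {kept}")
    return processed, kept
```

Under these assumptions, a full run would be `filter_file("whiteswan_agentic_129k_sft.jsonl", "whiteswan_filtered_under_4400.jsonl", tiktoken_counter())`. Writing the original line back, rather than re-serializing the parsed object, keeps the output byte-identical to the input for every kept sample.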

Provider:

ankushthakurr09
