
ankushthakurr09/whiteswan_agentic_a1F

Source: Hugging Face. Updated 2026-03-29. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ankushthakurr09/whiteswan_agentic_a1F
Resource description:
---
license: apache-2.0
task_categories:
- table-question-answering
language:
- en
tags:
- agentic_ai
- llms
- gemma3
- instruct_models
size_categories:
- 100K<n<1M
---

# Dataset Token Filter

This repository contains a Python script (`filter_dataset.py`) that filters JSONL datasets by token context length.

## Files

- `filter_dataset.py`: The script that processes and filters the dataset.
- `whiteswan_agentic_129k_sft.jsonl`: The original dataset, containing 129k entries.
- `whiteswan_filtered_under_4400.jsonl`: The filtered output dataset, containing only the samples shorter than 4,400 tokens.

## How it Works

The script iterates through a `.jsonl` file line by line. It parses each line as JSON, recursively extracts the `content` field from each entry in the `messages` list, and counts the tokens using OpenAI's `tiktoken` library with the `cl100k_base` encoding (the encoding used by GPT-4).

If the total number of tokens across all messages in a sample is **less than 4,400**, the entire original JSON object is written to the output `.jsonl` dataset.

## Setup Requirements

The only library dependency is `tiktoken`, which the script attempts to install automatically on the first run. If you need to install it manually:

```bash
pip install tiktoken
```

## Running the Script

Execute the Python script:

```bash
python3 filter_dataset.py
```

The script reports its progress every 10,000 lines and, when it finishes, prints the final counts of processed and kept samples.
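The filtering logic described under "How it Works" can be sketched roughly as follows. This is not the contents of `filter_dataset.py`, which is not reproduced here; it is a minimal illustration under stated assumptions. The function names (`iter_content`, `sample_tokens`, `filter_lines`, `tiktoken_counter`) are hypothetical, and the sketch assumes that nested message content is reached only through `content` keys. The token counter is passed in as a callable so the filter itself does not depend on `tiktoken` being installed.

```python
import json
from typing import Callable, Iterable, Iterator


def iter_content(node) -> Iterator[str]:
    """Yield every string reachable through `content` keys, descending into lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_content(item)
    elif isinstance(node, dict) and "content" in node:
        yield from iter_content(node["content"])


def sample_tokens(sample: dict, count_tokens: Callable[[str], int]) -> int:
    """Sum the token counts of all message contents in one sample."""
    messages = sample.get("messages", [])
    return sum(count_tokens(text) for text in iter_content(messages))


def filter_lines(
    lines: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 4400,
) -> Iterator[str]:
    """Yield the original JSON lines whose samples total fewer than max_tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        if sample_tokens(sample, count_tokens) < max_tokens:
            yield line


def tiktoken_counter() -> Callable[[str], int]:
    """Build a counter backed by tiktoken's cl100k_base encoding (imported lazily)."""
    import tiktoken  # assumption: installed via `pip install tiktoken`

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats special-token strings as ordinary text
    # instead of raising an error when they appear in the data.
    return lambda text: len(enc.encode(text, disallowed_special=()))
```

To use it, open the input file for reading and the output file for writing, then write each line yielded by `filter_lines(infile, tiktoken_counter())` followed by a newline. Yielding the original line rather than re-serializing the parsed object keeps each kept sample byte-for-byte identical to its source entry.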
Provided by:
ankushthakurr09