
AdaMLLab/WebTerminal

Hugging Face · Updated 2026-02-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/AdaMLLab/WebTerminal
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- agent
- coding
- terminal
- shell
- pretrain
- pretraining
- agentic
- llm
- web
- fineweb
- dclm
pretty_name: WebTerminal
size_categories:
- 1M<n<10M
configs:
- config_name: clean
  data_files:
  - split: train
    path: clean/*.parquet
  default: true
- config_name: unfiltered
  data_files:
  - split: train
    path: unfiltered/*.parquet
---

# Terminal/CLI Web Text

![webterminal](webterminal.png)

A filtered extract of terminal and command-line content from two large web-text corpora, designed for upsampling agentic-adjacent data during pretraining.

## Subsets

| Subset | Rows | Tokens | Size | Quality |
|---|---|---|---|---|
| **`clean`** (default) | 2.33M | 4.6B | 11 GB | ~98% terminal content |
| `unfiltered` | 61.3M | 359B | 962 GB | ~15% terminal content |

```python
from datasets import load_dataset

# Load the clean subset (default)
ds = load_dataset("AdaMLLab/WebTerminal")

# Load the unfiltered subset
ds = load_dataset("AdaMLLab/WebTerminal", "unfiltered")
```

## Sources

- **DCLM** (`Zyphra/dclm-dedup`)
- **FineWeb** (`Salesforce/fineweb_deduplicated`)

## How it was built

### v0.1 Unfiltered

1. **Fast filter**: skip any document that doesn't contain obvious CLI indicators (`$`, `sudo`, `pip install`, `` ```bash ``, `root@`, etc.)
2. **Score**: remaining docs are scored (0-34) across five signals, each with a per-match point value and a cap:

   | Filter | Description | Points | Cap |
   |---|---|---|---|
   | Prompt patterns | Shell prompts like `$ cmd`, `user@host:~$`, `>>>`, `root@`, `PS C:\` | 2 per match | 10 |
   | CLI commands | Known commands: `sudo`, `apt-get`, `pip install`, `git clone`, `docker run`, `curl`, `ssh`, `gcc`, etc. (30+ patterns) | 1 per unique match | 8 |
   | stdout patterns | Output indicators: "successfully installed", "cloning into", `drwx` (ls output), "packets transmitted", "traceback", version strings | 2 per match | 6 |
   | Code blocks | Terminal-flavored code blocks: `` ```bash ``, `` ```shell ``, `<pre><code>`, terminal/console div classes | 2 per match | 6 |
   | Indented blocks | 3+ consecutive lines indented 4+ spaces (code/output blocks) | 1 per match | 4 |

   Documents scoring >=5 are kept.
3. **Dedup**: exact dedup across both datasets using xxhash64 on full text. Removed 1,168 duplicates.

### v0.2 Clean

The unfiltered subset is ~84-86% noise at lower score levels (5-12), which make up 93% of the data. The root cause is that v0.1's scoring uses context-blind keyword matching: CLI command names like `find`, `make`, and `cat` appear in normal English prose, bare `$` matches currency amounts, and indented Python/SQL code gets scored as terminal content.

v0.2 applies a three-stage structural filter over the unfiltered data:

1. **Context-aware**: instead of matching bare `$`, requires `$ sudo`, `$ git`, `$ docker`, etc. (dollar sign + space + known command). Eliminates ~87% of documents immediately.
2. **Validation regex**: confirms that a genuine structural terminal pattern exists (shell prompts followed by real commands, `user@host:~$` patterns, Python REPL `>>>`, tracebacks, `` ```bash `` code blocks, Unix file permission listings, man page headers, shebangs).
3. **Weighted structural scoring** (`term_score_v2`): each pattern has a weight (1-3) and occurrences are capped. Documents need `term_score_v2 >= 3` to be kept.

   | Weight | Signal | Max |
   |---|---|---|
   | 3 | Command prompts (`$ cmd` at line start) | 9 |
   | 3 | SSH prompts (`user@host:~$`) | 9 |
   | 2 | Python REPL, file listings, tracebacks, terminal code blocks, git/docker ops, Windows prompts, man pages | 2-6 each |
   | 1 | Install output, systemd units, shebangs, sudo commands | 1 each |

No indentation-based scoring. No context-blind command substring matching.

**Result**: 3.8% of the unfiltered data survives, from 61.3M rows down to 2.33M rows. Quality jumps from ~15% to ~98% terminal/CLI content.

## Schema

### Clean subset

| Column | Type | Description |
|---|---|---|
| `text` | string | Document text |
| `term_score` | int32 | Original v0.1 score (5-34) |
| `term_score_v2` | int32 | Structural score from v0.2 filter (3+) |

### Unfiltered subset

| Column | Type | Description |
|---|---|---|
| `text` | string | Document text |
| `term_score` | int32 | Original v0.1 score (5-34) |

## Stats

### Clean (v0.2)

- **2,334,414 rows** | **4.6B tokens** (Llama-3.2-1B tokenizer) | **11 GB**
- 62 parquet files, ~169-185 MB each, snappy compressed

### Unfiltered (v0.1)

- **61,341,278 rows** | **359B tokens** | **962 GB**
- 4,187 parquet files, ~180-240 MB each, snappy compressed

| v0.1 Score | Count | % |
|---|---|---|
| 5 | 39,025,201 | 63.62% |
| 6 | 10,787,199 | 17.59% |
| 7 | 4,063,886 | 6.63% |
| 8 | 2,911,983 | 4.75% |
| 9-14 | 3,594,547 | 5.86% |
| 15-34 | 958,462 | 1.56% |

## Use case

Upsampling agentic-adjacent data during pretraining. The `clean` subset is recommended for most use cases. The `unfiltered` subset is available for researchers who want to apply their own filtering.
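
## Illustrative scoring sketch

For readers who want to apply their own filtering, the capped scoring from v0.1 and the context-aware prompt check from v0.2 (both described under "How it was built") can be approximated as below. This is a minimal sketch, not the pipeline that produced the dataset: the point values, caps, and threshold come from the v0.1 table, while the signal names, the `term_score` and `has_contextual_prompt` helpers, and the shortened `KNOWN_COMMANDS` list are assumptions made for illustration. The sketch also takes match counts as input rather than guessing at the original detection patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str    # illustrative label, not an identifier from the dataset
    points: int  # points awarded per match
    cap: int     # maximum contribution of this signal


# Point values and caps are taken from the v0.1 table.
V01_SIGNALS = (
    Signal("prompt_patterns", 2, 10),
    Signal("cli_commands", 1, 8),   # counted per unique command
    Signal("stdout_patterns", 2, 6),
    Signal("code_blocks", 2, 6),
    Signal("indented_blocks", 1, 4),
)
V01_KEEP_THRESHOLD = 5  # documents scoring >= 5 are kept in v0.1


def term_score(match_counts: dict[str, int]) -> int:
    """Sum each signal's points, clipping every signal at its cap."""
    return sum(
        min(s.points * match_counts.get(s.name, 0), s.cap) for s in V01_SIGNALS
    )


# Assumed, abbreviated command list; the real v0.2 list is not given here.
KNOWN_COMMANDS = ("sudo", "git", "docker", "pip", "apt-get", "curl", "ssh")
CONTEXT_AWARE_PROMPT = re.compile(
    r"\$ (?:" + "|".join(re.escape(c) for c in KNOWN_COMMANDS) + r")\b"
)


def has_contextual_prompt(text: str) -> bool:
    """True if a dollar sign is followed by a space and a known command."""
    return CONTEXT_AWARE_PROMPT.search(text) is not None
```

With these definitions, the caps sum to 34, which matches the stated 0-34 range, and a document with seven prompt matches and three unique commands scores 10 + 3 = 13. The context-aware check accepts `$ git clone ...` but rejects prose such as "costs $20", which is the failure mode of bare `$` matching that v0.2 was built to avoid.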
Provider:
AdaMLLab