yoonholee/style-eval-corpus
Source: Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/yoonholee/style-eval-corpus
Description:
---
license: cc0-1.0
language:
- en
tags:
- style
- writing
- evaluation
- inverse-reward-design
- authorship
size_categories:
- 10K<n<100K
pretty_name: Style Eval Corpus (11 Writers)
---
# Style Eval Corpus
Writing from 11 internet writers with instantly recognizable but distinct styles. Built for style-as-reward-inference experiments: given a writer's body of work, infer their implicit reward function and encode it as eval components (rubrics, classifiers, probes).
## Contents
| Author | Pieces | Words | Register | Source |
|---|---|---|---|---|
| Paul Graham | 229 | 562,990 | Contrarian startup essays | [paulgraham.com](http://paulgraham.com/articles.html) |
| Andrej Karpathy | 34 | 101,337 | Tutorial-as-thinking-aloud | [karpathy.github.io](http://karpathy.github.io/) + [bearblog](https://karpathy.bearblog.dev/) |
| Gwern Branwen | 220 | 2,800,357 | Quantitative empiricist | [gwern.net](https://gwern.net/) (CC-0) |
| dril | 7,494 | 150,557 | Absurdist authority | [crumb/dril-tweets](https://huggingface.co/datasets/crumb/dril-tweets) |
| Donald Trump | 46,683 | 924,881 | Superlative combative | [fschlatt/trump-tweets](https://huggingface.co/datasets/fschlatt/trump-tweets) |
| Derek Sivers | 546 | 231,786 | Zen-minimalist aphorisms | [sive.rs](https://sive.rs/) |
| Maciej Ceglowski | 353 | 524,614 | Sardonic literary tech critic | [idlewords.com](https://idlewords.com/) |
| Scott Alexander | 137 | 560,874 | Dense calibrated reasoning | [astralcodexten.com](https://www.astralcodexten.com/) |
| Naval Ravikant | 52 | 46,538 | Compressed aphoristic wisdom | [navalmanack.com](https://www.navalmanack.com/) (CC) |
| Joel Spolsky | 209 | 123,984 | Narrative tech management | [joelonsoftware.com](https://www.joelonsoftware.com/) |
| Eliezer Yudkowsky | 1,496 | 4,753,270 | Rationalist pedagogy | [lesswrong.com](https://www.lesswrong.com/) |
| **Total** | **57,453** | **10,781,188** | | |
## Design
The corpus spans several deliberate contrasts:
- **Long-form analytical** (PG, Karpathy, Gwern, Scott Alexander, Ceglowski, Joel, Eliezer) vs **short-form** (dril, Trump, Sivers, Naval). Same eval framework, different registers.
- **Same genre, different rewards**: PG / Karpathy / Gwern / Scott Alexander / Eliezer all write tech/rationalist essays but with radically different implicit reward functions. Within-genre discrimination is the hard task.
- **Corpus size varies deliberately**: Karpathy (34 posts) and Naval (52 chapters) are the few-shot setting. Trump (46K tweets) and Eliezer (1.5K posts) are data-rich. This tests generalization from thin versus thick reference sets.
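The size contrast can be made concrete with a short sketch. The piece counts below are copied from the Contents table; the tier names and the two thresholds (`few_shot_max`, `data_rich_min`) are hypothetical choices for illustration, not something the corpus defines.

```python
# Piece counts per author, copied from the Contents table.
PIECES = {
    "Paul Graham": 229,
    "Andrej Karpathy": 34,
    "Gwern Branwen": 220,
    "dril": 7494,
    "Donald Trump": 46683,
    "Derek Sivers": 546,
    "Maciej Ceglowski": 353,
    "Scott Alexander": 137,
    "Naval Ravikant": 52,
    "Joel Spolsky": 209,
    "Eliezer Yudkowsky": 1496,
}


def tier_by_size(pieces, few_shot_max=100, data_rich_min=1000):
    """Bucket authors by reference-set size.

    Both thresholds are assumptions made for this sketch.
    """
    tiers = {"few_shot": [], "mid": [], "data_rich": []}
    for author, count in sorted(pieces.items()):
        if count <= few_shot_max:
            tiers["few_shot"].append(author)
        elif count >= data_rich_min:
            tiers["data_rich"].append(author)
        else:
            tiers["mid"].append(author)
    return tiers


tiers = tier_by_size(PIECES)
total_pieces = sum(PIECES.values())
```

With these assumed thresholds, Karpathy and Naval fall in the few-shot tier and Trump and Eliezer in the data-rich tier, which matches the settings named above. Other cutoffs would move the boundary authors.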
## Schema
| Column | Type | Description |
|---|---|---|
| `author` | string | Writer name |
| `title` | string or null | Title (null for tweets) |
| `url` | string | Source URL |
| `text` | string | Full text (markdown for essays, plain for tweets) |
| `word_count` | int | Word count |
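As a minimal sketch of working with this schema, the snippet below checks a row against the column types in the table and aggregates pieces and words per author. The helper names (`validate_row`, `summarize`) and the sample rows are made up for illustration; the rows are placeholders, not real corpus data.

```python
from collections import defaultdict

# Column types as described in the schema table; `title` may be null.
EXPECTED_TYPES = {
    "author": str,
    "title": (str, type(None)),
    "url": str,
    "text": str,
    "word_count": int,
}


def validate_row(row):
    """Return a list of schema problems; an empty list means the row looks fine."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(f"bad type for {column}: {type(row[column]).__name__}")
    return problems


def summarize(rows):
    """Count pieces and sum word_count per author."""
    totals = defaultdict(lambda: {"pieces": 0, "words": 0})
    for row in rows:
        totals[row["author"]]["pieces"] += 1
        totals[row["author"]]["words"] += row["word_count"]
    return dict(totals)


# Placeholder rows (hypothetical, not taken from the corpus).
sample_rows = [
    {"author": "Author A", "title": "Example essay",
     "url": "https://example.com/a/1", "text": "one two three", "word_count": 3},
    {"author": "Author A", "title": None,
     "url": "https://example.com/a/2", "text": "four five", "word_count": 2},
    {"author": "Author B", "title": None,
     "url": "https://example.com/b/1", "text": "six", "word_count": 1},
]

summary = summarize(sample_rows)
```

With the Hugging Face `datasets` library installed, real rows could presumably be obtained with `load_dataset("yoonholee/style-eval-corpus", split="train")` and passed to `summarize`. The split name is an assumption, since the card does not list splits.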
## License
- **Gwern Branwen**: CC-0 (public domain).
- **Naval Ravikant**: Almanack is CC-licensed, explicitly free.
- **Eliezer Yudkowsky**: LessWrong content is CC BY 4.0.
- **dril / Trump tweets**: sourced from existing HF datasets.
- **All others**: publicly posted writing, included for research use under fair use.
The curation and parsing work is released under CC0.
Provider:
yoonholee



