
yoonholee/style-eval-corpus

Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/yoonholee/style-eval-corpus
Description:
---
license: cc0-1.0
language:
- en
tags:
- style
- writing
- evaluation
- inverse-reward-design
- authorship
size_categories:
- 10K<n<100K
pretty_name: Style Eval Corpus (11 Writers)
---

# Style Eval Corpus

Writing from 11 internet writers with instantly recognizable but distinct styles. Built for style-as-reward-inference experiments: given a writer's body of work, infer their implicit reward function and encode it as eval components (rubrics, classifiers, probes).

## Contents

| Author | Pieces | Words | Register | Source |
|---|---|---|---|---|
| Paul Graham | 229 | 562,990 | Contrarian startup essays | [paulgraham.com](http://paulgraham.com/articles.html) |
| Andrej Karpathy | 34 | 101,337 | Tutorial-as-thinking-aloud | [karpathy.github.io](http://karpathy.github.io/) + [bearblog](https://karpathy.bearblog.dev/) |
| Gwern Branwen | 220 | 2,800,357 | Quantitative empiricist | [gwern.net](https://gwern.net/) (CC-0) |
| dril | 7,494 | 150,557 | Absurdist authority | [crumb/dril-tweets](https://huggingface.co/datasets/crumb/dril-tweets) |
| Donald Trump | 46,683 | 924,881 | Superlative combative | [fschlatt/trump-tweets](https://huggingface.co/datasets/fschlatt/trump-tweets) |
| Derek Sivers | 546 | 231,786 | Zen-minimalist aphorisms | [sive.rs](https://sive.rs/) |
| Maciej Ceglowski | 353 | 524,614 | Sardonic literary tech critic | [idlewords.com](https://idlewords.com/) |
| Scott Alexander | 137 | 560,874 | Dense calibrated reasoning | [astralcodexten.com](https://www.astralcodexten.com/) |
| Naval Ravikant | 52 | 46,538 | Compressed aphoristic wisdom | [navalmanack.com](https://www.navalmanack.com/) (CC) |
| Joel Spolsky | 209 | 123,984 | Narrative tech management | [joelonsoftware.com](https://www.joelonsoftware.com/) |
| Eliezer Yudkowsky | 1,496 | 4,753,270 | Rationalist pedagogy | [lesswrong.com](https://www.lesswrong.com/) |
| **Total** | **57,453** | **10,781,188** | | |

## Design

The corpus spans several deliberate contrasts:

- **Long-form analytical** (PG, Karpathy, Gwern, Scott Alexander, Ceglowski, Joel, Eliezer) vs **short-form** (dril, Trump, Sivers, Naval). Same eval framework, different registers.
- **Same genre, different rewards**: PG / Karpathy / Gwern / Scott Alexander / Eliezer all write tech/rationalist essays but with radically different implicit reward functions. Within-genre discrimination is the hard task.
- **Corpus size varies deliberately**: Karpathy (34 posts) and Naval (52 chapters) are the few-shot setting. Trump (46K tweets) and Eliezer (1.5K posts) are data-rich. Tests generalization from thin vs thick reference sets.

## Schema

| Column | Type | Description |
|---|---|---|
| `author` | string | Writer name |
| `title` | string or null | Title (null for tweets) |
| `url` | string | Source URL |
| `text` | string | Full text (markdown for essays, plain for tweets) |
| `word_count` | int | Word count |

## License

- **Gwern Branwen**: CC-0 (public domain).
- **Naval Ravikant**: Almanack is CC-licensed, explicitly free.
- **Eliezer Yudkowsky**: LessWrong content is CC BY 4.0.
- **dril / Trump tweets**: sourced from existing HF datasets.
- **All others**: publicly posted writing, research use under fair use.

Curation/parsing released under CC0.
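As a rough illustration of the Schema table, the sketch below models one record and aggregates pieces and words per author in the same shape as the Contents table. The column names and types come from the schema. Everything else is an assumption made for illustration: the `Row` type, the `validate_row` and `summarize` helpers, and the sample rows are hypothetical and are not part of the dataset or its tooling.

```python
from collections import defaultdict
from typing import Optional, TypedDict


class Row(TypedDict):
    """One record; field names and types follow the Schema table."""
    author: str
    title: Optional[str]  # null (None) for tweets
    url: str
    text: str
    word_count: int


def validate_row(row: dict) -> bool:
    """Return True if `row` has exactly the five documented columns
    with the documented types (hypothetical helper)."""
    expected = {
        "author": str,
        "title": (str, type(None)),
        "url": str,
        "text": str,
        "word_count": int,
    }
    if set(row) != set(expected):
        return False
    return all(isinstance(row[key], kind) for key, kind in expected.items())


def summarize(rows):
    """Count pieces and sum word_count per author (hypothetical helper)."""
    stats = defaultdict(lambda: {"pieces": 0, "words": 0})
    for row in rows:
        stats[row["author"]]["pieces"] += 1
        stats[row["author"]]["words"] += row["word_count"]
    return dict(stats)


# Invented sample rows for illustration only; not taken from the corpus.
sample_rows = [
    {"author": "Example Essayist", "title": "An Essay",
     "url": "https://example.com/a", "text": "one two three", "word_count": 3},
    {"author": "Example Tweeter", "title": None,
     "url": "https://example.com/t/1", "text": "short post", "word_count": 2},
    {"author": "Example Tweeter", "title": None,
     "url": "https://example.com/t/2", "text": "another short post", "word_count": 3},
]

assert all(validate_row(r) for r in sample_rows)
summary = summarize(sample_rows)
print(summary)
```

If the corpus is loaded with the Hugging Face `datasets` library (for example via `load_dataset("yoonholee/style-eval-corpus")`; the split names are not stated in this card), each record should be a plain dict that can be passed to the same helpers, although the strict column check may need loosening if the hosted files carry extra fields.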
Provider:
yoonholee