The Authorship of Stephen King's Books Written Under the Pseudonym "Richard Bachman": A Stylometric Analysis (data)

Indexed by NIAID Data Ecosystem on 2026-05-01

Download link:

https://zenodo.org/record/7956048

Description:

This data accompanies a paper for the 2nd Annual Conference for Computational Literary Studies: "The Authorship of Stephen King’s Books Written Under the Pseudonym 'Richard Bachman': A Stylometric Analysis".

Abstract: Between 1977 and 1984, Stephen King published five novels under the pseudonym “Richard Bachman”. Reviewers noted similarities between King’s and Bachman’s writing styles when Thinner (1984) was published, ultimately leading to King’s unmasking. We investigate, using the Juola protocol, whether computational techniques can correctly identify King as the author of the Bachman books out of a selection of contemporary candidate authors: Dean Koontz, Peter Straub, and Thomas Harris. We also perform a post-hoc analysis of the use of pop-culture references and brand names in Bachman, King, Koontz, Straub, and Harris novels, based on comments in reviews of Bachman and King novels. The references extracted from the Bachman books occurred significantly more often in King’s texts than in the others’, showing that attentive readers could have “heard King’s voice” in the Bachman books through what a reviewer denigratingly called King’s “compulsion to list brand-name products and his affinity for pop-cult teenage junk”. These results contribute to the vexed issue of explainability, which is a recurrent challenge in author identification for literary texts.

Below is a description of each file in this repository:

bachman_segments_features_array_1000token_segments.csv, bachman_segments_features_array_5000token_segments.csv, and bachman_segments_features_array_10000token_segments.csv contain the feature spaces created by vectorizing 1,000-, 5,000-, and 10,000-token segments of Bachman, King, Koontz, and Straub books. Each row of the csv files contains the vectorized segment, the segment's author, the book the segment was drawn from, the book's publication date, and the segment number.
bachman_segments_author_candidate_cosine_distances_1000token_segments.csv, bachman_segments_author_candidate_cosine_distances_5000token_segments.csv, and bachman_segments_author_candidate_cosine_distances_10000token_segments.csv contain the Bachman segment number, the bootstrap iteration number (from 0 to 9,999), the distractor author of the randomly sampled segment, and the cosine distance between the Bachman segment vector and the distractor author's randomly sampled segment vector (calculated using the data stored in the bachman_segments_features_array_1000token_segments.csv, bachman_segments_features_array_5000token_segments.csv, and bachman_segments_features_array_10000token_segments.csv files).

bachman_segments_author_candidate_ranks_1000token_segments.csv, bachman_segments_author_candidate_ranks_5000token_segments.csv, and bachman_segments_author_candidate_ranks_10000token_segments.csv contain the same columns as the three cosine-distance files, but the cosine distance between the Bachman segment and the distractor author's segment is converted to a ranking. For each bootstrap iteration there are 4 rows (one for each candidate author), each containing the distance ranking between the Bachman segment and a candidate author's segment. In a particular bootstrap iteration, if a King segment had the smallest cosine distance to a Bachman segment, King has the ranking "1"; if a Koontz segment had the second-smallest distance to that Bachman segment, Koontz has the ranking "2"; and so on.

predicted_author_candidate_raw_counts_1000token_segments.csv, predicted_author_candidate_raw_counts_5000token_segments.csv, and predicted_author_candidate_raw_counts_10000token_segments.csv contain the total number of times King, Straub, Harris, and Koontz segments received a certain distance ranking in the rank files.
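The distance-to-rank conversion and the raw-count tally described above can be illustrated with a minimal sketch. This is not the authors' code: the function names (`cosine_distance`, `rank_candidates`, `tally_ranks`) are hypothetical, and the three-dimensional vectors are toy values invented for illustration rather than rows from the feature-array files.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_candidates(bachman_vec, candidate_vecs):
    """Rank candidate authors for one bootstrap iteration (1 = smallest distance)."""
    distances = {
        author: cosine_distance(bachman_vec, vec)
        for author, vec in candidate_vecs.items()
    }
    ordered = sorted(distances, key=distances.get)
    return {author: rank for rank, author in enumerate(ordered, start=1)}

def tally_ranks(rank_rows):
    """Count how often each (author, rank) pair occurs across iterations."""
    counts = Counter()
    for ranks in rank_rows:
        for author, rank in ranks.items():
            counts[(author, rank)] += 1
    return counts

# Toy data (assumption): one Bachman segment and one sampled segment per candidate.
bachman_segment = [1.0, 2.0, 0.0]
sampled = {
    "King": [2.0, 4.0, 0.0],    # same direction as the Bachman vector
    "Koontz": [1.0, 1.0, 0.0],
    "Straub": [1.0, 0.0, 0.0],
    "Harris": [0.0, 0.0, 1.0],  # orthogonal to the Bachman vector
}
ranks = rank_candidates(bachman_segment, sampled)
raw_counts = tally_ranks([ranks, ranks])  # pretend two iterations gave the same ranks
```

In the real files each bootstrap iteration would draw a fresh random segment per candidate author, so the ranks vary from one iteration to the next; the tally step stays the same.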
predicted_author_candidate_proportions_1000token_segments.csv, predicted_author_candidate_proportions_5000token_segments.csv, and predicted_author_candidate_proportions_10000token_segments.csv contain a Bachman book title and the percentage of that book's segments that received each distance ranking (1-4) for each author. For example, in predicted_author_candidate_proportions_10000token_segments.csv, The Long Walk's segments were ranked as most similar (rank "1") to King segments in 73.3% of bootstrap iterations.

pop_culture_refs_counts_books_10000token_segments.csv contains, for each randomly sampled 10,000-token segment, the author and title of the book it was drawn from, the iteration number (from 0 to 99), and the number of pop-culture references found in the segment that match those extracted from Bachman books.
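As a rough illustration of how per-book rank percentages of this kind could be derived from rank rows, here is a small sketch. The function name `rank_proportions`, the `(book, author, rank)` row layout, and the toy rows are all assumptions made for the example; the actual column names in the CSV files are not given in this description, and the toy numbers are not results from the dataset.

```python
from collections import defaultdict

def rank_proportions(rows):
    """Return {(book, author, rank): percentage of that book/author's rows with that rank}."""
    totals = defaultdict(int)  # rows per (book, author)
    hits = defaultdict(int)    # rows per (book, author, rank)
    for book, author, rank in rows:
        totals[(book, author)] += 1
        hits[(book, author, rank)] += 1
    return {key: 100.0 * n / totals[key[:2]] for key, n in hits.items()}

# Toy rows (assumption): four iterations for one book and two candidate authors.
toy_rows = [
    ("The Long Walk", "King", 1), ("The Long Walk", "Koontz", 2),
    ("The Long Walk", "King", 1), ("The Long Walk", "Koontz", 2),
    ("The Long Walk", "King", 1), ("The Long Walk", "Koontz", 2),
    ("The Long Walk", "King", 2), ("The Long Walk", "Koontz", 1),
]
proportions = rank_proportions(toy_rows)
```

With these toy rows, King holds rank 1 in three of four iterations, so `proportions[("The Long Walk", "King", 1)]` is 75.0.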


Created:

2023-10-02