Game Walkthrough Corpus (GWTC)
Mendeley Data. Updated 2024-05-10; indexed 2024-06-27.
Download link:
https://zenodo.org/records/4562336
Resource description:
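Motivation
The Game Walkthrough Corpus (GWTC) contains 12,295 unique walkthrough documents that cover a total of 6,117 games. For each walkthrough, it provides unigram and bigram frequencies, treating the document as a bag of words. It also provides word frequencies at the sentence level. In addition, the GWTC contains game-related metadata, including title, publisher, developer, year, and genre, among others. All language statistics and metadata are stored in separate plain text files and can be referenced by uniform resource names (URNs). These URNs can also be used to derive any combination of statistics and metadata. Researchers can, for instance, investigate the most frequent unigrams for games in the "Adventure" genre. The GWTC can thus be reused for many kinds of research questions about gaming language, an approach that may be summarized as "distant playing".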
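The genre example can be sketched as follows. This is a minimal illustration, not the project's own tooling: the URN strings, the in-memory dictionaries, and the function name top_unigrams_for_genre are hypothetical stand-ins for data that would be parsed from the plain text files, whose exact layout is not described here.

```python
from collections import Counter

# Hypothetical stand-ins for parsed corpus files (the real layout may differ).
# Per-document unigram frequencies, keyed by a made-up document URN.
bag_of_words = {
    "urn:example:doc:1": {"key": 4, "door": 3, "open": 2},
    "urn:example:doc:2": {"door": 2, "sword": 7},
    "urn:example:doc:3": {"lap": 6, "car": 2},
}
# Made-up mapping from document URN to game URN.
doc_to_game = {
    "urn:example:doc:1": "urn:example:game:a",
    "urn:example:doc:2": "urn:example:game:a",
    "urn:example:doc:3": "urn:example:game:b",
}
# Made-up genre metadata per game URN.
game_genres = {
    "urn:example:game:a": {"Adventure"},
    "urn:example:game:b": {"Racing"},
}


def top_unigrams_for_genre(genre, n=10):
    """Sum unigram counts over all documents whose game has the given genre."""
    totals = Counter()
    for doc_urn, freqs in bag_of_words.items():
        game_urn = doc_to_game.get(doc_urn)
        if genre in game_genres.get(game_urn, set()):
            totals.update(freqs)
    return totals.most_common(n)


print(top_unigrams_for_genre("Adventure", n=3))
```

Because every table is keyed by an identifier, the same join pattern would apply to any other metadata field, such as publisher or release year.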
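Copyright Information
Game walkthroughs are protected by individual copyright notices that are often very strict. For this reason, the data set does not include the documents themselves; it provides various data formats that are useful for text mining and distant reading methods but do not allow the documents to be recreated. It is highly unlikely that even a single sentence can be reconstructed from the published data. Because only text mining statistics about the documents are published, and not the documents themselves, not even in part, the project does not violate copyright. Links to the original documents are available in the sourceUrls file in the data folder.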
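To illustrate how a text is reduced to such statistics, the sketch below turns a short example into unigram counts, bigram counts, and per-sentence word frequencies. The tokenizer (a lowercase regular expression match) and the naive sentence splitter are simplifying assumptions; the project's createdata script may proceed differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (a simplistic assumption)."""
    return re.findall(r"[a-zäöüß']+", text.lower())


def split_sentences(text):
    """Naive split on '.', '!' and '?' (assumed; real splitters are more careful)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def reduce_document(text):
    """Return unigram counts, bigram counts and per-sentence word counts."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)
    # Adjacent token pairs; in this sketch pairs may span sentence boundaries.
    bigrams = Counter(zip(tokens, tokens[1:]))
    per_sentence = [Counter(tokenize(s)) for s in split_sentences(text)]
    return unigrams, bigrams, per_sentence


text = "Open the door. Take the key and open the chest."
unigrams, bigrams, per_sentence = reduce_document(text)
print(unigrams["the"])            # how often "the" occurs in the document
print(bigrams[("open", "the")])   # how often "open the" occurs
print(per_sentence[0])            # word counts of the first sentence, order discarded
```

Only counts survive this reduction; the order of words within a sentence and the order of sentences within a document are not stored.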
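File Information
data folder: document data
bagofwords: Word frequencies per document
bigrams: Bigram frequencies per document
corpusstats: Min, avg and max of token count, type count, type/token ratio, and documents per game, plus the corresponding standard deviation
game_walkthrough_mapping: Documents per game
game_walkthrough_mapping: Number of documents per game
sentencecollocations: Word frequencies per sentence per document
sourceUrls: Links to the original texts
textlength: Number of characters per document
tfidf_deu: Word significance per document (German)
tfidf_eng: Word significance per document (English)
tokencount: Number of words per document
typecount: Number of unique words per document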
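As a rough illustration of how the derived files relate to the word frequencies, the following sketch computes token count, type count, and type/token ratio from a bag of words, summarizes a value across documents in the manner of corpusstats, and computes a tf-idf score. The sample data and function names are hypothetical. The tf-idf variant shown (relative term frequency times log inverse document frequency) and the use of the population standard deviation are assumptions, not necessarily what was used to produce tfidf_deu, tfidf_eng, or corpusstats.

```python
import math
import statistics

# Hypothetical per-document word frequencies (stand-in for bagofwords).
docs = {
    "doc1": {"open": 2, "the": 3, "door": 1},
    "doc2": {"the": 4, "boss": 2, "sword": 2},
    "doc3": {"the": 1, "door": 1},
}


def token_count(freqs):
    """Tokens: all word occurrences in the document."""
    return sum(freqs.values())


def type_count(freqs):
    """Types: distinct words in the document."""
    return len(freqs)


def type_token_ratio(freqs):
    tokens = token_count(freqs)
    return type_count(freqs) / tokens if tokens else 0.0


def summarize(values):
    """Min, average, max and population standard deviation of a series."""
    values = list(values)
    return {
        "min": min(values),
        "avg": statistics.mean(values),
        "max": max(values),
        "std": statistics.pstdev(values),
    }


def tf_idf(word, doc_id):
    """Relative term frequency times log inverse document frequency."""
    freqs = docs[doc_id]
    tf = freqs.get(word, 0) / token_count(freqs)
    df = sum(1 for f in docs.values() if word in f)
    return tf * math.log(len(docs) / df) if df else 0.0


print(type_token_ratio(docs["doc1"]))
print(summarize(token_count(f) for f in docs.values()))
print(tf_idf("boss", "doc2"))
```

A word that occurs in every document, such as "the" in this sample, receives a tf-idf score of zero under this formulation, which is why such scores are read as word significance per document.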
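metadata folder: game metadata
File names that do not start with "_": metadata [filename] per game
_all: All metadata in one file
_mapping_release_date*: Metadata combined with release dates, for time series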
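doc folder: documentation
createdata: Python script to create the content of the data folder
extractMetainformation: Python script to create the content of the metadata folder
metadata_rawg: Game metadata collected from RAWG
metadata_steam: Game metadata collected from Steam
metadata_symbol: Quality control; relation of the text in the source HTML to the extracted text
titlesandurns: Game titles mapped to project identifiers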
Walkthrough Sources
https://portforward.com/games/walkthroughs/
https://www.neoseeker.com
https://www.spieletipps.de
https://jayisgames.com/
http://gamesetter.com/
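Corpus Statistics
Number of unique games: 6,013
Number of documents: 12,295
Genre associations: 3,806
Gameplay tags: 10,246
Release dates: 2,443
Developers: 3,152
Publishers: 2,782
Steam IDs: 1,086
Platform associations: 5,293 (PC, Gameboy, iOS, Linux, ...)
Game language associations: 4,631
Languages: English, German, and a small amount of French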
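External Resources
Project website: https://www.informatik.uni-leipzig.de/~jtiepmar/forschung/gwtc/
Bitbucket: https://bitbucket.org/jtiepmar/the-game-walkthrough-corpus/src/master/
Two versions of the GWTC are available for download. Ver. 0.99 contains all of the corpus files listed above, plus the Git files. Note that after downloading ver. 0.99, the Git folders may be hidden by default, depending on your operating system. Ver. 1.0 is a cleaned-up version that comes without the Git files.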
Created:
2023-06-28



