Corpus of daily jokes from the 24ur.com portal Šale24 1.0

Indexed from hdl.handle.net on 2025-01-08
Download link:
http://hdl.handle.net/11356/1945
Resource description:
This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from the portal's archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations.

Readers of the news portal vote on the jokes with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication.

The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus.

Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to one day, starting from the day of data collection and counting backwards. Each voting event has an associated estimated publication date calculated with this procedure.

The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.
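The punctuation normalisation and the date estimation described above can be sketched roughly as follows. This is a minimal illustration and not the corpus authors' code: the function names are hypothetical, and the sketch assumes that entries are ordered newest first and that the most recent entry falls on the collection date itself, which the description does not state explicitly.

```python
from datetime import date, timedelta

# Scrape date stated in the corpus description.
COLLECTION_DATE = date(2024, 9, 18)


def normalise(text: str) -> str:
    """Replace two consecutive apostrophes (U+0027 U+0027) with one U+0022."""
    return text.replace("\u0027\u0027", "\u0022")


def estimate_dates(n_entries: int, start: date = COLLECTION_DATE) -> list[date]:
    """Assign one day per entry, counting backwards from `start`.

    Assumption: entry 0 is the most recent one and is dated `start`.
    """
    return [start - timedelta(days=i) for i in range(n_entries)]
```

For example, `normalise("''Zakaj?''")` turns the doubled apostrophes into straight quotation marks, and `estimate_dates(3)` yields September 18th, 17th and 16th, 2024.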
The JSONL file contains entries representing individual jokes containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
- votes for (key: votes.for)
- votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)

The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
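As a rough illustration of how such a file could be consumed, the sketch below sums the votes over all publications of each joke. It assumes that every object in the votes list carries the keys "for" and "against" (inferred from the votes.for and votes.against notation) with numeric values; the helper name and the sample record are invented for the example and are not taken from the corpus.

```python
import json
from typing import Iterable


def vote_totals(lines: Iterable[str]) -> dict[str, tuple[int, int]]:
    """Map each joke hash to (total votes for, total votes against)."""
    totals: dict[str, tuple[int, int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        joke = json.loads(line)
        votes = joke.get("votes", [])
        # Assumed layout: each vote object holds "for" and "against" counts.
        up = sum(int(v.get("for", 0)) for v in votes)
        down = sum(int(v.get("against", 0)) for v in votes)
        totals[joke["hash"]] = (up, down)
    return totals


# Invented sample record: one joke published twice.
sample = [
    json.dumps({
        "hash": "abc123",
        "original_text": "''Zakaj?''",
        "normalised_text": "\"Zakaj?\"",
        "votes": [
            {"for": 10, "against": 3},
            {"for": 5, "against": 1},
        ],
    }),
    "",
]
totals = vote_totals(sample)
print(totals)  # {'abc123': (15, 4)}
```

With the real file, the same function can be fed an open file handle (for instance `open(path, encoding="utf-8")`, where `path` is wherever the downloaded JSONL was saved), since a file object iterates over its lines.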

Provided by:
hdl.handle.net