tip-of-my-tongue-known-item-search
TOMT-KIS (tip-of-my-tongue-known-item-search) Dataset
Overview
- Name: TOMT-KIS (tip-of-my-tongue-known-item-search)
- Language: English
- Tags: information retrieval, TREC, tip-of-my-tongue, known-item-search, natural language processing
- Size: 1M<n<10M
- License: Apache 2.0
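Description
TOMT-KIS is a large-scale dataset of 1.28 million known-item questions from the r/tipofmytongue subreddit. It was used in the QPP++@ECIR23 paper for research on performance prediction for known-item questions.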
Citation
If you use the TOMT-KIS dataset, please cite the following paper:
@InProceedings{froebe:2023c,
  author =    {Maik Fr{\"o}be and Eric Oliver Schmidt and Matthias Hagen},
  booktitle = {QPP++ 2023: Query Performance Prediction and Its Evaluation in New Tasks},
  month =     apr,
  publisher = {CEUR-WS.org},
  series =    {CEUR Workshop Proceedings},
  site =      {Dublin, Ireland},
  title =     {{A Large-Scale Dataset for Known-Item Question Performance Prediction}},
  year =      2023
}
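Data Structure
The TOMT-KIS dataset is provided in JSONL format. Each question contains all crawled attributes, plus the chosen answer whenever our heuristic was able to extract one.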
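Because every JSONL line is an independent JSON object, the file can be streamed one question at a time instead of being loaded in full. The following is a minimal sketch, assuming UTF-8 text with one valid JSON object per line; the helper names `iter_questions` and `count_with_answer` are illustrative and not part of any official tooling for the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_questions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one question dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def count_with_answer(lines: Iterable[str]) -> int:
    """Count questions that carry a non-empty chosen_answer attribute."""
    return sum(1 for q in iter_questions(lines) if q.get("chosen_answer"))
```

Both helpers accept any iterable of lines, so an open file handle works directly, for example `with open("tomt-kis.jsonl", encoding="utf-8") as f: print(count_with_answer(f))`, where `tomt-kis.jsonl` is a hypothetical local file name.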
Data Instances
    {
      "id": "2gbnla",
      "author": "alany611",
      "url": "http://www.reddit.com/r/tipofmytongue/comments/2gbnla/tomt_1990s_educational_cartoon_for_kids_to_learn/",
      "permalink": "/r/tipofmytongue/comments/2gbnla/tomt_1990s_educational_cartoon_for_kids_to_learn/",
      "title": "[TOMT] 1990s Educational Cartoon for kids to learn French",
      "content": "Hi all,\nWhen I was really young, 3-5, I remember watching a cartoon that I think was supposed to teach kids French. I would guess it was made from 1990-1995, but possibly earlier.\nIt was in color and the episodes I remember featured a guy with a long, narrow, and crooked nose and greenish skin teaching kids how to count? There was also a scene that had some character running up a clock tower to change the time.\nOverall, it was a pretty gloomy feel, iirc, and Id love to see it again if possible.",
      "created_utc": "1410647042",
      "link_flair_text": "Solved",
      "comments": [
        {
          "author": "scarpoochi",
          "body": "Muzzy?\nhttps://www.youtube.com/watch?v=mD9i39GENWU",
          "created_utc": "1410649099",
          "score": 11,
          "comments": [
            {
              "author": "alany611",
              "body": "thank you!!!",
              "created_utc": "1410666273",
              "score": 1
            }
          ]
        },
        {
          "author": "pepitica",
          "body": "Muzzy! Its been driving me crazy for a while now!",
          "created_utc": "1410649896",
          "score": 6
        }
      ],
      "answer_detected": true,
      "solved_utc": "1410649099",
      "chosen_answer": "Muzzy?\nhttps://www.youtube.com/watch?v=mD9i39GENWU",
      "links_on_answer_path": [
        "https://www.youtube.com/watch?v=mD9i39GENWU"
      ]
    }
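In the instance above, replies are nested under each comment's own `comments` key. A depth-first traversal of such a tree might look like the sketch below. It assumes this nesting holds for all questions; `walk_comments` is a hypothetical helper, and the string branch is a guess that covers the case where the tree is delivered as a JSON-encoded string rather than a parsed list.

```python
import json
from typing import Iterator, Tuple


def walk_comments(comments, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, comment) pairs in depth-first order."""
    if isinstance(comments, str):
        # Assumption: the tree may arrive as a JSON-encoded string.
        comments = json.loads(comments) if comments else []
    for comment in comments or []:
        yield depth, comment
        # Replies, when present, sit under the same "comments" key.
        yield from walk_comments(comment.get("comments"), depth + 1)


# Trimmed copy of the discussion tree from the example instance.
thread = [
    {
        "author": "scarpoochi",
        "score": 11,
        "comments": [{"author": "alany611", "score": 1}],
    },
    {"author": "pepitica", "score": 6},
]

for depth, comment in walk_comments(thread):
    print("  " * depth + comment["author"])
```

The indentation of each printed author name reflects how deep the comment sits in the thread.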
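Data Fields
TOMT-KIS provides 128 attributes for every question, including:
- id (string): the unique Reddit identifier of the question
- title (string): the title of the question
- content (string): the main text body of the question
- created_utc (date): the timestamp at which the question was posted
- link_flair_text (string): indicates whether the question is solved; set by moderators
- comments (string, json): the complete discussion tree for each question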
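The `link_flair_text` and `created_utc` fields can be combined to select solved questions and place them on a timeline. The sketch below assumes the flair text is exactly `"Solved"` and that `created_utc` holds Unix epoch seconds stored as a string, as in the example instance; other flair values or timestamp formats may exist in the full dataset. The helpers `is_solved` and `posted_at` are illustrative names.

```python
from datetime import datetime, timezone


def is_solved(question: dict) -> bool:
    """True if the moderator-set flair marks the question as solved."""
    # Assumption: the flair text is exactly "Solved", as in the example instance.
    return question.get("link_flair_text") == "Solved"


def posted_at(question: dict) -> datetime:
    """Convert created_utc (assumed Unix epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(question["created_utc"]), tz=timezone.utc)


# Subset of the attributes from the example instance.
question = {
    "id": "2gbnla",
    "created_utc": "1410647042",
    "link_flair_text": "Solved",
}

if is_solved(question):
    print(question["id"], posted_at(question).isoformat())
```

Returning a timezone-aware datetime avoids silently interpreting the timestamp in the local time zone of the machine that runs the code.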
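For questions that moderators flagged as solved, we ran a precision-oriented answer identification heuristic and added four "new" attributes whenever the heuristic was able to identify an answer:
- answer_detected (boolean): indicates whether our heuristic was able to extract an answer
- solved_utc (date): the timestamp at which the identified answer was posted
- chosen_answer (string): the extracted answer
- links_on_answer_path (list of strings): links to all pages outside Reddit that were found in posts between the question and the answer post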
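One use of these attributes is to measure how long a question stayed open before the identified answer was posted. The following is a minimal sketch, assuming `solved_utc` and `created_utc` are both Unix epoch seconds (string or integer), as in the example instance; `seconds_to_answer` is a hypothetical helper name.

```python
from typing import Optional


def seconds_to_answer(question: dict) -> Optional[int]:
    """Seconds between posting and the detected answer, or None if no answer."""
    if not question.get("answer_detected"):
        return None
    # Assumption: both timestamps are Unix epoch seconds (string or int).
    return int(question["solved_utc"]) - int(question["created_utc"])


# Subset of the attributes from the example instance.
question = {
    "created_utc": "1410647042",
    "answer_detected": True,
    "solved_utc": "1410649099",
    "chosen_answer": "Muzzy?\nhttps://www.youtube.com/watch?v=mD9i39GENWU",
    "links_on_answer_path": ["https://www.youtube.com/watch?v=mD9i39GENWU"],
}

print(seconds_to_answer(question))  # 2057
```

For the example instance the difference is 2057 seconds, so the answer arrived roughly 34 minutes after the question was posted.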




