GroNLP/ik-nlp-22_slp
Dataset Card for IK-NLP-22 Speech and Language Processing
Dataset Description
Dataset Summary
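This dataset contains chapters extracted from the book Speech and Language Processing by Jurafsky and Martin (third edition draft, January 2022) through a semi-automatic procedure (see details below). It also provides a small set of conceptual questions for each chapter, together with possible answers.
Only chapters 2 to 11 of the book draft are provided, since these are the chapters relevant to the 2022 edition of the Natural Language Processing course of the Information Science Master's Degree (IK) at the University of Groningen, taught by Arianna Bisazza with the assistance of Gabriele Sarti.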
Languages
The language data of Speech and Language Processing is in English (BCP-47 en).
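Dataset Structure
Data Instances
The dataset contains two configurations: paragraphs (default), which contains the full set of parsed paragraphs associated with their chapters and sections, and questions, which contains a small set of example questions matched with the relevant paragraph and the answer span.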
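Paragraphs Configuration
The paragraphs configuration contains all the paragraphs of the selected book chapters, each associated with its chapter, section and subsection. An example from the train split of the paragraphs configuration is shown below. The example belongs to section 2.3 but not to any subsection, so the n_subsection and subsection fields are empty strings.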
```json
{
  "n_chapter": "2",
  "chapter": "Regular Expressions",
  "n_section": "2.3",
  "section": "Corpora",
  "n_subsection": "",
  "subsection": "",
  "text": "Its also quite common for speakers or writers to use multiple languages in a single communicative act, a phenomenon called code switching. Code switching (2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)"
}
```
The text is provided as-is, without further preprocessing or tokenization.
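Since every paragraph record carries its chapter, section and subsection labels, records can be regrouped by section with a few lines of Python. The following is a minimal sketch on hypothetical in-memory records shaped like the example above; the helper names `group_by_section` and `has_subsection` and the placeholder texts are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict

# Hypothetical records with the same fields as the `paragraphs` example;
# the texts are placeholders, not actual dataset content.
records = [
    {"n_chapter": "2", "chapter": "Regular Expressions", "n_section": "2.3",
     "section": "Corpora", "n_subsection": "", "subsection": "",
     "text": "first paragraph of 2.3"},
    {"n_chapter": "2", "chapter": "Regular Expressions", "n_section": "2.3",
     "section": "Corpora", "n_subsection": "", "subsection": "",
     "text": "second paragraph of 2.3"},
    {"n_chapter": "2", "chapter": "Regular Expressions", "n_section": "2.1",
     "section": "Regular Expressions", "n_subsection": "2.1.1",
     "subsection": "Basic Regular Expressions",
     "text": "a paragraph of 2.1.1"},
]


def group_by_section(rows):
    """Map each section number to the list of paragraph texts it contains."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["n_section"]].append(row["text"])
    return dict(grouped)


def has_subsection(row):
    """A paragraph outside any subsection has an empty `n_subsection` string."""
    return row["n_subsection"] != ""


sections = group_by_section(records)
print(sorted(sections))                      # section numbers seen in the records
print([has_subsection(r) for r in records])  # which records sit in a subsection
```

Checking for the empty string rather than `None` follows the convention described above for paragraphs that do not belong to a subsection.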
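Questions Configuration
The questions configuration contains a small set of questions, each paired with the top retrieved paragraph relevant to the question and with the answer span. An example from the test split of the questions configuration is shown below.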
```json
{
  "chapter": "Regular Expressions",
  "section": "Regular Expressions",
  "subsection": "Basic Regular Expressions",
  "question": "What is the meaning of the Kleene star in Regex?",
  "paragraph": "This language consists of strings with a b, followed by at least two as, followed by an exclamation point. The set of operators that allows us to say things like \"some number of as\" are based on the asterisk or *, commonly called the Kleene * (gen-Kleene * erally pronounced \"cleany star\"). The Kleene star means \"zero or more occurrences of the immediately previous character or regular expression\". So /a*/ means \"any string of zero or more as\". This will match a or aaaaaa, but it will also match Off Minor since the string Off Minor has zero as. So the regular expression for matching one or more a is /aa*/, meaning one a followed by zero or more as. More complex patterns can also be repeated. So /[ab]*/ means \"zero or more as or bs\" (not \"zero or more right square braces\"). This will match strings like aaaa or ababab or bbbb.",
  "answer": "The Kleene star means \"zero or more occurrences of the immediately previous character or regular expression\""
}
```
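In the example above the answer appears verbatim inside the paragraph, so its character offsets can be recovered with a plain substring search. The sketch below assumes that this exact-match property holds for a given record, which the card does not guarantee for every question. The helper `find_answer_span` is a hypothetical name, and the paragraph is shortened for readability.

```python
def find_answer_span(paragraph, answer):
    """Return (start, end) character offsets of `answer` in `paragraph`.

    Returns None when the answer is not an exact substring of the paragraph.
    """
    start = paragraph.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Shortened version of the `questions` example; not the full paragraph.
example = {
    "paragraph": (
        'The Kleene star means "zero or more occurrences of the immediately '
        'previous character or regular expression". So /a*/ means "any '
        'string of zero or more as".'
    ),
    "answer": (
        'The Kleene star means "zero or more occurrences of the immediately '
        'previous character or regular expression"'
    ),
}

span = find_answer_span(example["paragraph"], example["answer"])
if span is not None:
    start, end = span
    # Slicing with the offsets gives back the answer text.
    print(example["paragraph"][start:end] == example["answer"])
```

If a record's answer was paraphrased or normalized, the function returns `None` and a fuzzier matching strategy would be needed.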
Data Splits
| config | train | test |
|---|---|---|
| paragraphs | 1697 | - |
| questions | - | 59 |
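Because each configuration ships exactly one split, a loader has to request the right one. Below is a minimal sketch using the Hugging Face `datasets` library. The repository id is taken from the first line of this card, while the helper `load_config` and the `SPLIT_FOR_CONFIG` mapping are illustrative assumptions. Calling the helper requires `datasets` to be installed and network access to the Hub.

```python
# Splits listed in the table above: one split per configuration.
SPLIT_FOR_CONFIG = {"paragraphs": "train", "questions": "test"}


def load_config(config="paragraphs"):
    """Load the only available split of a configuration from the Hub."""
    if config not in SPLIT_FOR_CONFIG:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so this helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "GroNLP/ik-nlp-22_slp", config, split=SPLIT_FOR_CONFIG[config]
    )
```

For instance, `load_config("questions")` would request the test split, and an unknown configuration name is rejected before any download starts.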
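Dataset Creation
The contents of the Speech and Language Processing book PDF were extracted using AllenAI's PDF to S2ORC JSON Converter. The extracted text was then manually cleaned to remove end-of-chapter exercises and other irrelevant content (e.g. tables, TikZ figures). Some issues in the parsed content were deliberately kept in the final version to preserve a naturalistic setting and to encourage students to apply data filtering heuristics.
The question-answer pairs were created manually by Gabriele Sarti.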
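As an illustration of the kind of data filtering heuristic mentioned above, the sketch below drops paragraphs that are very short or that consist mostly of non-alphabetic characters, such as leftover equation numbers or axis ticks. The function name `looks_clean` and both thresholds are arbitrary assumptions chosen for this example, not values used by the dataset authors.

```python
def looks_clean(text, min_chars=40, min_alpha_ratio=0.6):
    """Heuristic filter: keep text that is long enough and mostly letters/spaces."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return alpha / len(stripped) >= min_alpha_ratio


# Hypothetical inputs: one ordinary sentence and two kinds of parsing residue.
samples = [
    "Its also quite common for speakers or writers to use multiple "
    "languages in a single communicative act.",
    "(2.2)",
    "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24",
]

kept = [s for s in samples if looks_clean(s)]
print(len(kept))
```

Thresholds like these trade recall for precision, so they should be tuned on a sample of the data before being applied to the whole corpus.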
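Additional Information
Dataset Curators
For problems with this dataset, please contact us at ik-nlp-course@rug.nl.
Licensing Information
Please refer to the authors' website for licensing information.
Citation Information
If you use this corpus in your work, please cite the authors: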
```bibtex
@book{slp3ed-iknlp2022,
  author = {Jurafsky, Daniel and Martin, James},
  year = {2021},
  month = {12},
  pages = {1--235, 1--19},
  title = {Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition},
  volume = {3}
}
```



