
Cognitive Grammar Sample Language Extracts (Sentences)

Source: DataONE. Updated: 2021-05-24. Indexed: 2024-06-08.
Download link:
https://search.dataone.org/view/sha256:0f52647d1d5796ed1a70db48c3f030c785eb00381db18e3cdc35c0d43f4b9985
Description:
This data set accompanies the published article "Age-Related Hearing Loss, Speech Understanding and Cognitive Technologies" by Joseph Lehmann, Nathaniel Christen, Yechiel Michael, and Israel Gannot (International Journal of Speech Technology), as well as several essays by Nathaniel Christen which analyze the collection of language examples comprising the data set from the perspective of Cognitive Grammar. The original published article can be found online at https://link.springer.com/article/10.1007/s10772-021-09817-z and the complete data set and code base are on GitHub at https://github.com/scignscape/ntxh/tree/ctg (this is the branch where the current data set is hosted, rather than master). As discussed in the published article, the authors use this data set to illustrate theoretical and technological concerns preliminary to the goal of designing an integrated language for describing various facets of linguistic analysis from the perspective of Cognitive Grammar, including lexical, syntactic, morphological, prosodic, discursive, and epistemic/conceptual structures which might be identified within the patterns, processing, or rules/conventions manifest in language samples. The data set displays the samples in several formats, among them a new markup language for annotations and metadata which covers different aspects of linguistic description, including corpus metadata, prosody, and parse graphs. Accompanying code documents how the samples' markup is parsed and transformed into sample collections as well as article text. This specific markup format is experimental; the larger point is to represent the different requirements for code implementing a general-purpose annotation and metadata language for Cognitive Grammar.
Although individual theorists working in the overall Cognitive Linguistics tradition have developed their own systems for annotating linguistic structures or parsing/semantic data, as well as for diagramming cognitive processes or dispositions judged to be activated by and/or manifest in processing specific language examples, we are not aware of an attempt to systematically design a common language of linguistic/cognitive structures that could be used to annotate Cognitive Grammar corpora. In particular, we believe it would be useful to construct corpora from Cognitive-Linguistic research by aggregating language examples used as case studies in Cognitive-Linguistic publications. This data set illustrates the kind of software ecosystem which could facilitate the implementation of such a research program, including code to extract language samples from publications constructed via special-purpose markup languages (the parsers and document generators for one such language are included in an accompanying code repository); code to generate machine-readable representations of publication text along with metadata identifying textual features such as sentence boundaries, and a custom PDF viewer to leverage this metadata for UI features; code to aggregate linguistic samples into reusable corpora or data sets; and software tools to implement data-set applications where these corpora can be examined in stand-alone fashion. Code provided with the data set compiles to a "data-set application" where the samples can be examined interactively. The data-set code also includes a number of utility scripts covering different parts of the document-generation and data-aggregation pipeline. For this current data set, a collection of over 500 language samples, some drawn from existing corpora or literature, is encoded in both machine-readable and human-readable formats.
Readers can informally browse the samples in Markdown format from a page within the accompanying GitHub repository (see the GitHub archive for the specific links); or can view the samples along with analyses in essays included as part of the data set; or can download and compile the standalone data-set application (whose code base also includes a custom PDF viewer, document generator, and other tools). This data set comprises sample sentences (or other discourse units, including expressions and multi-sentence conversation turns) obtained from several sources related to cognitive and/or computational linguistics. Most of the sentences were chosen to illustrate theoretical claims or issues in the area of Cognitive Grammar. The data set is compiled into its final form by preprocessing several essays typeset as LaTeX documents where the samples are included as part of the essay text. The samples are presented as numbered examples in the essays in PDF form (these files are included in the data set) and separately stored as structured data in LaTeX auxiliary files. Those auxiliary files are then processed to construct data structures in a special format, which a custom application loads so that the samples can be browsed as standalone data (all of the code for the various processing steps is included in a GitHub repository linked from the data set). In addition to the language samples, the data set includes machine-readable representations of the documents where the samples are analyzed, including metadata marking discourse features such as sentence and paragraph boundaries. The accompanying code repository includes a custom PDF viewer equipped with procedures that extract this metadata (embedded in the PDF files) and use it for certain UI features, such as automatically copying a sentence via a context menu (without having to select the sentence as a character range).
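The context-menu feature described above depends on knowing which sentence contains the position the user clicked. The following is a minimal Python sketch of that lookup, not the repository's C++ implementation. It assumes the sentence metadata can be reduced to sorted, non-overlapping character ranges; the `SentenceSpan` type and the `sentence_at` function are hypothetical names introduced only for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SentenceSpan:
    """Hypothetical record for one sentence boundary entry."""
    start: int  # offset of the first character (inclusive)
    end: int    # offset one past the last character (exclusive)


def sentence_at(spans: Sequence[SentenceSpan], offset: int) -> Optional[SentenceSpan]:
    """Return the span containing `offset`, or None if it falls between sentences.

    Assumes `spans` is sorted by `start` and the spans do not overlap.
    """
    starts = [span.start for span in spans]
    # Index of the last span whose start is <= offset.
    index = bisect_right(starts, offset) - 1
    if index >= 0 and spans[index].start <= offset < spans[index].end:
        return spans[index]
    return None


# Made-up document text with two sentences and their boundaries.
text = "Most samples are short. Some span several turns."
spans = [SentenceSpan(0, 23), SentenceSpan(24, 48)]

hit = sentence_at(spans, 30)
if hit is not None:
    print(text[hit.start:hit.end])
```

With boundaries stored this way, a viewer can copy a whole sentence from a single click position instead of requiring the user to drag out a character range.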
The code also includes features for representing syntax and/or prosody in different formats found in Cognitive Linguistic or related literature, such as Dependency Grammar. As discussed in the accompanying published article, this code could potentially be used as part of a system for notating semantic frames, parse structures, prosody/intonation, and other descriptions and diagrammatic tools targeted at Cognitive Grammar. The authors share these examples and code libraries (much of the code is a work in progress) in the hope, in part, of demonstrating possible ideas for an emerging coding/software ecosystem supporting Cognitive Linguistics, and particularly Cognitive Grammar. One example of a tool which may be useful in this context is corpus-curation software, particularly for language examples introduced (as is common in pragmatics and cognitive linguistics) as case studies for linguistic analysis and/or graphical representation. Because many publications include numbered lists of language examples which are specifically chosen (often invented) to demonstrate theoretical claims or topics, such examples could provide a useful corpus were they to be merged into a common format and aggregated from multiple publications. The current code and data set include procedures which implement a pipeline providing this kind of functionality, and the pipeline could potentially be used as a basis or prototype for future, larger-scale corpora. The authors also hope that the techniques for constructing machine-readable publication texts and metadata, and leveraging this metadata via custom PDF viewers, might inspire similar efforts to publish more rigorously machine-readable versions of academic publications (without relying on imperfect NLP methods or "PDF scraping"). The code base provides a new document-encoding format inspired by (though syntactically different from) TagML, the "Text as Graph Markup Language", developed at Huygens ING in the Netherlands.
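The idea of merging numbered examples from several publications into one common format can be sketched briefly. The Python below is an illustrative sketch under stated assumptions, not the pipeline in the repository: the `Sample` record, the `merge_samples` and `to_markdown` functions, the source labels, and the example sentences are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical common format for one language example."""
    source: str  # label of the publication the example comes from
    number: int  # example number as printed in that publication
    text: str    # the example itself


def merge_samples(collections: Dict[str, List[Tuple[int, str]]]) -> List[Sample]:
    """Flatten per-publication example lists into one list sorted by source and number."""
    merged = [
        Sample(source, number, text.strip())
        for source, items in collections.items()
        for number, text in items
    ]
    return sorted(merged, key=lambda sample: (sample.source, sample.number))


def to_markdown(samples: List[Sample]) -> str:
    """Render the merged samples as Markdown, one heading per source."""
    lines: List[str] = []
    current = None
    for sample in samples:
        if sample.source != current:
            if lines:
                lines.append("")  # blank line before each later heading
            lines.extend([f"## {sample.source}", ""])
            current = sample.source
        lines.append(f"{sample.number}. {sample.text}")
    return "\n".join(lines)


# Made-up input: two placeholder publications with invented example sentences.
collections = {
    "essay-b": [(1, "The book is on the table. ")],
    "essay-a": [(2, "She baked him a cake."), (1, "He opened the door.")],
}
corpus = merge_samples(collections)
print(to_markdown(corpus))
```

Keeping the source label and the original example number on every record lets a merged corpus point back to the publication where each example was discussed.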
The new language (internally called "GTagML", for "grounded" TagML) is parsed via C++ and can be extended via C++ callbacks; the code base includes a series of such callbacks which are responsible for detecting features such as sentence boundaries and creating metadata files. The "GTagML" generator then outputs LaTeX code which includes additional LaTeX commands that yield further metadata via auxiliary files. Other C++ executables then read and merge all the metadata files. The resulting integrated metadata is output in several formats, including one which divides metadata entries based on the page where their corresponding document entities appear, yielding one file per document page. These files are zipped into a package along with a machine-readable encoding of the document text, and the final zipped file is embedded back in the PDF document files. The custom PDF viewer, which can be built from C++ source code, can then extract and unzip those files, reading the metadata to incorporate raw document text (bypassing the PDF text system) and metadata on sentence and paragraph boundaries (including PDF coordinates). Separately, the merged metadata is also compiled into the standalone data set read by the custom data-set application (which can also be compiled from C++ code included with the data set). To enable potential users to just browse the samples without having to compile any code, the data set is also generated in a Markdown form that can be browsed from the GitHub repository. The data-set code provides a multi-stage generation pipeline which covers both document-preparation and data-aggregation concerns. As mentioned previously, the explanatory essays included with the data set are produced in a machine-readable format as part of a workflow which also includes logic to parse language samples from within these documents (yielding the core data set of approximately 500 samples).
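The per-page packaging step described above can be illustrated with a short sketch. This is Python written for illustration only; the repository performs the step with C++ executables, and its actual file names and metadata fields are not given here. The `package_by_page` and `read_page` functions, the JSON layout, the `page-0001.json` naming, and the entry fields are assumptions.

```python
import io
import json
import zipfile
from collections import defaultdict
from typing import Dict, List


def package_by_page(entries: List[dict], document_text: str) -> bytes:
    """Group metadata entries by page and zip them with the document text.

    Each entry is assumed to carry a "page" key (hypothetical layout).
    Returns the zip archive as bytes, ready to be embedded elsewhere.
    """
    by_page: Dict[int, List[dict]] = defaultdict(list)
    for entry in entries:
        by_page[entry["page"]].append(entry)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("text.txt", document_text)
        for page in sorted(by_page):
            # One metadata file per document page.
            archive.writestr(f"page-{page:04d}.json", json.dumps(by_page[page]))
    return buffer.getvalue()


def read_page(package: bytes, page: int) -> List[dict]:
    """Load the metadata entries for a single page from the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return json.loads(archive.read(f"page-{page:04d}.json"))


# Made-up metadata: sentence boundaries with placeholder offsets.
entries = [
    {"page": 1, "kind": "sentence", "start": 0, "end": 23},
    {"page": 2, "kind": "sentence", "start": 0, "end": 17},
    {"page": 1, "kind": "sentence", "start": 24, "end": 48},
]
package = package_by_page(entries, "placeholder document text")
print(read_page(package, 1))
```

Splitting the metadata by page means a viewer only needs to parse the entries for the page currently displayed rather than the metadata for the whole document.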
Many steps and components are needed to finalize the documents and files output via this workflow, and the data-set code fully exposes all the software used for this process. Utility scripts (which are explained further in the GitHub repository) can be used to automate much of this workflow. This full set of tools is provided partly to generate the current data set but also to serve as a prototype, case study, or provisional code toward a multi-faceted "Cognitive Grammar" software/digital ecosystem. To get an overview of the data set, casual users can browse the samples directly in Markdown format, and if desired can read the essays where the samples are analyzed (the essays are included in PDF format as part of the data set and code repository). To view the data set within the custom data-set application, and/or to examine the special features of the custom PDF viewer (which is based on Glyph & Cog's XPDF implementation), users can clone the GitHub repository and branch where the code is currently hosted. The code depends on Qt (which is a popular and large-scale application-development framework that is free for non-commercial use). There are few additional dependencies, although the PDF components may require commonly used libraries related to typefaces and graphics (some such libraries are included in Qt distributions, but users may have to choose whether to use the Qt versions or their native system versions). With that said, a lot of the code is experimental and included for demonstration purposes. In other words, we want to illustrate the kind of code base which would be necessary to establish a "software ecosystem" for cognitive grammar (CG), including capabilities such as (1) typesetting CG publications with effective styles (e.g. LaTeX packages) for language examples, for explanatory diagrams (of the kind popularized by Langacker, for instance), for syntactic or semantic representations (e.g. parse graphs), for prosodic markup incorporating CG notations (as discussed in the accompanying published article), and so forth; (2) extracting language examples from such publications (and others suitably annotated) to build CG corpora; (3) creating machine-readable representations of publication text (including PDF coordinates for sentence boundaries and other discursively significant document locations); (4) developing a language for notating lexical, syntactic, morphological, prosodic, and/or discursive structures/patterns in language examples according to CG perspectives, along with parsers and data models for such a proposed CG language (analogous perhaps to GF in the Grammatical Framework, to cite an ambitious example); and (5) software facilitating CG research in conjunction with various methods of computational linguistics (e.g. corpus curation, NLP tools, speech transcription applications, etc.).
Created: 2023-11-14