Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays
Source: DataONE. Updated: 2026-01-05. Indexed: 2026-01-17.
Download link:
https://search.dataone.org/view/sha256:252561e3ec9dbd2d9b1e558a88904b466c3fcdc9524a4f580a494705dc06efec
Resource description:
The dataset supports the research article "Salience-simplification strategy to markedness of causal subordinators: The case of “because” and “since” in argumentative essays". It annotates features of 976 causal adverbial subordinations retrieved from student argumentative essays.
Data points were extracted from three corpora: all essays in NESSIE (Native English Speakers’ Similarly or Identically-prompted Essays, created by Xu Jiajin; 781 essays; 291,911 tokens), the argumentative essays in LOCNESS (the Louvain Corpus of Native English Essays, created by Granger; 323 essays; 230,138 tokens), and native-speaker argumentative essays from the Arts and Humanities disciplinary group of BAWE (British Academic Written English, created by Hilary Nesi; 512 essays; 1,360,932 tokens). In total, 1,616 essays comprising 1,882,981 tokens were examined.
The 976 data points are causal subordinations introduced by "because" or "since" in students' argumentative essays: all 488 "since" subordinations and 488 randomly selected "because" subordinations. For each data point, ten contextual features that are potential predictors of the choice between the causal subordinators "because" and "since" were annotated: "position", "separation", "embeddedness", "initial adverbials", "sub-clause", "de-ranking", "clause-length ratio", "hedging terms", "clausal relationship", and "bridging".
In all, fourteen variables, including the ten contextual features, are annotated:
(1) "No.": the ID of each data point (an identifier).
(2) "subordinator": the causal subordinator used (categorical; two values: "because" and "since").
(3) "position": the position of the adverbial clause relative to the main clause (categorical; two values: "preposed" or "postposed").
(4) "sep": whether a separating punctuation mark exists between the subordinate and main clauses (categorical; two values: "YES" or "NO").
(5) "embeddedness": whether the complex sentence is embedded in a larger complex sentence (categorical; two values: "YES" or "NO").
(6) "ini.adv": whether an initial adverbial exists in the causal subordination (categorical; two values: "YES" or "NO").
(7) "sub-clau": whether the causal subordinate clause contains sub-clauses of any type (categorical; two values: "YES" or "NO").
(8) "deranking": whether the predicate of the subordinate clause is complete (categorical; two values: "YES" or "NO").
(9) "sub.main.ratio": the length ratio of the subordinate and main clauses in terms of word count (numerical; converted to its natural logarithm (ln) for better interpretation).
(10) "hedging": whether a hedging term exists in the subordinate clause (categorical; two values: "YES" or "NO").
(11) "clau.rel": the interclausal relationship at the general level (categorical; two values: "direct" or "indirect").
(12) "spc.clau.rel2": the interclausal relationship at the secondary level (categorical; five values: "im", "rm", "asst", "inpr", and "sugg").
(13) "bridging": whether the subordinate clause contains any information referring back to the preceding clause (categorical; two values: "YES" or "NO").
(14) "source": the corpus each data point comes from (categorical; three values: "NESSIE", "LOCNESS", or "BAWE").
This dataset was constructed to explore the contextual features that discriminate between the causal subordinators "because" and "since" and to rank the effective features.
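The variable list above can be turned into a simple consistency check. The following Python sketch is illustrative only: it assumes the file's column names match the quoted variable names exactly, and that "sub.main.ratio" is the natural log of subordinate-clause words divided by main-clause words (the description does not state the direction of the ratio). The helper names `log_length_ratio` and `validate_row` and the example data point are hypothetical and are not taken from the dataset.

```python
import math

# Allowed levels for each categorical variable, as listed in the description.
# Assumption: the column names in the file match these strings exactly.
ALLOWED_LEVELS = {
    "subordinator": {"because", "since"},
    "position": {"preposed", "postposed"},
    "sep": {"YES", "NO"},
    "embeddedness": {"YES", "NO"},
    "ini.adv": {"YES", "NO"},
    "sub-clau": {"YES", "NO"},
    "deranking": {"YES", "NO"},
    "hedging": {"YES", "NO"},
    "clau.rel": {"direct", "indirect"},
    "spc.clau.rel2": {"im", "rm", "asst", "inpr", "sugg"},
    "bridging": {"YES", "NO"},
    "source": {"NESSIE", "LOCNESS", "BAWE"},
}


def log_length_ratio(sub_words: int, main_words: int) -> float:
    """Natural log of the subordinate/main clause length ratio (word counts).

    Assumption: the ratio is subordinate divided by main.
    """
    if sub_words <= 0 or main_words <= 0:
        raise ValueError("word counts must be positive")
    return math.log(sub_words / main_words)


def validate_row(row: dict) -> list:
    """Return a list of problems found in one data point (empty if none)."""
    problems = []
    for column, levels in ALLOWED_LEVELS.items():
        value = row.get(column)
        if value not in levels:
            problems.append(f"{column}: unexpected value {value!r}")
    try:
        float(row["sub.main.ratio"])
    except (KeyError, TypeError, ValueError):
        problems.append("sub.main.ratio: missing or not numeric")
    return problems


# Hypothetical data point, invented for illustration (not a row of the dataset).
example = {
    "No.": "1",
    "subordinator": "since",
    "position": "preposed",
    "sep": "YES",
    "embeddedness": "NO",
    "ini.adv": "NO",
    "sub-clau": "NO",
    "deranking": "NO",
    "sub.main.ratio": log_length_ratio(6, 12),
    "hedging": "NO",
    "clau.rel": "direct",
    "spc.clau.rel2": "im",
    "bridging": "YES",
    "source": "BAWE",
}

print(validate_row(example))
```

Under the assumed direction of the ratio, a negative "sub.main.ratio" means the subordinate clause is shorter than the main clause, zero means equal length, and a positive value means the subordinate clause is longer. This symmetry around zero is one reason a log-transformed ratio is easier to interpret than the raw ratio.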
Created:
2026-01-06



