
Dataset for training classifiers of comparative sentences

Mendeley Data, updated 2024-03-27, indexed 2024-06-27
Download link:
https://zenodo.org/record/3237552
Description:
As no large, publicly available cross-domain dataset for comparative argument mining existed, we created one. It is composed of sentences annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). BETTER sentences stand for a pro-argument in favor of the first compared object; WORSE sentences represent a con-argument and favor the second object.

We aimed to minimize domain-specific biases in the dataset in order to capture the nature of comparison rather than the nature of particular domains, and therefore decided to control the specificity of domains through the selection of comparison targets. We hypothesized, and could confirm in preliminary experiments, that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we used to select the pairs of compared objects.

The most specific domain we chose is computer science, with comparison targets such as programming languages, database products, and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from "List of" articles on Wikipedia. In the annotation process, annotators were asked to label sentences from this domain only if they had some basic knowledge of computer science.

The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from "List of" articles on Wikipedia.

The third domain is not restricted to any topic: random.
For each of 24 randomly selected seed words, 10 similar words were collected using the distributional similarity API of JoBimText (http://www.jobimtext.org). The seed words were created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale.

Especially for brands and computer science, the resulting object lists were large (4493 in brands and 1339 in computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined into pairs: for each object type (seed Wikipedia list page or seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. These approaches to selecting pairs of compared objects tend to minimize the inclusion of domain-specific data, but they do not solve the problem fully. We leave extending the dataset with diverse object pairs, including abstract concepts, for future work.

For sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus, which contains over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair. For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons while still admitting comparisons that do not contain any of the anticipated cues. This was necessary because random sampling would have yielded only a tiny fraction of comparisons.
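The pair construction and cue-word querying described above can be illustrated with a minimal Python sketch. It is not the original pipeline: the function names `build_pairs` and `matches_query` are hypothetical, the in-memory token match merely stands in for a query against the Common Crawl sentence index, and only single-word objects are handled. The cue-word list is copied from the description.

```python
import re
from itertools import combinations

# Cue words as listed in the dataset description.
CUE_WORDS = frozenset({
    "better", "easier", "faster", "nicer", "wiser", "cooler", "decent",
    "safer", "superior", "solid", "terrific", "worse", "harder", "slower",
    "poorly", "uglier", "poorer", "lousy", "nastier", "inferior", "mediocre",
})


def build_pairs(objects_by_type):
    """Create all unordered pairs within each object type, never across types.

    `objects_by_type` maps a seed (Wikipedia list page or seed word) to its
    cleaned object list. Hypothetical helper, for illustration only.
    """
    pairs = []
    for object_type, objects in objects_by_type.items():
        for first, second in combinations(sorted(set(objects)), 2):
            pairs.append((object_type, first, second))
    return pairs


def matches_query(sentence, first, second, require_cue=False):
    """Return True if the sentence mentions both objects (and, optionally, a cue word).

    Assumption: a lowercase word-level match is a stand-in for the real index
    query; multi-word objects are not handled in this sketch.
    """
    tokens = set(re.findall(r"\w+", sentence.lower()))
    if first.lower() not in tokens or second.lower() not in tokens:
        return False
    return not require_cue or bool(tokens & CUE_WORDS)


pairs = build_pairs({"pets": ["dog", "cat", "hamster"], "drinks": ["coffee", "wine"]})
example = "He's the best pet that you can get, better than a dog or cat."
hit = matches_query(example, "dog", "cat", require_cue=True)
```

Here `hit` is true although the sentence does not compare dog and cat with each other, which is why a query match only yields annotation candidates and not labels.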
Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: He's the best pet that you can get, better than a dog or cat.). It is thus especially crucial to enable a classifier to learn not to rely on the existence of cue words alone (which is very likely in a random sample of sentences with very few comparisons).

For our corpus, we kept pairs with at least 100 retrieved sentences. From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, the Figure Eight internal measure that combines annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes, and for over 85% at least 4 out of 5 workers agreed, which suggests that the collection procedure, designed for ease of annotation, was successful.

The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.

You can browse through the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing

A full description of the dataset is available in the workshop paper at the ACL 2019 conference. Please cite this paper if you use the data:

Franzek, Mirco, Alexander Panchenko, and Chris Biemann. "Categorization of Comparative Sentences for Argument Mining." arXiv preprint arXiv:1809.06152 (2018).
@inproceedings{franzek2018categorization,
  title={Categorization of Comparative Sentences for Argument Mining},
  author={Panchenko, Alexander and Bondarenko, and Franzek, Mirco and Hagen, Matthias and Biemann, Chris},
  booktitle={Proceedings of the 6th Workshop on Argument Mining at ACL'2019},
  year={2019},
  address={Florence, Italy}
}
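
The filtering and annotation steps described above (keeping pairs with at least 100 retrieved sentences, sampling candidates, discarding low-confidence annotations) can be sketched as follows. This is an illustration under stated assumptions, not the original code: all function names are hypothetical, and the confidence score is assumed to be a trust-weighted vote share, since the exact Figure Eight formula is not given in the description.

```python
import random
from collections import defaultdict

MIN_SENTENCES_PER_PAIR = 100  # threshold stated in the description
MIN_CONFIDENCE = 0.5          # annotations below 50% confidence were discarded


def keep_frequent_pairs(sentences_by_pair, minimum=MIN_SENTENCES_PER_PAIR):
    """Keep only object pairs with at least `minimum` retrieved sentences."""
    return {pair: sents for pair, sents in sentences_by_pair.items()
            if len(sents) >= minimum}


def sample_candidates(sentences, k, seed=0):
    """Randomly sample up to k annotation candidates (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def aggregate(judgments):
    """Combine (label, trust) judgments for one sentence.

    Returns (winning_label, confidence). Assumption: confidence is the trust
    mass behind the winning label divided by the total trust mass.
    """
    weight = defaultdict(float)
    for label, trust in judgments:
        weight[label] += trust
    total = sum(weight.values())
    if total <= 0:
        raise ValueError("judgments must carry positive total trust")
    label = max(weight, key=weight.get)
    return label, weight[label] / total


def is_reliable(judgments, threshold=MIN_CONFIDENCE):
    """True if the aggregated annotation reaches the confidence threshold."""
    return aggregate(judgments)[1] >= threshold


votes = [("BETTER", 1.0), ("BETTER", 1.0), ("BETTER", 1.0), ("NONE", 1.0), ("WORSE", 1.0)]
label, confidence = aggregate(votes)
```

With equal trust for all five workers, a 3-of-5 majority gives a confidence of 0.6 under this assumption, so the annotation would be kept.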

Created: 2023-06-28