
aviaefrat/cryptonite

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/aviaefrat/cryptonite
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: null
pretty_name: Cryptonite
dataset_info:
- config_name: default
  features:
  - name: agent_info
    sequence:
    - name: Bottomline
      dtype: string
    - name: Role
      dtype: string
    - name: Target
      dtype: float32
  - name: agent_turn
    sequence: int32
  - name: dialogue_acts
    sequence:
    - name: intent
      dtype: string
    - name: price
      dtype: float32
  - name: utterance
    sequence: string
  - name: items
    sequence:
    - name: Category
      dtype: string
    - name: Images
      dtype: string
    - name: Price
      dtype: float32
    - name: Description
      dtype: string
    - name: Title
      dtype: string
  splits:
  - name: train
    num_bytes: 8538836
    num_examples: 5247
  - name: test
    num_bytes: 1353933
    num_examples: 838
  - name: validation
    num_bytes: 966032
    num_examples: 597
  download_size: 25373618
  dataset_size: 10858801
- config_name: cryptonite
  features:
  - name: clue
    dtype: string
  - name: answer
    dtype: string
  - name: enumeration
    dtype: string
  - name: publisher
    dtype: string
  - name: date
    dtype: int64
  - name: quick
    dtype: bool
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 52228597
    num_examples: 470804
  - name: validation
    num_bytes: 2901768
    num_examples: 26156
  - name: test
    num_bytes: 2908275
    num_examples: 26157
  download_size: 21615952
  dataset_size: 58038640
config_names:
- cryptonite
- default
---

# Dataset Card for Cryptonite

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Github](https://github.com/aviaefrat/cryptonite)
- **Repository:** [Github](https://github.com/aviaefrat/cryptonite)
- **Paper:** [Arxiv](https://arxiv.org/pdf/2103.01242.pdf)
- **Leaderboard:**
- **Point of Contact:** [Twitter](https://twitter.com/AviaEfrat)

### Dataset Summary

Current NLP datasets targeting ambiguity can be solved by a native speaker with relative ease. We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, whose solving requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge. Cryptic clues pose a challenge even for experienced solvers, though top-tier experts can solve them with almost 100% accuracy. Cryptonite is a challenging task for current models; fine-tuning T5-Large on 470k cryptic clues achieves only 7.6% accuracy, on par with the accuracy of a rule-based clue solver (8.6%).

### Languages

English

## Dataset Structure

### Data Instances

This is one example from the train set.

```python
{
    'clue': 'make progress socially in stated region (5)',
    'answer': 'climb',
    'date': 971654400000,
    'enumeration': '(5)',
    'id': 'Times-31523-6across',
    'publisher': 'Times',
    'quick': False
}
```

### Data Fields

- `clue`: a string representing the clue provided for the crossword
- `answer`: a string representing the answer to the clue
- `enumeration`: a string representing the enumeration of the answer, i.e. its letter count as printed with the clue (`'(5)'` for the five-letter answer in the example above)
- `publisher`: a string representing the publisher of the crossword
- `date`: an int64 representing the UNIX timestamp of the date of publication of the crossword (the value in the example above appears to be expressed in milliseconds)
- `quick`: a bool representing whether the crossword is quick (a crossword aimed at beginners, easier to solve)
- `id`: a string to uniquely identify a given example in the dataset

### Data Splits

Train (470,804 examples), validation (26,156 examples), test (26,157 examples).

## Dataset Creation

### Curation Rationale

Crosswords from the Times and the Telegraph.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Avia Efrat, Uri Shaham, Dan Kilman, Omer Levy

### Licensing Information

`cc-by-nc-4.0`

### Citation Information

```
@misc{efrat2021cryptonite,
      title={Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language},
      author={Avia Efrat and Uri Shaham and Dan Kilman and Omer Levy},
      year={2021},
      eprint={2103.01242},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@theo-m](https://github.com/theo-m) for adding this dataset.
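
### Working with a record (illustrative sketch)

As a rough illustration of how the fields listed under Data Fields relate to one another, the sketch below wraps the train-set example in a small record type, checks the answer against its enumeration, and converts the `date` value to a calendar date. It is not part of the dataset's tooling: `CrypticClue` and the three helper functions are hypothetical names, the millisecond interpretation of `date` is an assumption based on the magnitude of the example value, and multi-word enumerations such as `(3,4)` are assumed to list one number per word.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CrypticClue:
    """Hypothetical container mirroring the fields of the `cryptonite` config."""
    clue: str
    answer: str
    enumeration: str
    publisher: str
    date: int  # assumed to be a UNIX timestamp in milliseconds
    quick: bool
    id: str


def enumeration_lengths(enumeration: str) -> list[int]:
    """Extract the word lengths from an enumeration such as '(5)' or '(3,4)'."""
    return [int(n) for n in re.findall(r"\d+", enumeration)]


def answer_matches_enumeration(example: CrypticClue) -> bool:
    """Check that the answer's letter count equals the enumerated total."""
    letters = sum(ch.isalpha() for ch in example.answer)
    return letters == sum(enumeration_lengths(example.enumeration))


def publication_date(example: CrypticClue) -> str:
    """Convert `date` to an ISO calendar date, assuming milliseconds since the epoch."""
    moment = datetime.fromtimestamp(example.date / 1000, tz=timezone.utc)
    return moment.date().isoformat()


example = CrypticClue(
    clue="make progress socially in stated region (5)",
    answer="climb",
    enumeration="(5)",
    publisher="Times",
    date=971654400000,
    quick=False,
    id="Times-31523-6across",
)

print(answer_matches_enumeration(example))  # True
print(publication_date(example))  # 2000-10-16
```

If the timestamps turn out to be in seconds rather than milliseconds, dropping the division by 1000 in `publication_date` is the only change needed.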
Provider:
aviaefrat
Summary of original information

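Dataset overview

Dataset name

  • Name: Cryptonite

Dataset characteristics

  • Language: English
  • License: cc-by-nc-4.0
  • Multilinguality: monolingual
  • Size:
    • 100K<n<1M
    • 1K<n<10K
  • Source datasets: original
  • Task category: question answering
  • Task ID: open-domain-qa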

Dataset structure

  • Config names: default and cryptonite
  • Features:
    • default config:
      • agent_info: sequence containing Bottomline (string), Role (string), Target (float32)
      • agent_turn: sequence of int32
      • dialogue_acts: sequence containing intent (string), price (float32)
      • utterance: sequence of string
      • items: sequence containing Category (string), Images (string), Price (float32), Description (string), Title (string)
    • cryptonite config:
      • clue: string
      • answer: string
      • enumeration: string
      • publisher: string
      • date: int64 (UNIX timestamp)
      • quick: bool
      • id: string
  • Data splits:
    • default config:
      • train: 5247 examples, 8538836 bytes
      • test: 838 examples, 1353933 bytes
      • validation: 597 examples, 966032 bytes
    • cryptonite config:
      • train: 470804 examples, 52228597 bytes
      • validation: 26156 examples, 2901768 bytes
      • test: 26157 examples, 2908275 bytes
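
The split figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of any official tooling: the numbers are copied from the card metadata (including its `dataset_size` values), while `SPLITS`, `DATASET_SIZE`, and `split_shares` are names introduced here purely for illustration.

```python
# Split statistics copied from the card metadata: (num_examples, num_bytes).
SPLITS = {
    "default": {
        "train": (5247, 8538836),
        "test": (838, 1353933),
        "validation": (597, 966032),
    },
    "cryptonite": {
        "train": (470804, 52228597),
        "validation": (26156, 2901768),
        "test": (26157, 2908275),
    },
}

# dataset_size values reported in the same metadata, in bytes.
DATASET_SIZE = {"default": 10858801, "cryptonite": 58038640}


def split_shares(config: str) -> dict[str, float]:
    """Return each split's share of the examples in the given config."""
    total = sum(n for n, _ in SPLITS[config].values())
    return {name: n / total for name, (n, _) in SPLITS[config].items()}


for config, splits in SPLITS.items():
    # The per-split byte counts should add up to the reported dataset_size.
    total_bytes = sum(b for _, b in splits.values())
    assert total_bytes == DATASET_SIZE[config]
    shares = ", ".join(f"{k}={v:.1%}" for k, v in split_shares(config).items())
    print(f"{config}: {shares}")
```

For the cryptonite config this works out to roughly a 90/5/5 train/validation/test division over 523,117 examples.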

Dataset creation

  • Annotation creators: expert-generated
  • Language creators: expert-generated

License information

  • License: cc-by-nc-4.0

Citation information

@misc{efrat2021cryptonite, title={Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language}, author={Avia Efrat and Uri Shaham and Dan Kilman and Omer Levy}, year={2021}, eprint={2103.01242}, archivePrefix={arXiv}, primaryClass={cs.CL} }
