aviaefrat/cryptonite
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/aviaefrat/cryptonite
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: null
pretty_name: Cryptonite
dataset_info:
- config_name: default
  features:
  - name: agent_info
    sequence:
    - name: Bottomline
      dtype: string
    - name: Role
      dtype: string
    - name: Target
      dtype: float32
  - name: agent_turn
    sequence: int32
  - name: dialogue_acts
    sequence:
    - name: intent
      dtype: string
    - name: price
      dtype: float32
  - name: utterance
    sequence: string
  - name: items
    sequence:
    - name: Category
      dtype: string
    - name: Images
      dtype: string
    - name: Price
      dtype: float32
    - name: Description
      dtype: string
    - name: Title
      dtype: string
  splits:
  - name: train
    num_bytes: 8538836
    num_examples: 5247
  - name: test
    num_bytes: 1353933
    num_examples: 838
  - name: validation
    num_bytes: 966032
    num_examples: 597
  download_size: 25373618
  dataset_size: 10858801
- config_name: cryptonite
  features:
  - name: clue
    dtype: string
  - name: answer
    dtype: string
  - name: enumeration
    dtype: string
  - name: publisher
    dtype: string
  - name: date
    dtype: int64
  - name: quick
    dtype: bool
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 52228597
    num_examples: 470804
  - name: validation
    num_bytes: 2901768
    num_examples: 26156
  - name: test
    num_bytes: 2908275
    num_examples: 26157
  download_size: 21615952
  dataset_size: 58038640
config_names:
- cryptonite
- default
---
# Dataset Card for Cryptonite
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Github](https://github.com/aviaefrat/cryptonite)
- **Repository:** [Github](https://github.com/aviaefrat/cryptonite)
- **Paper:** [Arxiv](https://arxiv.org/pdf/2103.01242.pdf)
- **Leaderboard:**
- **Point of Contact:** [Twitter](https://twitter.com/AviaEfrat)
### Dataset Summary
Current NLP datasets targeting ambiguity can be solved by a native speaker with relative ease. We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, whose solving requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge. Cryptic clues pose a challenge even for experienced solvers, though top-tier experts can solve them with almost 100% accuracy. Cryptonite is a challenging task for current models; fine-tuning T5-Large on 470k cryptic clues achieves only 7.6% accuracy, on par with the accuracy of a rule-based clue solver (8.6%).
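The accuracy figures above can be read as the share of clues whose predicted answer matches the gold answer. The sketch below is one way to compute such a score, assuming exact string match after trimming whitespace and lowercasing; the function name `exact_match_accuracy` and the sample predictions are hypothetical and are not taken from the Cryptonite codebase.

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions equal to the gold answer.

    Assumption: a prediction counts as correct only if it equals the
    gold answer after stripping whitespace and lowercasing.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


# Hypothetical model outputs scored against hypothetical gold answers.
print(exact_match_accuracy(["climb", "scale"], ["climb", "mount"]))  # 0.5
```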
### Languages
English
## Dataset Structure
### Data Instances
This is one example from the train set.
```python
{
    'clue': 'make progress socially in stated region (5)',
    'answer': 'climb',
    'date': 971654400000,
    'enumeration': '(5)',
    'id': 'Times-31523-6across',
    'publisher': 'Times',
    'quick': False
}
```
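The `date` value in this example is too large to be a timestamp in seconds, which suggests it is stored in milliseconds. A minimal sketch of converting it to a calendar date under that assumption (the helper name `publication_date` is hypothetical):

```python
from datetime import datetime, timezone

# The example record shown above, copied verbatim.
example = {
    "clue": "make progress socially in stated region (5)",
    "answer": "climb",
    "date": 971654400000,
    "enumeration": "(5)",
    "id": "Times-31523-6across",
    "publisher": "Times",
    "quick": False,
}


def publication_date(record):
    """Convert the `date` field to a UTC calendar date.

    Assumption: the value is milliseconds since the UNIX epoch.
    """
    return datetime.fromtimestamp(record["date"] / 1000, tz=timezone.utc).date()


print(publication_date(example))  # 2000-10-16
```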
### Data Fields
- `clue`: a string representing the clue provided for the crossword
- `answer`: a string representing the answer to the clue
- `enumeration`: a string representing the number of letters in the answer, in parentheses (e.g. `(5)` for `climb`)
- `publisher`: a string representing the publisher of the crossword
- `date`: an int64 representing the UNIX timestamp of the crossword's publication date (apparently in milliseconds, judging by the example above)
- `quick`: a bool representing whether the crossword is quick (a crossword aimed at beginners, easier to solve)
- `id`: a string to uniquely identify a given example in the dataset
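The fields above allow a simple consistency check: the letter counts in `enumeration` should agree with the length of `answer`. The sketch below is illustrative only. The helper names are hypothetical, and it assumes that multi-word enumerations list one number per word (for example `(3,4)`) and that answer words are separated by spaces or hyphens; neither convention is confirmed by this card.

```python
import re


def enumeration_lengths(enumeration):
    """Return the word lengths listed in an enumeration string, e.g. '(5)' -> [5]."""
    return [int(n) for n in re.findall(r"\d+", enumeration)]


def matches_enumeration(answer, enumeration):
    """Check that the answer's word lengths agree with the enumeration.

    Assumption: words in the answer are separated by spaces or hyphens.
    """
    words = [w for w in re.split(r"[\s-]+", answer.strip()) if w]
    return [len(w) for w in words] == enumeration_lengths(enumeration)


# Values taken from the example record in this card.
print(matches_enumeration("climb", "(5)"))  # True
```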
### Data Splits
Train (470,804 examples), validation (26,156 examples), test (26,157 examples).
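As a quick arithmetic check on the split sizes listed above, the snippet below totals the examples and computes each split's share. It uses only the counts stated in this card.

```python
# Split sizes as stated in this card.
splits = {"train": 470_804, "validation": 26_156, "test": 26_157}

total = sum(splits.values())
# Percentage of the whole dataset held by each split, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 523117
print(shares)  # {'train': 90.0, 'validation': 5.0, 'test': 5.0}
```

The counts correspond to roughly a 90/5/5 split.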
## Dataset Creation
### Curation Rationale
The clues come from crosswords published in The Times and The Telegraph.
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
Avia Efrat, Uri Shaham, Dan Kilman, Omer Levy
### Licensing Information
`cc-by-nc-4.0`
### Citation Information
```
@misc{efrat2021cryptonite,
      title={Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language},
      author={Avia Efrat and Uri Shaham and Dan Kilman and Omer Levy},
      year={2021},
      eprint={2103.01242},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@theo-m](https://github.com/theo-m) for adding this dataset.
Provider:
aviaefrat
Summary of original information
Dataset overview
Dataset name
- Name: Cryptonite
Dataset characteristics
- Language: English
- License: cc-by-nc-4.0
- Multilinguality: monolingual
- Size:
  - 100K<n<1M
  - 1K<n<10K
- Source datasets: original
- Task category: question answering
- Task ID: open-domain-qa
Dataset structure
- Config names: default and cryptonite
- Features:
  - default config:
    - agent_info: contains Bottomline (string), Role (string), Target (float)
    - agent_turn: integer
    - dialogue_acts: contains intent (string), price (float)
    - utterance: string
    - items: contains Category (string), Images (string), Price (float), Description (string), Title (string)
  - cryptonite config:
    - clue: string
    - answer: string
    - enumeration: string
    - publisher: string
    - date: integer (UNIX timestamp)
    - quick: boolean
    - id: string
- Data splits:
  - default config:
    - train: 5247 examples, 8538836 bytes
    - test: 838 examples, 1353933 bytes
    - validation: 597 examples, 966032 bytes
  - cryptonite config:
    - train: 470804 examples, 52228597 bytes
    - validation: 26156 examples, 2901768 bytes
    - test: 26157 examples, 2908275 bytes
Dataset creation
- Annotation creators: expert-generated
- Language creators: expert-generated
License information
- License: cc-by-nc-4.0
Citation information
@misc{efrat2021cryptonite,
      title={Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language},
      author={Avia Efrat and Uri Shaham and Dan Kilman and Omer Levy},
      year={2021},
      eprint={2103.01242},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}



