
pinecone/core-2020-05-10-deduplication

Source: Hugging Face | Updated: 2022-10-28 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/pinecone/core-2020-05-10-deduplication
Resource description:
---
annotations_creators:
- unknown
language_creators:
- unknown
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- unknown
task_categories:
- other
task_ids:
- natural-language-inference
- semantic-similarity-scoring
- text-scoring
pretty_name: CORE Deduplication of Scholarly Documents
tags:
- deduplication
---

# Dataset Card for CORE Deduplication

## Dataset Description

- **Homepage:** [https://core.ac.uk/about/research-outputs](https://core.ac.uk/about/research-outputs)
- **Repository:** [https://core.ac.uk/datasets/core_2020-05-10_deduplication.zip](https://core.ac.uk/datasets/core_2020-05-10_deduplication.zip)
- **Paper:** [Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings](http://oro.open.ac.uk/id/eprint/70519)
- **Point of Contact:** [CORE Team](https://core.ac.uk/about#contact)
- **Size of downloaded dataset files:** 204 MB

### Dataset Summary

The CORE 2020 Deduplication dataset (https://core.ac.uk/documentation/dataset) contains 100K scholarly documents labeled as duplicates/non-duplicates.

### Languages

The dataset language is English (BCP-47 `en`).

### Citation Information

```
@inproceedings{dedup2020,
  title={Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings},
  author={Gyawali, Bikash and Anastasiou, Lucas and Knoth, Petr},
  booktitle = {Proceedings of 12th Language Resources and Evaluation Conference},
  month = may,
  year = 2020,
  publisher = {France European Language Resources Association},
  pages = {894-903}
}
```
Provider:
pinecone
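Summary of original information

Dataset overview

Basic information

  • Name: CORE Deduplication of Scholarly Documents
  • Language: English (en)
  • License: MIT
  • Multilinguality: monolingual
  • Size: 100K<n<1M
  • Task category: other
  • Task IDs:
    • natural-language-inference
    • semantic-similarity-scoring
    • text-scoring
  • Tags: deduplication

Dataset description

  • Summary: The CORE 2020 Deduplication dataset contains 100K scholarly documents labeled as duplicates/non-duplicates.
  • Download size: 204 MB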
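
To make the duplicate/non-duplicate labeling concrete, here is a minimal sketch of a near-duplicate check between two document texts. It is an illustrative baseline only, not the method of the cited paper (which uses locality sensitive hashing and word embeddings). The function names, the 3-word shingle size, and the 0.5 threshold are all assumptions chosen for illustration, and the sketch does not rely on the dataset's actual column names.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str,
                      threshold: float = 0.5, k: int = 3) -> bool:
    """Flag a pair as duplicates when shingle overlap reaches the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    against labeled pairs such as those this dataset provides.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

Labeled pairs like the ones in this dataset could then be used to measure how often such a heuristic agrees with the human duplicate/non-duplicate labels.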

