
health360/Healix-Shot

Source: Hugging Face. Updated: 2023-09-09. Indexed: 2024-06-15.
Download link:
https://hf-mirror.com/datasets/health360/Healix-Shot
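Summary:
Healix-Shot is a large-scale medical text dataset provided by Health 360. It contains 22 billion tokens drawn from academic papers, medical encyclopedias, medical Wikipedia, textbooks, and medical news. The data passes strict quality control, using methods such as "Textbooks Are All You Need" together with internal processing pipelines. The dataset suits a range of natural language processing tasks, including medical information retrieval, automatic summarization, question answering, and drug interaction prediction. It is released under the CC BY 4.0 license, and public use and contribution are encouraged.
Provider: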
health360
Summary of the original dataset card

Healix-Shot: Largest Medical Corpora by Health 360

Healix-Shot, presented by Health 360, is a medical text dataset hosted on Hugging Face. It contains 22 billion tokens and is intended as a comprehensive, high-quality corpus for medical NLP applications.

Importance:

  1. Comprehensive Knowledge: Covers a wide range of medical topics drawn from academic papers, medical encyclopedias, medical Wikipedia, textbooks, and medical news.
  2. Quality Assured: Uses techniques such as "Textbooks Are All You Need" together with internal processes to ensure high-quality data.
  3. Open-source Nature: Encourages community contribution and innovation in medical NLP.

Dataset Composition:

Resource          Tokens (Billions)   Description
Filtered peS2o    19.2                High-quality medical papers
Various Sources   2.8                 Medical Wikipedia, textbooks, medical news, etc.
Total             22.0
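
The composition figures above can be checked with a short calculation. The following is a minimal sketch that copies the two token counts from the listing, confirms that they sum to the stated total, and derives each resource's share of the corpus. The variable names are illustrative only and are not part of the dataset.

```python
# Token counts in billions, copied from the composition figures above.
composition = {
    "Filtered peS2o": 19.2,
    "Various Sources": 2.8,
}

# Sum of the listed resources, rounded to one decimal to match the listing.
total = round(sum(composition.values()), 1)

# Share of each resource as a percentage of the total, rounded to one decimal.
shares = {
    name: round(100 * tokens / total, 1)
    for name, tokens in composition.items()
}

for name, share in shares.items():
    print(f"{name}: {share}% of {total}B tokens")
```

By these figures, the filtered peS2o papers make up roughly seven eighths of the corpus, so downstream models trained on it will see mostly academic prose.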

Methods:

  1. Textbooks Are All You Need: Primary extraction and cleaning method emphasizing textbook knowledge.
  2. Internal Processing: Proprietary processes to ensure data purity and relevance.

Usage:

Healix-Shot is suitable for various NLP tasks including:

  • Medical information retrieval
  • Automatic summarization of medical articles
  • Medical question answering
  • Drug interaction prediction
  • And many more...
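
For any of the tasks above, the corpus first has to be read. The following is a minimal sketch, assuming the third-party Hugging Face `datasets` package is installed and that the repository exposes a split named "train" (an assumption, since the listing does not name its splits or record fields). It streams records rather than downloading all 22 billion tokens up front. The function name `stream_first_records` is hypothetical.

```python
from itertools import islice

# Repository ID as it appears in the listing's download link.
REPO_ID = "health360/Healix-Shot"


def stream_first_records(n=3, split="train"):
    """Return the first n records of the dataset without a full download.

    The split name "train" is an assumption; adjust it if the repository
    uses a different one.
    """
    # Imported lazily so this module loads even when `datasets` is absent.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of materializing the corpus.
    dataset = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(dataset, n))


# Example usage (requires network access and the `datasets` package):
#   for record in stream_first_records(2):
#       print(record.keys())  # field names are not documented in the listing
```

Streaming is chosen here because a corpus of this size is impractical to download just to inspect a few records. Inspecting the keys of the first record is a reasonable way to discover the actual field names before building a pipeline on them.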

Licensing:

This dataset is open-source under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
