health360/Healix-Shot
Source: Hugging Face | Updated: 2023-09-09 | Indexed: 2024-06-15
Download link:
https://hf-mirror.com/datasets/health360/Healix-Shot
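Overview:
Healix-Shot is a large-scale medical text dataset provided by Health 360. It contains 22 billion tokens drawn from academic papers, medical encyclopedias, medical Wikipedia, textbooks, and medical news. The dataset undergoes strict quality control, using methods such as "Textbooks Are All You Need" together with internal processing pipelines to ensure data quality. It is suited to a range of natural language processing tasks, such as medical information retrieval, automatic summarization, question answering, and drug interaction prediction. The dataset is released under the CC BY 4.0 license and encourages public use and contribution.
Provided by: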
health360
Summary of original information
Healix-Shot: Largest Medical Corpora by Health 360
Healix-Shot, presented by Health 360 and hosted on Hugging Face, is a 22-billion-token corpus of medical text. It is intended as a comprehensive, high-quality resource for medical NLP applications.
Importance:
- Comprehensive Knowledge: Covers a wide range of medical topics from academic papers, medical encyclopedias, and more.
- Quality Assured: Uses techniques such as "Textbooks Are All You Need" together with internal processes to ensure high-quality data.
- Open-source Nature: Encourages community contribution and innovation in medical NLP.
Dataset Composition:
| Resource | Tokens (Billions) | Description |
|---|---|---|
| Filtered peS2o | 19.2 | High-quality medical papers |
| Various Sources | 2.8 | Medical Wikipedia, textbooks, medical news, etc. |
| Total | 22.0 | |
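The composition figures can be checked with a few lines of arithmetic. The following is a minimal sketch that copies the token counts from the table above; the names `composition`, `stated_total`, and `shares` are illustrative and not part of the dataset.

```python
# Token counts in billions, copied from the composition table.
composition = {
    "Filtered peS2o": 19.2,
    "Various Sources": 2.8,
}
stated_total = 22.0  # the "Total" row of the table


def shares(parts):
    """Return each resource's fraction of the summed token count."""
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}


# The per-resource counts should add up to the stated total.
computed_total = sum(composition.values())
assert abs(computed_total - stated_total) < 1e-9

for name, share in shares(composition).items():
    print(f"{name}: {share:.1%}")
# Filtered peS2o: 87.3%
# Various Sources: 12.7%
```

On these numbers, roughly seven out of every eight tokens come from the filtered peS2o papers, so the corpus is dominated by academic text.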
Methods:
- "Textbooks Are All You Need": Primary extraction and cleaning method, emphasizing textbook-quality knowledge.
- Internal Processing: Proprietary processes to ensure data purity and relevance.
Usage:
Healix-Shot is suitable for various NLP tasks including:
- Medical information retrieval
- Automatic summarization of medical articles
- Medical question answering
- Drug interaction prediction
- And many more...
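For any of these tasks, a corpus of this size is usually read in streaming mode rather than downloaded in full. The sketch below uses the Hugging Face `datasets` library with the repository name from this listing. The split name `"train"` and the column name `"text"` are assumptions that the listing does not confirm, so check the dataset card before relying on them. The helpers `stream_corpus` and `preview` are hypothetical names introduced here for illustration.

```python
from itertools import islice

DATASET_ID = "health360/Healix-Shot"  # repository name from this listing


def stream_corpus(split="train"):
    """Return a lazily streamed iterable of records.

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported here so the helpers below work without the library installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(DATASET_ID, split=split, streaming=True)


def preview(records, n=3, text_field="text", width=200):
    """Return the first `width` characters of the first `n` records.

    `text_field` is a hypothetical column name; records that lack it
    yield an empty string.
    """
    snippets = []
    for record in islice(records, n):
        snippets.append(str(record.get(text_field, ""))[:width])
    return snippets
```

Calling `preview(stream_corpus())` would fetch only the first few records over the network. Because `preview` accepts any iterable of dictionaries, it can also be tried on a small local sample before touching the full corpus.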
Licensing:
This dataset is open-source under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.



