INSDC Host Organism Sequences

Name: INSDC Host Organism Sequences
Creator: GBIF
Published: 2024-12-07 00:00:00
License: No description available

Source: www.gbif.org. Updated 2024-12-07; indexed 2025-01-15.

Download link:

https://www.gbif.org/dataset/393b8c26-e4e0-4dd0-a218-93fc074ebf4e


Description:

This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically using the public ENA API (https://www.ebi.ac.uk/ena/portal/api/), following the methods described below. EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data was processed as follows:

1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number pair.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records with missing information were excluded. Only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same voucher were aggregated together.
6. Many of the remaining records were individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available), then excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

More information is available here: https://github.com/gbif/embl-adapter#readme

The mapping used to format the EMBL data as a Darwin Core Archive is available here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
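
The filtering rules in steps 2, 4 and 6 above can be illustrated with a minimal Python sketch. This is not the embl-adapter implementation; it assumes each record is a plain dictionary. The grouping keys are the field names listed in step 6, while `specimen_voucher` and all function names are hypothetical and chosen only for illustration. The sketch also ignores the CONTIG/WGS distinction from steps 2 and 3.

```python
from collections import defaultdict

# Threshold from step 6: groups with more than 50 records are excluded.
MAX_GROUP_SIZE = 50

# Grouping fields named in step 6 of the description.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)


def dedupe_by_sample(records):
    """Step 2 (simplified): keep one record per scientific name/sample accession.

    Records without a sample accession cannot be matched and are all kept.
    """
    seen = set()
    kept = []
    for record in records:
        accession = record.get("sample_accession")
        if accession:
            key = (record.get("scientific_name"), accession)
            if key in seen:
                continue
            seen.add(key)
        kept.append(record)
    return kept


def is_complete(record):
    """Step 4: keep records with a voucher, or with both a location and a date.

    'specimen_voucher' is an assumed field name, not taken from the source.
    """
    has_voucher = bool(record.get("specimen_voucher"))
    has_place_and_date = bool(record.get("location")) and bool(
        record.get("collection_date")
    )
    return has_voucher or has_place_and_date


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: group by the listed fields and exclude groups above max_size."""
    groups = defaultdict(list)
    for record in records:
        # Missing fields are treated as empty strings so they still group.
        key = tuple(record.get(field) or "" for field in GROUP_KEYS)
        groups[key].append(record)
    return [r for group in groups.values() if len(group) <= max_size for r in group]
```

Under these assumptions, a pipeline would apply the functions in order, for example `drop_large_groups([r for r in dedupe_by_sample(records) if is_complete(r)])`. The voucher aggregation in step 5 is left out because the description does not say how aggregated records are merged.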


Provider:

GBIF
