Sequence capture design targeting NLR genes of barley (Hordeum vulgare L.)
DataCite Commons, updated 2020-08-30, indexed 2024-07-27
Download link:
https://figshare.com/articles/Sequence_capture_design_targeting_NLR_genes_of_barley_Hordeum_vulgare_L_/5928766
Resource description:
Previously, Jupe et al. developed a motif-based approach for the identification of NB-LRR encoding genes (Jupe et al. 2012). We developed a similar motif set using the diversity of NB-LRRs in rice and <i>B. distachyon</i>. The NB-LRRome of rice is estimated to include 508 NB-LRRs (Li 2010), whereas differing estimates of the number of NB-LRR encoding genes have been reported for <i>B. distachyon</i>, including 212 (Li 2010) and 175 (Tan 2012). We generated MEME motifs from a randomized proportional sample of NB-LRRs from rice (N=35) and <i>B. distachyon</i> (N=17). The MEME motifs spanned the CC domain (motifs 4, 11, 13, and 15), the NBS domain (motifs 1, 2, 3, 5, 6, 7, 8, 10, 12, and 14), and the LRR domain (motifs 19, 9, 20, 16, 17, and 18). All identified motifs could clearly be associated with those previously defined by Meyers <i>et al.</i> (2003) Plant Cell, highlighting the generally higher conservation in the NB domain and in key residues of the CC and LRR domains. In contrast to the motifs identified by Meyers <i>et al.</i> for NB-LRRs of <i>Arabidopsis thaliana</i>, the MEME motifs had the advantage of being specifically tailored for monocot NB-LRRs. MAST significance thresholds of 1e-27 and 1e-20 identified all annotated NB-LRRs within <i>B. distachyon</i>, with precision of 49.8% and 47.5%, respectively, based on the NB-LRR annotation of Tan and Wu (2012). Our next step was to systematically identify all sequences related to NB-LRRs in the genome and transcriptome of barley. We identified and extracted the longest open reading frame (ORF) for every transcript variant in the eight leaf transcriptome assemblies. Translated ORFs were reduced to a minimal set based on string comparisons within each transcript group, which can include alternative splice variants. The translated ORFs were scanned using MAST to identify NB-LRR-containing sequences using an E-value threshold of 1e-20.
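The ORF extraction and reduction steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `longest_orf` and `minimal_set` are hypothetical, an ORF is assumed here to run from ATG to the first in-frame stop codon, and the "string comparison" is assumed to mean dropping exact duplicates and sequences fully contained in a longer one. The description does not specify any of these details.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return the longest ATG..stop ORF (stop codon excluded) over six frames.

    Assumption: ORFs without an in-frame stop codon are ignored.
    """
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i - start > len(best):
                        best = strand[start:i]
                    start = None
    return best


def minimal_set(sequences):
    """Drop exact duplicates and sequences contained in a longer kept one.

    Assumption: this substring test stands in for the unspecified
    string comparison used within a transcript group.
    """
    kept = []
    for s in sorted(set(sequences), key=lambda x: (-len(x), x)):
        if not any(s in other for other in kept):
            kept.append(s)
    return kept
```

In this sketch, `minimal_set` would be called once per transcript group on the translated ORFs, so that splice variants yielding identical or nested peptides contribute a single sequence to the MAST scan.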
The same approach was used to screen the full-length cDNAs (FLcDNAs) from Haruna Nijo, resulting in the identification of 139 FLcDNAs putatively encoding NB-LRRs from a total of 27,465 FLcDNAs. The current genome of barley is based on a whole genome shotgun (WGS) assembly (IBSGC 2012). The majority of NB-LRRs from the grasses contain introns and are often associated with sequence of reduced complexity (such as simple sequence repeats, retrotransposons, and low-complexity regions). Therefore, a relaxed approach was used to identify genomic contigs that contain fragments of, or entire, NB-LRR encoding genes. For every WGS contig in the assemblies of the sequenced cultivars Morex, Barke, and Bowman, all six reading frames were translated and the resulting ORFs were concatenated into a single peptide sequence for each of the forward and reverse strands. The only requirement for a translated ORF was a peptide length of at least 100 residues. Translated genomic contigs were scanned using FIMO, which assesses all twenty MEME-generated motifs independently. As NB-LRR genes are likely to be fragmented, we required that one of two conditions be met for inclusion in the sequence capture design: (1) at least one CC motif and two NBS motifs, or (2) at least two NBS motifs and one LRR motif, are present in the translated sequence strand. In parallel with the <i>de novo</i> identification of NB-LRRs, we included additional sequences because of their known relevance to disease resistance in barley. These included the <i>Mla</i> locus from Morex (Wei et al. (2002) Plant Cell), all cloned alleles of <i>Mla</i> (Seeholzer et al. 2009), the <i>Mlo</i> locus (Büschges 1997), the <i>Rpg1</i> locus (Brueggeman 2002), and the <i>rpg4</i>/<i>Rpg5</i> locus (Brueggeman 2008). All sequences described above were used as templates to design the capture assay, including the entire genomic context of contigs containing signatures of NB-LRR encoding genes.
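The two inclusion conditions for a translated contig strand can be written as a small predicate. This is a sketch under stated assumptions: the motif-to-domain assignment is taken from the description above, the function name `passes_inclusion` is hypothetical, and "at least two NBS motifs" is assumed to mean two distinct motif numbers rather than two hits of the same motif. The input is assumed to be the motif numbers reported by FIMO for one strand.

```python
# Motif-to-domain assignment as listed in the description.
CC_MOTIFS = {4, 11, 13, 15}
NBS_MOTIFS = {1, 2, 3, 5, 6, 7, 8, 10, 12, 14}
LRR_MOTIFS = {9, 16, 17, 18, 19, 20}


def passes_inclusion(motif_hits):
    """Apply the two inclusion conditions to one translated strand.

    motif_hits: iterable of motif numbers (1-20) found on the strand.
    Assumption: repeated hits of the same motif count once.
    """
    hits = set(motif_hits)
    n_cc = len(hits & CC_MOTIFS)
    n_nbs = len(hits & NBS_MOTIFS)
    n_lrr = len(hits & LRR_MOTIFS)
    condition_1 = n_cc >= 1 and n_nbs >= 2   # CC plus NBS fragment
    condition_2 = n_nbs >= 2 and n_lrr >= 1  # NBS plus LRR fragment
    return condition_1 or condition_2
```

Both conditions require two NBS motifs, so under this reading a strand carrying only CC and LRR motifs, or a single NBS motif, would not be selected.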
As several different genomes and transcriptomes were used, the input to the design contained extensive redundancy. To remove redundancy, we fragmented the input data set into 100 bp fragments with a scanning window of 25 bp and performed BLAST searches back against the entire data set. Any sequence with identity of 95% or higher was considered redundant. The first occurrence of the sequence was retained and the others were masked. The inclusion of extensive genomic sequence introduces repetitive sequence, which can produce competition in the sequence capture because of the high copy number of repetitive sequence in the barley genome. Therefore, two approaches were used to remove repetitive sequence from the sequence capture design. First, all loci were repeat masked using RepeatMasker (v4.0.5) with default and Triticeae-specific repeat databases. Second, because repeat databases are not complete, we applied genomic masking of the capture design. To do so, we fragmented the input data set into 100 bp fragments with a scanning window of 50 bp and performed BLAST searches against the Morex WGS assembly. A threshold of eight or fewer copies was selected to balance retaining copy number variation within NB-LRRs against including repetitive sequence. The final design of the tiling array includes 99,421 100-mer baits with 2.1x coverage.
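The fragmentation and copy-number filtering steps can be illustrated with a short sketch. Several points here are assumptions rather than details from the description: the "scanning window" is interpreted as the step between fragment start positions, only full-length 100 bp windows are emitted, and `copy_counts` is a hypothetical mapping from fragment start position to the number of BLAST hits in the reference assembly, which would be produced by a separate BLAST run not shown here.

```python
def tile(seq, size=100, step=25):
    """Return (start, fragment) pairs for full-length windows only.

    Assumption: a trailing remainder shorter than `size` is skipped.
    """
    return [(i, seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]


def low_copy(fragments, copy_counts, max_copies=8):
    """Keep fragments whose genome copy count is at most max_copies.

    copy_counts: hypothetical dict mapping fragment start -> hit count;
    fragments with no entry are treated as having zero copies.
    """
    return [
        (start, frag)
        for start, frag in fragments
        if copy_counts.get(start, 0) <= max_copies
    ]
```

With a step of 25 bp each base away from the sequence ends is covered by four fragments, and with a step of 50 bp by two, which is why the redundancy pass is finer-grained than the genomic masking pass under this interpretation.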
Provider:
figshare
Created:
2018-02-27



