
Simulated short-read sequence dataset of Fictus yaponesiae

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14776092
Description:
Description of contents

After downloading yaponesiae.tar.gz, extract it with the following command:

    tar -zxvf yaponesiae.tar.gz

You will find a README document along with several directories. The extracted directory contains short-read DNA sequences and a variant dataset generated through simulation to test and validate population genomic data analysis pipelines. The generated datasets are stored in the data directory. Scripts used to generate the data are in the script directory.

Author: Naoki Osada (Hokkaido University, nosada@ist.hokudai.ac.jp)

Disclaimer

The data and materials provided are published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means you are free to share, adapt, and use the materials for any purpose, provided proper credit is given to the original creators. The authors or creators are not liable for any damages, misuse, or unintended consequences arising from their use.

Simulation

The dataset consists of whole-genome sequencing data from five populations of a hypothetical species, Fictus yaponesiae, which has a small genome of 23.5 Mbp. Variants were simulated using the forward population genetic simulation tool SLiM 3 (DOI: 10.1093/molbev/msy228). The SLiM parameter file can be found in script/yaponesiae_5pop_simulation.txt. Initially, the populations were labeled p1 to p5, but these were later renamed to represent geographic locations corresponding to cities in Japan:

p1: Sapporo (SP)
p2: Tokyo (TK)
p3: Fukuoka (FK)
p4: Osaka (OS)
p5: Sendai (SD)

The population history is illustrated in the figure img/simulation_setup.png. A single beneficial mutation was introduced very recently in population p5, and population p4 received a migration influx of 10% from p5 over two generations. The full simulation scenario is detailed in the figure.

Converting to nucleotide variant data and nucleotide sequences

The VCF files generated by SLiM (merged.vcf.gz) do not contain nucleotide variant data. The script script/convert2nuc.py assigns nucleotide sequences to the VCF file. A transition-to-transversion ratio of 2 was assumed. The genome sequence is based on the 2L chromosome sequence of Drosophila melanogaster (script/fasta/dMel2L.fasta). Note that the input and output VCF files are correctly phased. Reconstructed FASTA files for all haplotypes were also generated using the same script. The haploid genome sequence of the first individual from p1 was selected as the reference genome for F. yaponesiae (stored in data/3/yapnesia_reference.fasta).

Simulating short reads from reconstructed FASTA file

Short-read sequences in FASTQ format were generated for each individual using the ART sequencing simulator (DOI: 10.1093/bioinformatics/btr708). The script art_batch.sh in the script directory was used for this process.

The simulated short reads were mapped to the reference genome using bwa, and SNVs were called following the GATK 4.0 best practice pipeline. A VCF file containing all samples is found in data/6/yaponesia.vcf.gz.

A subset of the FASTQ sequences can be found in data/2/.

ddRAD-seq data

ddRAD-seq data were generated using the script script/simrad.sh. The EcoRI-MseI digested ddRAD reads were simulated with ddRADseqTools (DOI: 10.1111/1755-0998.12550), and sequencing errors were introduced using simNGS (https://www.ebi.ac.uk/goldman-srv/simNGS). Detailed parameter settings are specified in script/simrad.sh. The ddRAD-seq data for all individuals are stored in data/11/rawdata/.
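
The nucleotide-assignment step described under "Converting to nucleotide variant data and nucleotide sequences" can be illustrated with a minimal Python sketch. It is not the code in script/convert2nuc.py; the function names pick_alt and assign_alleles are hypothetical. It also assumes that the stated value of 2 means transitions are expected to be twice as frequent as transversions in total, so a transition is drawn with probability 2/3. If the value was meant as a per-substitution rate ratio, the probability would differ.

```python
import random

# Transition partner of each base (purine <-> purine, pyrimidine <-> pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
BASES = "ACGT"


def pick_alt(ref, titv=2.0, rng=random):
    """Draw an alternate allele for `ref`.

    Assumption: `titv` is the expected count ratio of transitions to
    transversions, so P(transition) = titv / (titv + 1).
    """
    ref = ref.upper()
    if ref not in TRANSITION:
        # Ambiguous bases such as N are not handled in this sketch.
        raise ValueError(f"unsupported reference base: {ref!r}")
    transition = TRANSITION[ref]
    transversions = [b for b in BASES if b not in (ref, transition)]
    if rng.random() < titv / (titv + 1.0):
        return transition
    return rng.choice(transversions)


def assign_alleles(reference, positions, titv=2.0, seed=0):
    """Return (pos, ref, alt) tuples for 1-based variant positions."""
    rng = random.Random(seed)
    records = []
    for pos in positions:
        ref = reference[pos - 1].upper()
        records.append((pos, ref, pick_alt(ref, titv, rng)))
    return records
```

Calling assign_alleles(reference_sequence, positions) with the reference string and the 1-based positions taken from a SLiM VCF would give one REF/ALT pair per site; a fixed seed keeps the assignment reproducible.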
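
The double-digest idea behind the ddRAD-seq data can be sketched in a few lines of Python. This is not how ddRADseqTools is invoked and does not reproduce the settings in script/simrad.sh. It assumes the enzyme pair is EcoRI (G^AATTC) and MseI (T^TAA), looks only at the forward strand, and keeps fragments whose two ends were cut by different enzymes. The names ENZYMES, cut_sites, and double_digest are hypothetical.

```python
import re

# Recognition site and cut offset (the cut falls after `offset` bases of the site).
# Assumed enzyme pair: EcoRI cuts G^AATTC, MseI cuts T^TAA.
ENZYMES = {"EcoRI": ("GAATTC", 1), "MseI": ("TTAA", 1)}


def cut_sites(seq, site, offset):
    """Return cut positions for one enzyme (lookahead finds overlapping sites)."""
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, min_len=0, max_len=None):
    """Return (start, end, fragment) for fragments flanked by two different enzymes."""
    seq = seq.upper()
    cuts = []
    for name, (site, offset) in ENZYMES.items():
        cuts += [(pos, name) for pos in cut_sites(seq, site, offset)]
    cuts.sort()
    fragments = []
    # Adjacent cut positions delimit the fragments of a complete digest.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        length = end - start
        if left == right:
            continue  # both ends cut by the same enzyme: not a ddRAD fragment
        if length < min_len or (max_len is not None and length > max_len):
            continue  # outside the size-selection window
        fragments.append((start, end, seq[start:end]))
    return fragments
```

The min_len and max_len arguments stand in for the size-selection step of a ddRAD protocol; the values actually used for this dataset are those given in script/simrad.sh.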
Created:
2025-03-18