Whole genome sequencing provides comprehensive information on the entire genetic material of an organism. There are two approaches for assembling high-throughput sequencing reads into longer contiguous genomic sequences. The de novo approach is used for unknown genomes (no reference sequence available): sequenced reads are compared to each other, and overlapping reads are used to build longer contiguous sequences (contigs). Contigs are then oriented and ordered using long jumping distance (LJD) libraries. The reference-based approach maps each read to a reference genome sequence to identify genetic variation such as single nucleotide polymorphisms (SNPs), insertions and deletions (indels) and copy number variants, and to support genome-wide association studies (GWAS) and haplotype building.
Eurofins Genomics provides various combinations of sequencing platforms, such as Illumina HiSeq 2500, MiSeq, NextSeq 500 and PacBio, with various read lengths and library types (paired-end and mate-pair) for whole genome sequencing of humans, animals, plants and microorganisms such as bacteria, viruses and fungi. Eurofins Genomics offers libraries with wide jumping distances to tackle any genome size, from bacterial genomes to large and complex eukaryotic genomes. Each LJD fragment consists of two DNA fragments that originally lay 3 kbp or 8 kbp apart in the genome of interest. The long paired-end reads are used to determine the orientation and relative position of the contigs generated during the assembly of shotgun reads. These long jumping distance (LJD) libraries offer ultra-high-throughput and cost-efficient scaffolding of contigs.
Eurofins provides various genome assembly services as listed below:
- Bacterial/fungal de novo genome assembly
- PacBio bacterial/fungal de novo assembly
- De novo genome assembly up to 1 Gb
- Large genome de novo assembly (>1 Gb)
- Fungal hybrid de novo assembly and analysis
- Large genome hybrid de novo assembly and analysis
- Reference guided genome analysis
- PacBio reference guided analysis
Bioinformatics workflow for genome analysis:
Quality check of raw reads:
The raw reads will be subjected to quality filtering and adapter trimming using Trimmomatic. Primer sequences, poly(A) tails and reads produced from ribosomal DNA templates will be removed. Only the high-quality data will be used for downstream analysis.
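To illustrate what quality trimming does to a single read, here is a minimal Python sketch of sliding-window trimming. It is not Trimmomatic's implementation; the function names and the default window size, quality threshold and minimum length are illustrative assumptions, not the parameters used in the actual service.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def trim_sliding_window(seq, quals, window=4, min_mean_q=15, min_len=36):
    """Cut the read at the first window whose mean quality is below min_mean_q.

    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read is
    shorter than min_len and should be discarded.
    All default values are hypothetical, chosen only for illustration.
    """
    cut = len(seq)
    for start in range(0, len(seq) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < min_mean_q:
            cut = start
            break
    if cut < min_len:
        return None
    return seq[:cut], quals[:cut]


if __name__ == "__main__":
    read = "ACGTACGTAC"
    scores = [30] * 6 + [2] * 4
    print(trim_sliding_window(read, scores, min_len=4))
```

A read that falls below the minimum length after trimming is dropped entirely, which is why the function returns None in that case.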
De novo assembly:
The de novo assembly of high-quality reads will be carried out using one of the following assemblers: Velvet, SPAdes or SOAPdenovo. Assemblers are often sensitive to their input parameters, so we will perform multiple assembly runs with different k-mer lengths to optimize the assembly. The paired-end (PE) data will be assembled using various values of k-mer length, coverage cutoff, insert length, insert length standard deviation and expected coverage for scaffold assembly. The best assembly will be selected based on scaffold N50 and maximum scaffold length. The final assembly will then be evaluated based on scaffold N50, assembly coverage (depth), the number of reads that participated in the assembly, GC content, assembly completeness and accuracy.
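The selection criteria above can be made concrete with a short Python sketch. It computes N50 (the scaffold length at which the cumulative length of scaffolds, sorted from longest to shortest, first reaches half of the total assembly size) and GC content, and picks the best of several runs by N50 and then by maximum scaffold length. The function names and the run labels in the example are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def gc_percent(sequences):
    """Return the GC percentage over all sequences (0.0 if there are no bases)."""
    bases = sum(len(s) for s in sequences)
    if bases == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return 100.0 * gc / bases


def pick_best(assemblies):
    """Pick the run with the highest N50, breaking ties by longest scaffold.

    `assemblies` maps a run label (hypothetical, e.g. one per k-mer length)
    to the list of scaffold lengths produced by that run.
    """
    return max(assemblies, key=lambda name: (n50(assemblies[name]), max(assemblies[name])))


if __name__ == "__main__":
    runs = {"k31": [100, 50, 50], "k51": [120, 60, 20]}
    print(pick_best(runs), n50(runs["k51"]))
```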
Reference Based analysis
The reference genome and gene information (GFF3 file) are downloaded from public databases.
High-quality reads are aligned against the reference genome using BWA-MEM with optimized parameters.
Mapping is performed in the following two steps:
- Indexing of reference genome
- Aligning filtered reads to the reference index
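The two mapping steps correspond to two BWA invocations. The Python sketch below only builds the command lines so they can be inspected or passed to `subprocess.run`; the file names and the thread count are hypothetical, and the optimized parameters mentioned above are not reproduced here.

```python
import shlex


def bwa_commands(reference, reads_1, reads_2, threads=4):
    """Build the command lines for the two mapping steps.

    Step 1 indexes the reference; step 2 aligns paired-end reads with
    `bwa mem`, which writes SAM output to standard output.
    """
    index_cmd = ["bwa", "index", reference]
    align_cmd = ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]
    return index_cmd, align_cmd


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    index_cmd, align_cmd = bwa_commands("reference.fa", "sample_R1.fq.gz", "sample_R2.fq.gz")
    print(shlex.join(index_cmd))
    print(shlex.join(align_cmd))
```

When the commands are actually executed, the standard output of the second command would typically be redirected to a SAM file or piped into samtools for conversion to BAM.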
Gene prediction
Ab initio gene predictors are statistical models that are trained to find features of genes, such as start and stop codons and the coding sequences (CDS) of genes. The draft genome assembly will be used as input to the Prodigal or Augustus gene prediction program to predict the coding regions in the given sample.
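As a rough illustration of the signals such predictors look for, the following Python sketch scans the forward strand for open reading frames that begin with ATG and end at the first in-frame stop codon. This is a deliberately naive toy, not how Prodigal or Augustus work: real predictors use trained statistical models, both strands and many more features. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    Coordinates are 0-based with an exclusive end, and the ORF includes the
    stop codon. min_codons counts every codon from ATG through the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGCC"
    for begin, end in find_orfs(dna):
        print(begin, end, dna[begin:end])
```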
Annotation
The predicted coding regions will be annotated against the NCBI non-redundant protein database (Nr), Swiss-Prot, the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Clusters of Orthologous Groups (COG) databases using the Basic Local Alignment Search Tool (BLASTX) with an E-value cutoff of ≤ 1e-05.
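The E-value cutoff can be applied when parsing the search results. The Python sketch below assumes BLAST tabular output with the standard 12 columns (as produced by `-outfmt 6`), where column 11 is the E-value and column 12 the bit score, and keeps the highest-scoring hit per query that passes the cutoff. The function name and the identifiers in the example are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best passing hit.

    Assumes tab-separated lines in the standard 12-column BLAST tabular
    layout; blank lines and lines starting with '#' are skipped.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical hits, used only for illustration.
    hits = [
        "gene1\tP12345\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
        "gene1\tQ99999\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t250",
        "gene2\tP00001\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30",
    ]
    print(best_hits(hits))
```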
COG is a database that classifies gene products into different clusters of orthologous groups.
For biological pathway analysis, coding regions will be mapped to reference canonical pathways in KEGG. The coding genes are classified mainly under KEGG categories including Metabolism, Cellular processes, Genetic information processing and Environmental information processing. The output of the KEGG analysis includes KEGG Orthology (KO) assignments, the corresponding Enzyme Commission (EC) numbers and the metabolic pathways of the predicted coding genes, obtained using the KEGG Automatic Annotation Server (KAAS).
Gene Ontology (GO) annotations of the coding genes will be determined using Blast2GO. GO terms will be assigned to coding genes for functional categorization into three categories: biological process, molecular function and cellular component.
SNP Discovery
Putative SNPs are discovered from the alignment file in BAM format generated by BWA or Bowtie. A standard samtools/GATK pipeline with optimized parameters is used to call SNPs and indels.
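Variant callers typically report their results in VCF format. As a small illustration of how SNPs and indels can be told apart in such a file, the Python sketch below classifies each alternate allele by comparing the lengths of the REF and ALT fields (columns 4 and 5 of a VCF record). It does not call variants itself, and it treats symbolic or missing alleles as "OTHER"; the function names and the example records are hypothetical.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair as 'SNP', 'INDEL' or 'OTHER'."""
    if alt in (".", "*") or alt.startswith("<"):
        return "OTHER"  # missing, spanning-deletion or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions


def count_variants(vcf_lines):
    """Count variant types over VCF lines, one count per alternate allele."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            counts[classify_variant(ref, alt)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, used only for illustration.
    records = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",
        "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
        "chr1\t300\t.\tC\tCA,T\t50\tPASS\t.",
    ]
    print(dict(count_variants(records)))
```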
Comparative Genome Analysis /Synteny analysis
Comparative genomics will be carried out with closely related species. To generate pairwise alignments between the draft genome and other closely related species, BLASTP with an E-value < 1e-20 will be used. To detect synteny between genes in different species, we will use OrthoMCL. Core genes will be identified using closely related reference genomes.
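Once genes have been clustered into ortholog groups, core genes can be identified as the groups that have at least one member in every genome being compared. The Python sketch below assumes group lines of the form `group_id: taxon|gene taxon|gene ...`; this layout, the function names and the taxon labels are assumptions made for illustration and should be checked against the actual clustering output.

```python
def parse_groups(lines):
    """Parse 'group_id: taxon|gene ...' lines into {group_id: [(taxon, gene), ...]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        groups[group_id.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups


def core_groups(groups, genomes):
    """Return the sorted IDs of groups with a member in every listed genome."""
    wanted = set(genomes)
    return sorted(
        group_id
        for group_id, members in groups.items()
        if wanted <= {taxon for taxon, _ in members}
    )


if __name__ == "__main__":
    # Hypothetical ortholog groups for three genomes.
    lines = [
        "OG1: spA|a1 spB|b1 spC|c1",
        "OG2: spA|a2 spB|b2",
        "OG3: spA|a3 spA|a4 spB|b3 spC|c3",
    ]
    print(core_groups(parse_groups(lines), ["spA", "spB", "spC"]))
```

Groups that contain paralogs (more than one gene from the same genome, as in the third line of the example) still count as core here; a stricter single-copy definition would additionally require exactly one member per genome.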
Deliverables
De novo genome assembly
- Quality filtration of reads
- De novo assembly generating scaffolds/contigs
- Assembly statistics
- In silico validation of assembly using RNA-Seq data (for large complex plants)
- GC percentage
- Repeat identification
- Gene prediction
- Gene annotation
- GO analysis
- SSR discovery
- SNP/indel discovery (if more than one sample)
- Phylogenetic analysis
- KEGG pathway analysis
- Comparative genomics with closely related genomes
- Circos plot
- COG orthologous groups analysis using OrthoMCL
- Core gene analysis
- Comprehensive report with publication-standard methodology, graphs and tables
Reference guided genome analysis
- Quality filtration of PE reads
- Mapping of high quality reads to the reference genome
- Alignment summary (reads mapped, uniquely mapped reads, reads unmapped, genome coverage)
- Consensus sequence in FASTA format
- Gene prediction using GTF/GFF
- SNP/indel identification
- SNP/indel annotation
- Core gene analysis using OrthoMCL
- Comparative genomics
- Synteny plot
- Phylogenetic analysis
- Circos plot (genomic features)
- Comprehensive report with publication-standard methodology, graphs and tables