A genome assembly was generated using a small pool (n=12) of male individuals from an inbred line of C. maculatus. PacBio sequences representing 32X genomic coverage, with an average read length of 9.0 Kbp, were assembled using FALCON and error-corrected by re-aligning both the PacBio (32X) and Illumina (125X) reads. The resulting assembly was 1.01 Gbp in total size, with an N50 of 149 Kbp and the longest contig spanning 2.1 Mbp.
For the genome annotation, the first round was run with the MAKER pipeline using two sources of evidence: i) proteins from the UniProt Swiss-Prot database; ii) transcripts from the ten genome-guided assemblies produced with StringTie and the de novo assembly produced with Trinity. This evidence-based gene build (rc1) contains 18551 gene models and 32349 predicted mRNAs.
Gene models obtained from the first round of annotation were then used to train the ab initio tools Augustus (v2.7), SNAP and GeneMark-ET (version 4.3).
We next performed an ab initio evidence-driven gene build. This round of annotation integrates the previously trained ab initio tools Augustus, SNAP and GeneMark-ET, as well as EVidenceModeler. The ab initio evidence-driven gene build (rc2) contains 20564 gene models and 34331 mRNAs.
Finally, we combined the ab initio evidence-driven gene build (rc2) with the evidence-based gene build (rc1), using rc2 as the reference build, to create the combined build named release candidate 3 (rc3). It contains 21264 gene models and 35160 mRNAs.
With the final gene build (rc3), we inferred putative functions for all coding mRNAs. To this end, we first predicted functional domains using InterProScan (v5.7-48) to retrieve functional information from InterPro, Pfam, GO, MetaCyc, UniPathway, KEGG and Reactome.
To assign protein and gene names to this dataset, we searched each of the predicted protein sequences against the UniProt Swiss-Prot reference data set using BLASTp (version 2.2.28+) with an e-value threshold of 1 x 10^-6.
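
The e-value filtering step described above can be sketched in Python as follows. This is only an illustration, not the pipeline actually used: it assumes the BLASTp results were written in the standard 12-column tabular format (`-outfmt 6`), and it assumes that the hit with the highest bit score is kept for each query, which the description does not state. The function name `best_hits` and the identifiers in the toy input are hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 1e-6  # threshold quoted in the description


def best_hits(handle, cutoff=EVALUE_CUTOFF):
    """Keep, per query, the highest-bit-score hit with e-value <= cutoff.

    Assumes BLAST tabular output (-outfmt 6): column 1 is the query ID,
    column 2 the subject ID, column 11 the e-value, column 12 the bit score.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit does not pass the e-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy input with made-up identifiers, for illustration only.
toy = (
    "mRNA1\tsp|X00001|FAKE1_TOY\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t250\n"
    "mRNA1\tsp|X00002|FAKE2_TOY\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t120\n"
    "mRNA2\tsp|X00003|FAKE3_TOY\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
)
hits = best_hits(io.StringIO(toy))
for query, (subject, evalue, bitscore) in hits.items():
    print(query, subject, evalue, bitscore)
```

In this toy input, the only hit for mRNA2 has an e-value of 0.5 and is discarded, so that sequence would receive no name from this step.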
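
For readers unfamiliar with the N50 statistic quoted for the assembly, the short sketch below shows the usual definition: the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly size. The function name `n50` and the contig lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Made-up contig lengths, for illustration only.
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```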