JOURNAL ARTICLE

Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi

Ester Feldmesser, Shilo Rosenwasser, Assaf Vardi, Shifra Ben-Dor
BMC Genomics 2014, 15: 148
24559402

BACKGROUND: The advent of Next Generation Sequencing technologies and corresponding bioinformatics tools allows the definition of transcriptomes in non-model organisms. Non-model organisms are of great ecological and biotechnological significance, and consequently the understanding of their unique metabolic pathways is essential. Several methods that integrate de novo assembly with genome-based assembly have been proposed. Yet, there are many open challenges in defining genes, particularly where genomes are not available or incomplete. Despite the large numbers of transcriptome assemblies that have been performed, quality control of the transcript building process, particularly on the protein level, is rarely performed if ever. To test and improve the quality of the automated transcriptome reconstruction, we used manually defined and curated genes, several of them experimentally validated.

RESULTS: Several approaches to transcript construction were utilized, based on the available data: a draft genome, high quality RNAseq reads, and ESTs. In order to maximize the contribution of the various data, we integrated methods including de novo and genome based assembly, as well as EST clustering. After each step a set of manually curated genes was used for quality assessment of the transcripts. The interplay between the automated pipeline and the quality control indicated which additional processes were required to improve the transcriptome reconstruction. We discovered that E. huxleyi has a very high percentage of non-canonical splice junctions, and relatively high rates of intron retention, which caused unique issues with the currently available tools. While individual tools missed genes and artificially joined overlapping transcripts, combining the results of several tools improved the completeness and quality considerably. The final collection, created from the integration of several quality control and improvement rounds, was compared to the manually defined set both on the DNA and protein levels, and resulted in an improvement of 20% versus any of the read-based approaches alone.

CONCLUSIONS: To the best of our knowledge, this is the first time that an automated transcript definition is subjected to quality control using manually defined and curated genes and thereafter the process is improved. We recommend using a set of manually curated genes to troubleshoot transcriptome reconstruction.

Full Text Links

Find Full Text Links for this Article

Discussion

You are not logged in. Sign Up or Log In to join the discussion.

Related Papers

Remove bar
Read by QxMD icon Read
24559402
×

Save your favorite articles in one place with a free QxMD account.

×

Search Tips

Use Boolean operators: AND/OR

diabetic AND foot
diabetes OR diabetic

Exclude a word using the 'minus' sign

Virchow -triad

Use Parentheses

water AND (cup OR glass)

Add an asterisk (*) at end of a word to include word stems

Neuro* will search for Neurology, Neuroscientist, Neurological, and so on

Use quotes to search for an exact phrase

"primary prevention of cancer"
(heart or cardiac or cardio*) AND arrest -"American Heart Association"