Application of a genetic algorithm to finish a genome sequence
SOUZA, Eliel. M.1; CRUZ, L. M.1; SOUZA, Emanuel. M.1; PEDROSA, F. O.1; MONTEIRO, R. A.1; WASSEM, R.2; STEFFENS, M. B. R.1; RIGO, L. U.1; CHUBATSU, L. S.1; RAITZ, R. T.3
Departamento de Bioquímica e Biologia Molecular, UFPR 1; Departamento de Genética, UFPR 2; Escola Técnica, UFPR 3
The goal of a genome sequencing project is to obtain the complete sequence of all the replicons of an organism. The Genome Program of Paraná (GENOPAR) is sequencing the genome of the nitrogen-fixing bacterium Herbaspirillum seropedicae. Approximately 120,000 shotgun reads were produced by a consortium of 13 laboratories located in the states of Paraná and Santa Catarina, Brazil. The reads were assembled with the PHRED/PHRAP/CONSED software package, which generated several long contigs but no single complete sequence. This can happen for several reasons, such as: a) the presence of repetitive sequences; b) regions not covered by any read; c) regions of poor-quality sequence. To complete the genome sequence it is therefore necessary to order the contigs into scaffolds and to close the gaps, using sequences already in the database together with newly obtained sequences that cover the physical discontinuities. We used a genetic algorithm to help identify the gaps between contigs automatically and to direct their sequencing, so that the genome sequence could be completed. The algorithm uses information from contig sequences, cosmid-end sequences and plasmid sequences in a guided search for links between contigs, producing a scaffold assembly of the virtual genome sequence. With this genetic algorithm we joined 234 contigs of the H. seropedicae genome sequence assembly into 30 scaffolds. For the same set of contigs, the AUTOFINISH program of the PHRED/PHRAP/CONSED package suggested 51 scaffolds. The complete genome of H. seropedicae Z78 is now finished.
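The abstract does not describe the encoding, operators or fitness function of the genetic algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each individual is an ordering of contigs, a "link" is an unordered pair of contigs assumed to be connected by cosmid-end or plasmid reads, and fitness is the number of adjacent contig pairs supported by a link. Contig orientation and link distances are ignored. All names (`fitness`, `split_into_scaffolds`, `order_crossover`, `mutate`, `evolve`) and the toy data are hypothetical and are not taken from the GENOPAR software.

```python
import random


def fitness(order, links):
    """Count adjacent contig pairs in `order` that are supported by a link."""
    return sum(1 for a, b in zip(order, order[1:]) if frozenset((a, b)) in links)


def split_into_scaffolds(order, links):
    """Cut the ordering wherever two neighbouring contigs share no link."""
    scaffolds = [[order[0]]]
    for a, b in zip(order, order[1:]):
        if frozenset((a, b)) in links:
            scaffolds[-1].append(b)
        else:
            scaffolds.append([b])
    return scaffolds


def order_crossover(p1, p2, rng):
    """Order crossover: keep a slice of p1 in place, fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    kept = p1[i:j]
    kept_set = set(kept)
    rest = [c for c in p2 if c not in kept_set]
    return rest[:i] + kept + rest[i:]


def mutate(order, rng, rate=0.2):
    """With probability `rate`, reverse a random segment of the ordering."""
    order = list(order)
    if rng.random() < rate:
        i, j = sorted(rng.sample(range(len(order) + 1), 2))
        order[i:j] = reversed(order[i:j])
    return order


def evolve(contigs, links, pop_size=60, generations=200, seed=0):
    """Evolve contig orderings; the top quarter survives unchanged each generation."""
    rng = random.Random(seed)
    population = [rng.sample(contigs, len(contigs)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda o: fitness(o, links), reverse=True)
        elite = population[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(order_crossover(p1, p2, rng), rng))
        population = elite + children
    return max(population, key=lambda o: fitness(o, links))


# Toy data (invented for illustration): six contigs and three clone-end links.
contigs = ["c1", "c2", "c3", "c4", "c5", "c6"]
links = {
    frozenset(("c1", "c2")),
    frozenset(("c2", "c3")),
    frozenset(("c4", "c5")),
}

best = evolve(contigs, links)
scaffolds = split_into_scaffolds(best, links)
```

In this simplified model every supported adjacency merges two scaffolds, so the number of scaffolds equals the number of contigs minus the fitness of the ordering. A realistic implementation would also have to handle contig orientation, conflicting links and repeat-induced false links, which this sketch does not attempt.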
Supported by CNPq/MCT, Fundo Paraná.