Database Mining Tools in the Human Genome Initiative

3. Genomics

Genomics is defined as the scientific discipline which focuses on the systematic investigation of genomes, i.e. the complete set of chromosomes and genes of an organism. Genomics consists of two component areas:

    1. Structural genomics.
    2. Functional genomics.

Structural genomics refers to the large-scale determination of DNA sequences and gene mapping. Functional genomics refers to the attachment of information concerning functional activity to existing structural knowledge about DNA sequences. As the determination of the DNA sequences comprising the human genome nears completion, the Human Genome Initiative is undergoing a paradigm shift from static structural genomics to dynamic functional genomics. The current section will focus on structural genomics. Genomics has been reviewed.(8,11,26,37-45)

3.1. Genome Databases

As described in the survey of Pearson and Soll,(1) genome databases are used for the storage and analysis of genetic and physical maps. Chromosome genetic linkage maps represent distances between markers based on meiotic recombination frequencies. Chromosome physical maps represent distances between markers based on numbers of nucleotides.
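The two kinds of map distance can be contrasted with a short sketch. It assumes Haldane's mapping function, one common way of converting a recombination fraction into a genetic distance; the source does not name a particular mapping function, and the function names below are illustrative only.

```python
import math


def haldane_map_distance_cm(recombination_fraction):
    """Convert a recombination fraction r (0 <= r < 0.5) to centimorgans.

    Assumption: Haldane's mapping function, d = -0.5 * ln(1 - 2r) Morgans,
    which treats crossovers as independent events (no interference).
    """
    r = recombination_fraction
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def physical_distance_bp(position_a, position_b):
    """Physical map distance: the number of nucleotides between two markers."""
    return abs(position_b - position_a)


if __name__ == "__main__":
    # Hypothetical markers: 10% recombination, 1,000,000 nucleotides apart.
    print(round(haldane_map_distance_cm(0.10), 2), "cM")
    print(physical_distance_bp(2_500_000, 3_500_000), "bp")
```

The genetic distance depends on observed recombination and the physical distance on nucleotide coordinates, so the two need not be proportional along a chromosome.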

Genome databases should define four data types:

    1. Sequence.
    2. Physical.
    3. Genetic.
    4. Bibliographic.

Sequence data should include annotated molecular sequences.

Physical data should include eight data fields:

    1. Sequence-tagged sites.
    2. Coding regions.
    3. Noncoding regions.
    4. Control regions.
    5. Telomeres.
    6. Centromeres.
    7. Repeats.
    8. Metaphase chromosome bands.

Genetic data should include seven data fields:

    1. Locus name.
    2. Location.
    3. Recombination distance.
    4. Polymorphisms.
    5. Breakpoints.
    6. Rearrangements.
    7. Disease association.

Bibliographic references should cite primary scientific and medical literature.
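The four data types and their fields can be pictured as a record structure. The sketch below is a hypothetical layout, not the schema of any database named in this paper; all class and field names are assumptions chosen to mirror the lists above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PhysicalFeature:
    # kind is one of the eight physical data fields, e.g. "telomere".
    kind: str
    start: int  # physical map coordinate, in nucleotides
    end: int


@dataclass
class GeneticLocus:
    # The seven genetic data fields listed above.
    locus_name: str
    location: str
    recombination_distance: Optional[float] = None
    polymorphisms: List[str] = field(default_factory=list)
    breakpoints: List[str] = field(default_factory=list)
    rearrangements: List[str] = field(default_factory=list)
    disease_association: List[str] = field(default_factory=list)


@dataclass
class GenomeRecord:
    sequence: str    # molecular sequence
    annotation: str  # free-text annotation of the sequence
    physical: List[PhysicalFeature] = field(default_factory=list)
    genetic: List[GeneticLocus] = field(default_factory=list)
    references: List[str] = field(default_factory=list)  # bibliographic data

    def features_of_kind(self, kind):
        """Return the physical features of one kind, e.g. all repeats."""
        return [f for f in self.physical if f.kind == kind]


if __name__ == "__main__":
    record = GenomeRecord(sequence="ACGTACGT", annotation="hypothetical example")
    record.physical.append(PhysicalFeature("coding region", 2, 6))
    print(record.features_of_kind("coding region"))
```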

Genome databases are classified into four categories based on their contents:

    1. Molecular.
    2. Genetic.
    3. Organism.
    4. Bibliographic.

Molecular databases include four representative implementations:

    1. European Molecular Biology Laboratory Nucleotide Sequence Data Library (EMBL).(46) http://www.embl-heidelberg.de/
    2. DNA Database of Japan (DDBJ).(47) http://www.ddbj.nig.ac.jp/
    3. Genbank.(16) http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html
    4. Swiss-Prot.(48) http://www.expasy.ch/sprot/sprot-top.html

Genetic databases include two representative implementations:

    1. Genome Database (GDB).(49) http://gdbwww.gdb.org
    2. Online Mendelian Inheritance in Man (OMIM).(50) http://www3.ncbi.nlm.nih.gov/Omim/

Organism databases include three representative implementations:

    1. Bacterium Escherichia coli.(51)
    2. Mouse Mus musculus.(52)
    3. Mustard plant Arabidopsis thaliana.(53)

Bibliographic databases include four representative implementations:

    1. Biological Abstracts.
    2. CancerLit.
    3. Excerpta Medica (Embase).
    4. Medline.

Genome databases have been reviewed.(9,24,54-63)

3.2. Genome Database Mining

Genome database mining is an emerging technology. The process of genome database mining is referred to as computational genome annotation. Computational genome annotation is defined as the process of documenting an uncharacterized DNA sequence by recording the location along the sequence of every gene involved in genome functionality. Computational genome annotation consists of two sequential processes:

    1. Structural annotation.
    2. Functional annotation.

Structural annotation refers to the identification of hypothetical genes termed open reading frames (ORFs) in a DNA sequence using computational gene discovery algorithms. Functional annotation refers to the assignment of functions to the predicted genes using sequence similarity searches against other genes of known function. Computational genome annotation has been reviewed.(64-66)
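As a minimal illustration of structural annotation, the sketch below scans the three forward reading frames of a DNA string for open reading frames that begin with ATG and end at a stop codon. It is a deliberately simplified assumption of how an ORF scan might look: it ignores the reverse strand and introns, and it is not the method of any gene discovery program listed in this paper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons is the minimum number of codons before the stop.
    Only the forward strand is scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"  # hypothetical input
    for start, end, frame in find_orfs(sequence):
        print(frame, sequence[start:end])
```

Functional annotation would then take each predicted ORF and compare it against sequences of known function.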

3.2.1. Computational Gene Discovery

Functionally-significant sites in DNA sequences have been studied and partially characterized using pattern recognition algorithms. DNA functional sites are sequences that are recognized and bound by specific proteins, e.g. promoter elements. Sequence recognition algorithms exhibit a performance tradeoff between sensitivity (ability to detect true positives) and selectivity (ability to exclude false positives): increasing one generally decreases the other. The identification of intron-exon boundaries and splice sites, where introns are removed from the RNA transcribed from genomic DNA, is of further importance. The ability to accurately predict introns would greatly facilitate the translation of genomic DNA into the amino acid sequence of the gene product.

The comparative analysis of DNA sequences is an important technique for detecting biologically-significant relationships. The DNA sequence of an unknown gene often exhibits structural homology with a known gene. Multiple sequence alignment is useful for analyzing sequence-structure relationships and for recognizing patterns or motifs common to a set of functionally-related DNA sequences, and it assists in structure prediction and molecular modeling.

Multiple sequence alignment algorithms use variations of the dynamic programming method. Dynamic programming methods use an explicit measure of alignment quality, consisting of defined costs for aligned pairs of residues and for residues aligned with gaps, together with an algorithm for finding an alignment with minimum total cost. Multiple sequence alignment has been reviewed.(67-68)
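The minimum-cost idea can be sketched for the simplest case of two sequences, which is the building block that multiple alignment methods vary. The cost values (0 for a match, 1 for a mismatch, 1 per gap) are illustrative assumptions, not values taken from the source.

```python
def min_cost_alignment(a, b, mismatch=1, gap=1):
    """Global pairwise alignment with minimum total cost.

    Returns (cost, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + pair,
                             cost[i - 1][j] + gap,
                             cost[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        pair = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = min_cost_alignment("ACGT", "AGT")
    print(total)
    print(row_a)
    print(row_b)
```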

Computational gene discovery algorithms include twenty-eight representative implementations:

    1. Aat.(69) http://genome.cs.mtu.edu/aat.html
    2. Banbury Cross. http://igs-server.cnrs-mrs.fr/igs/banbury/
    3. EcoParse.(70) (Not available on the World-Wide Web.)
    4. Fex.(71) http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
    5. Gap 3.(72) (Not available on the World-Wide Web.)
    6. GeneID.(73) http://apolo.imim.es/geneid.html
    7. GeneMark.(74) http://genemark.biology.gatech.edu/GeneMark/
    8. GeneModeler.(75) (Not available on the World-Wide Web.)
    9. GeneParser.(76) http://beagle.colorado.edu/~eesnyder/GeneParser.html
    10. GeneParser2.(77) (Not available on the World-Wide Web.)
    11. GeneParser3.(77) (Not available on the World-Wide Web.)
    12. Genie.(78) http://www.fruitfly.org/seq_tools/genie.html
    13. GenLang.(79) http://www.cbil.upenn.edu/genlang/genlang_home.html
    14. Genscan.(80) http://ccr-081.mit.edu/GENSCAN.html
    15. GenViewer.(81) http://www.itba.mi.cnr.it/webgene/
    16. Glimmer.(82) http://www.cs.jhu.edu/labs/compbio/glimmer.html#get
    17. Grail.(83) http://compbio.ornl.gov/gallery.html
    18. Grail 2.(84) http://compbio.ornl.gov/gallery.html
    19. Great.(85) (Not available on the World-Wide Web.)
    20. Hexon / Fgeneh.(86) http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
    21. Morgan.(87) http://www.cs.jhu/labs/compbio/morgan.html
    22. Mzef.(88) http://www.cshl.org/genefinder/
    23. ORFgene.(89) http://www.itba.mi.cnr.it/webgene/
    24. Procrustes.(90) http://www-hto.usc.edu/software/procrustes/index.html
    25. Sorfind.(91) http://www.rabbithutch.com
    26. Veil.(70) http://www.cs.jhu.edu/labs/compbio/veil.html
    27. Xgrail.(72) http://www.hgmp.embnet.org/Registered/Option/xgrail.html
    28. Xpound.(92) (Not available on the World-Wide Web.)

Computational gene discovery algorithms demonstrate limited performance accuracy in the prediction of eukaryotic genes.(93) Computational gene discovery algorithms have been reviewed.(93-106)

3.2.2. Sequence Similarity Searching

Sequence similarity searching is an important methodology in computational molecular biology. Initial clues to understanding the structure or function of a molecular sequence arise from homologies to other molecules that have been previously studied. Genome database searches reveal biologically-significant sequence relationships and suggest future investigation strategies. As described in the survey of Altschul et al.,(107) homology searching of molecular sequence databases is affected by five factors:

    1. Algorithms.
    2. Scoring systems.
    3. Alignment statistics.
    4. Database updates.
    5. Database sequence bias.

Algorithms. Database search algorithms are based upon measures of local sequence similarity. Algorithms must balance the competing factors of speed, hardware requirements and sensitivity to biological relationships.

Scoring systems. Alignments are ranked by scores whose calculation depends upon the particular scoring system used. The appropriate scoring system is largely dependent upon the problem under consideration.

Alignment statistics. Given a specific query, database search algorithms produce an ordered list of imperfectly-matched database similarities. An important question is defining the critical point of statistical significance.

Database updates. The use of a current comprehensive sequence database is essential to any similarity search.

Database sequence bias. There are biases in the molecules chosen to be included in molecular sequence databases.
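The question of statistical significance raised under alignment statistics is commonly addressed with the expected number of chance hits, E = K * m * n * exp(-lambda * S), for a local alignment score S, query length m and database length n. The sketch below assumes that formula; the values of K and lambda depend on the scoring system and sequence composition, and the numbers used here are placeholders rather than values from the source.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`.

    Assumes E = K * m * n * exp(-lambda * S). lam and k must be supplied
    for the scoring system in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    for s in (20, 40, 60):
        e = expected_chance_hits(s, query_len=300, db_len=10_000_000,
                                 lam=0.3, k=0.1)
        print(s, e)
```

A match is usually regarded as interesting when this expectation falls well below 1, and the same score becomes less significant as the database grows.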

Database search algorithms are used to compute pairwise comparisons between a candidate query sequence and each of the sequences stored within a database in order to find all the pairs of sequences that have a similarity above a defined threshold. There are three principal database search algorithms:

    1. Smith-Waterman algorithm.
    2. FASTA.
    3. BLAST.

The Smith-Waterman algorithm uses dynamic programming to compute the most sensitive pairwise similarity alignments. However, these optimal computations require quadratic time, proportional to the product of the lengths of the two sequences being compared.(108) The Smith-Waterman algorithm has been implemented. http://decypher2.stanford.edu/algo-sw/SW_nn.html-ssi
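A minimal sketch of the Smith-Waterman recurrence follows. It returns only the best local alignment score and uses a simple linear gap penalty; the match, mismatch and gap values are illustrative assumptions. The nested loops over both sequences make the quadratic running time visible.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, which lets an alignment
            # start anywhere in either sequence.
            current[j] = max(0,
                             previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                             previous[j] + gap,       # gap in b
                             current[j - 1] + gap)    # gap in a
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```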

The FASTA algorithm is an approximate heuristic algorithm used to compute suboptimal pairwise similarity comparisons. Dynamic programming is used to compute a series of subsequence alignments called hotspots which are combined to approximate a larger sequence alignment and global similarity score. Although its alignments, unlike those of the Smith-Waterman algorithm, are not guaranteed to be optimal, the FASTA algorithm executes more rapidly and thus offers a tradeoff between comparison accuracy and execution time.(20,109-110) The FASTA algorithm has been implemented. http://www-nbrf.georgetown.edu/pirwww/search/fasta.html
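The hotspot idea can be illustrated with a small sketch. It locates hotspots by exact lookup of short words of length k and counts them per diagonal (target offset minus query offset), a simplification that stands in for the hotspot-finding stage only; the rescoring and joining of hotspots into a larger alignment are omitted. Function names and the default word length are assumptions.

```python
from collections import defaultdict


def diagonal_hotspots(query, target, k=2):
    """Count shared words of length k on each diagonal (j - i)."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def best_diagonal(query, target, k=2):
    """Return (diagonal, hotspot_count) for the densest diagonal, or None."""
    counts = diagonal_hotspots(query, target, k)
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(best_diagonal("ACGTAC", "TTACGTACTT", k=3))
```

A diagonal carrying many hotspots marks a region where the two sequences are likely to align well, so the more expensive alignment step can be restricted to its neighbourhood.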

The BLAST (basic local alignment search tool) algorithm is another approximate heuristic algorithm used to compute suboptimal pairwise similarity comparisons. The BLAST algorithm refines the hotspot strategy by employing more stringent rules to locate fewer and better alignment hotspots. This strategy concentrates on finding regions of high local similarity in alignments without gaps, although alignments with some gaps can be created by chaining together several locally similar regions. Hotspot extensions are attempted into the surrounding regions. The BLAST algorithm improves on the similar FASTA algorithm by offering three advantages:

    1. More rapid execution time.
    2. Output includes a range of solutions.
    3. Each reported match is accompanied by an estimate of statistical significance.

Thus, the BLAST algorithm has become the dominant search engine for biological sequence databases.(20,111) The BLAST algorithm has been implemented. http://www.ncbi.nlm.nih.gov/BLAST/
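The hotspot extension step can be sketched as a seed-and-extend procedure without gaps. The code below is a simplified assumption of that idea, not the NCBI implementation: seeds are exact shared words, and extension in each direction stops once the running score falls a fixed amount (`drop`) below the best score seen. All parameter values are illustrative.

```python
def find_seeds(query, target, k):
    """Return (query_index, target_index) pairs where a k-letter word matches."""
    words = {}
    for i in range(len(query) - k + 1):
        words.setdefault(query[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(target) - k + 1)
            for i in words.get(target[j:j + k], [])]


def extend_seed(query, target, qi, tj, k, match=1, mismatch=-1, drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, query_end) of the extended segment.
    """
    def extend(q_pos, t_pos, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q_pos < len(query) and 0 <= t_pos < len(target):
            score += match if query[q_pos] == target[t_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # score has fallen too far below the best seen
            q_pos += step
            t_pos += step
        return best, best_len

    right_score, right_len = extend(qi + k, tj + k, 1)
    left_score, left_len = extend(qi - 1, tj - 1, -1)
    total = k * match + left_score + right_score
    return total, qi - left_len, qi + k + right_len


if __name__ == "__main__":
    q, t = "AACGTACC", "GACGTACT"  # hypothetical sequences
    score, start, end = extend_seed(q, t, 2, 2, k=3)
    print(score, q[start:end])
```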

Sequence similarity searching has been reviewed.(20,107,112-115)

4. Gene Expression

Gene expression analysis is defined as the use of quantitative messenger RNA (mRNA)-level measurements in order to characterize biological processes and elucidate the mechanisms of gene transcription. Its objective is the quantitative measurement of mRNA expression, particularly under the influence of drug or disease perturbations.
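One common way to express the effect of a perturbation on mRNA levels is the log2 ratio of treated to control measurements. The sketch below assumes that measure; the pseudocount, the threshold and the gene names are illustrative assumptions, not values from the source.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of treated to control expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))


def differentially_expressed(treated_levels, control_levels, min_abs_log2=1.0):
    """Return genes whose expression changes at least two-fold by default."""
    hits = {}
    for gene, treated in treated_levels.items():
        change = log2_fold_change(treated, control_levels[gene])
        if abs(change) >= min_abs_log2:
            hits[gene] = change
    return hits


if __name__ == "__main__":
    treated = {"geneA": 99, "geneB": 10}  # hypothetical mRNA levels
    control = {"geneA": 24, "geneB": 10}
    print(differentially_expressed(treated, control))
```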

As described in the survey of Carulli et al.,(116) the identification of differential gene expression associated with biological processes is a central research problem in molecular genetics. High throughput analysis of differential gene expression incorporates five technologies:

    1. Expressed sequence tags (ESTs).
    2. DNA microarrays.
    3. Subtractive cloning.
    4. Differential display.
    5. Serial analysis of gene expression (SAGE).

High throughput gene expression experiments are used for four purposes:

    1. Identification of novel genes.
    2. Identification of molecular markers for pathological processes.
    3. Identification of potential drug targets.
    4. Elucidation of molecular events associated with drug treatment in pharmacogenomics.

High throughput gene expression assays enable the simultaneous monitoring of thousands of genes in parallel and generate vast amounts of gene expression data. The large-scale investigation of gene expression attaches functional activity to structural genetic maps and therefore is an essential milestone in the paradigm shift from static structural genomics to dynamic functional genomics. High throughput gene expression technologies have been reviewed.(116-117)

4.1. Gene Expression Databases

Gene expression databases provide integrated data management and analysis systems for the transcriptional expression data generated by large-scale gene expression experiments. As described in the survey of Baldock and Davidson,(118) gene expression databases should include fourteen data fields:

    1. Gene expression assays.
    2. Database scope.
    3. Gene expression data.
      a. Gene name.
      b. Method or assay.
      c. Temporal information.
      d. Spatial information.
      e. Quantification.
      f. Gene products.
      g. User annotation of existing data.
      h. Linked entries.
      i. Links to other databases.
    4. Internet access.
    5. Internet submission.

Gene expression databases have not established defined standards for the collection, storage, retrieval and querying of gene expression data derived from libraries of gene expression experiments.

Human gene expression databases include eight representative implementations:

    1. Cellular Response Database.(119) http://LHI5.umbc.edu/crd
    2. dbEST.(120) http://www.ncbi.nlm.nih.gov/dbEST/index.html
    3. GeneCards.(121) http://bioinformatics.weizmann.ac.il/cards/
    4. Globin Gene Server.(122) http://globin.cse.psu.edu
    5. Human Developmental Anatomy. http://www.ana.ed.ac.uk/anatomy/database/humat/
    6. Kidney Development Database.(123) http://www.ana.ed.ac.uk/anatomy/database/kidbase//kidhome.html
    7. Merck Gene Index.(124) http://www.merck.com/mrl/merck_gene_index.2.html
    8. Tooth Gene Expression Database.(125) http://bite-it.helsinki.fi/

Gene expression databases have been reviewed.(9,118-119,126-132)

4.2. Gene Expression Database Mining

Gene expression database mining is an emerging technology. Gene expression database mining is used to identify intrinsic patterns and relationships in gene expression data. The identification of patterns in complex gene expression datasets provides two benefits:

    1. Generation of insight into gene transcription conditions.
    2. Characterization of multiple gene expression profiles in complex biological processes, e.g. pathological states.

As described in the survey of Bassett et al.,(131) gene expression data analysis uses two approaches:

    1. Hypothesis testing.
    2. Knowledge discovery.

Hypothesis testing investigates whether the induction or perturbation of a biological process leads to predicted results. Knowledge discovery detects internal structure in biological data. Knowledge discovery in gene expression data analysis employs two methodologies:

    1. Statistics, e.g. cluster analysis.
    2. Visualization.
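As an illustration of the statistical approach, the sketch below groups genes whose expression profiles rise and fall together. It assumes Pearson correlation as the similarity measure and a simple single-linkage merging rule with an arbitrary threshold; the source does not prescribe a particular clustering method, and the gene names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_profiles(profiles, threshold=0.9):
    """Single-linkage clustering of gene expression profiles.

    Two clusters are merged whenever any pair of their members
    correlates at or above the threshold.
    """
    clusters = [[name] for name in profiles]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(pearson(profiles[a], profiles[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    data = {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8.5],
        "geneC": [4, 3, 2, 1],
        "geneD": [8, 6, 4, 2.5],
    }
    print(cluster_profiles(data))
```

The resulting groups are the kind of structure that a visualization step would then display.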

Data visualization is used to display snapshots of cluster analysis results generated from large gene expression datasets. Gene expression database mining has been reviewed.(131,133-139)

AMITA Corporation, Copyright © 2000