The application of computer technology to biology dates back to the 1960s; it has progressed rapidly in the last decade and has given rise to the new field of bioinformatics. Bioinformatics combines elements of the biological sciences, biotechnology, computer science, and mathematics. Recent advances in biotechnology have enabled the measurement of biological systems on a massive scale. Newly developed methods and instrumentation, such as high-throughput sequencing and automation in genomics and proteomics, generate raw biological data at an explosive rate. In parallel with the growth of data, numerous computational tools for improved data analysis and management have emerged. These tools help extract relevant parts of the data (data reduction), establish correlations between different views of the data (correlation analysis), and convert the information into knowledge discoveries (data mining). In addition, recent research has expanded into data storage and data management, focusing on the structure of databases (data modeling), storage media (relational, flat file-based, XML, and others), and quality assurance of data. The knowledge-based era of modern biological research seeks to combine data management systems with sophisticated data analysis tools, thus defining some of the major current activities in bioinformatics.
Molecular biology data management systems usually take the form of publicly accessible biological databases. A database is designed to manage a large amount of persistent, homogeneous, and structured data that is shared among distributed users and processes (Bressan, 2002). When a dataset is organized in the form of a database, it must remain manageable and usable, supporting both data growth and an increase in the number of database queries. In bioinformatics, the development of databases has been driven by the explosive growth of data as well as increasing user access to these data. For example, the number of entries in Swiss-Prot (www.expasy.org), a major public protein database, and in the DNA Data Bank of Japan (www.ddbj.nic.ac.jp), a major web-accessible DNA databank, grew rapidly from 1999 to 2003. The number of accesses to Swiss-Prot has grown by approximately one million added connections per year (Tables 3.1 and 3.2). [Table presented].
The growth of biological data has resulted mainly from the large volume of nucleotide sequences generated by genome sequencing projects. The first viral genome, bacteriophage φX174, containing 5,386 base pairs (bp), was sequenced in 1978 (Sanger et al., 1978). More than a decade later, the first genome of a free-living organism, Haemophilus influenzae, containing 1.8 million bp, was sequenced (Fleischmann et al., 1995). The human genome of some 3.5 billion bp was published in 2001 (Lander et al., 2001), followed by the publication of the mouse genome a year later (Waterston et al., 2002). Today, more than 1,500 viral genomes, 110 bacterial and archaeal genomes, and 20 eukaryotic genomes have been sequenced. Because of alternative splicing of messenger RNA (Fields, 2001), it is estimated that some 30,000 human genes encode as many as ten times that number of proteins. Rapid accumulation of genomic sequences, followed by a mounting pool of protein sequences and three-dimensional (3D) structures, will continue to fuel the development of database technologies for managing these data. [Table presented].
Numerous databases have been created to store and manage nucleotide sequences and related views of the same data, such as 3D biological macromolecular structures, protein sequences, physical maps, and structural or functional domains. Among the most significant DNA databases are DDBJ, GenBank (www.ncbi.nlm.nih.gov/Genbank), and EMBL (www.ebi.ac.uk/Databases). Major protein databases are Swiss-Prot, TrEMBL (www.ebi.ac.uk/trembl), PIR (pir.georgetown.edu/pirwww/pirhome3.shtml), and the Protein Data Bank (PDB; www.pdb.org).
Each of these databases usually provides a single, specific view of the data. For example, PDB contains 3D biological macromolecular structures. However, researchers typically draw on diverse information from multiple databases to support the planning of experiments or the analysis and interpretation of results. The common practice of manually accessing and compiling extracted data from dissimilar views can be very costly and time-consuming. The concept of data warehousing, a convenient solution for managing different views of data and ensuring data interoperability, has recently been applied in bioinformatics. A biological data warehouse is a subject-oriented, integrated, non-volatile, expert-interpreted collection of data in support of biological data analyses and knowledge discovery (Schönbach et al., 2000). This definition suggests that a data warehouse is organized around a specific subject. The goal of constructing a data warehouse is to facilitate high-level analysis, summarization of information, and extraction of new knowledge hidden in the data. We refer to the databases that provide raw data for the data warehouse as data sources.
In this chapter, we introduce the basic concepts of data warehousing and discuss its role in improved data analysis and data management. We present several case studies and discuss the lessons learned. The chapter focuses on a) the nature of biological data and the problems frequently encountered in managing them, b) the transformation of data into knowledge using data warehousing, c) data warehousing principles and the basic architecture of a biological data warehouse, and d) data quality.
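The integration idea behind a data warehouse, combining several single-view data sources into one subject-oriented collection, can be sketched roughly as follows. This is only a minimal illustration in Python: the two sources, the field names (`protein_id`, `sequence`, `structure_id`), and all sample values are invented for the example and do not reflect the actual schemas of Swiss-Prot or PDB, nor the architecture presented in the chapter.

```python
from collections import defaultdict

# Hypothetical extracts from two data sources; every record here is made up.
SEQUENCE_SOURCE = [
    {"protein_id": "P00001", "sequence": "MKTAYIAK"},
    {"protein_id": "P00002", "sequence": "MSLNEQ"},
]
STRUCTURE_SOURCE = [
    {"protein_id": "P00001", "structure_id": "1ABC"},
    {"protein_id": "P00001", "structure_id": "2XYZ"},
]


def build_warehouse(sequence_records, structure_records):
    """Integrate two single-view sources into one entry per protein."""
    # Each protein gets its own fresh entry the first time it is seen.
    warehouse = defaultdict(lambda: {"sequence": None, "structures": []})
    for rec in sequence_records:
        warehouse[rec["protein_id"]]["sequence"] = rec["sequence"]
    for rec in structure_records:
        warehouse[rec["protein_id"]]["structures"].append(rec["structure_id"])
    return dict(warehouse)


warehouse = build_warehouse(SEQUENCE_SOURCE, STRUCTURE_SOURCE)
```

With the merged view in place, a question that would otherwise require visiting both sources by hand (for instance, which proteins have a sequence but no known structure) becomes a single pass over `warehouse`. A real biological data warehouse would also need identifier mapping, data cleaning, and expert annotation, none of which this sketch attempts.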