TY - GEN
T1 - BioDArt - Catalogue of biological data artifact examples
AU - Veeramani, Anitha
AU - Gopalakrishnan, Kavitha
AU - Brusic, Vladimir
AU - Koh, Judice L.Y.
PY - 2006
Y1 - 2006
N2 - Information in biological data repositories continues to grow exponentially owing to the growing number of genomic and proteomic sequencing projects. As with any database, these repositories are subject to data quality issues related to correctness, uniformity, completeness, and redundancy, among others. Data cleaning is a prerequisite to prevent low-quality data from interfering with the accuracy of data mining and analysis. This in turn involves the detection and resolution of data artifacts (errors, discrepancies, redundancies, ambiguities, and incompleteness). Understanding the causes of data artifacts and systematically classifying them are critical to their elimination from molecular sequence databases. This paper highlights eight data artifacts found in public molecular databases. Example records from major molecular sequence databases that contain these artifacts are collected in the BioDArt catalogue (http://antigen.i2r.a-star.edu.sg/BioDArt).
AB - Information in biological data repositories continues to grow exponentially owing to the growing number of genomic and proteomic sequencing projects. As with any database, these repositories are subject to data quality issues related to correctness, uniformity, completeness, and redundancy, among others. Data cleaning is a prerequisite to prevent low-quality data from interfering with the accuracy of data mining and analysis. This in turn involves the detection and resolution of data artifacts (errors, discrepancies, redundancies, ambiguities, and incompleteness). Understanding the causes of data artifacts and systematically classifying them are critical to their elimination from molecular sequence databases. This paper highlights eight data artifacts found in public molecular databases. Example records from major molecular sequence databases that contain these artifacts are collected in the BioDArt catalogue (http://antigen.i2r.a-star.edu.sg/BioDArt).
KW - Data artifacts
KW - Data cleaning
KW - Data quality
UR - http://www.scopus.com/inward/record.url?scp=46249105561&partnerID=8YFLogxK
U2 - 10.1109/ICBPE.2006.348608
DO - 10.1109/ICBPE.2006.348608
M3 - Conference contribution
AN - SCOPUS:46249105561
SN - 8190426249
SN - 9788190426244
T3 - ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering
SP - 324
EP - 329
BT - ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering
T2 - ICBPE 2006 - 2006 International Conference on Biomedical and Pharmaceutical Engineering
Y2 - 11 December 2006 through 14 December 2006
ER -