TY - JOUR
T1 - Literature classification for semi-automated updating of biological knowledgebases
AU - Olsen, Lars Rønn
AU - Kudahl, Ulrich Johan
AU - Winther, Ole
AU - Brusic, Vladimir
N1 - Publisher Copyright:
© 2013 Olsen et al.
PY - 2013
Y1 - 2013
N2 - Background: As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.
AB - Background: As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.
UR - http://www.scopus.com/inward/record.url?scp=84904206044&partnerID=8YFLogxK
U2 - 10.1186/1471-2164-14-S5-S14
DO - 10.1186/1471-2164-14-S5-S14
M3 - Article
C2 - 24564403
AN - SCOPUS:84904206044
SN - 1471-2164
VL - 14
JO - BMC Genomics
JF - BMC Genomics
M1 - S14
ER -