Literature classification for semi-automated updating of biological knowledgebases

Lars Rønn Olsen; Ulrich Johan Kudahl; Ole Winther; Vladimir Brusic

doi:10.1186/1471-2164-14-S5-S14

Research output: Journal Publication › Article › peer-review

9 Citations (Scopus)

Abstract

Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.
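
The classification step summarized in the abstract (a k-NN classifier evaluated with five-fold cross-validation on abstracts labeled "relevant" or "irrelevant") can be illustrated with a minimal sketch. This record does not state the features, the distance measure, or the value of k, so the bag-of-words counts, cosine similarity, and k=3 used here are assumptions, and the ten toy abstracts are hypothetical stand-ins for the 310 PubMed abstracts. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical toy abstracts (NOT from the paper), interleaved so that every
# fold holds out one "relevant" and one "irrelevant" example.
TOY_ABSTRACTS = [
    ("T-cell epitope from tumor antigen recognized by cytotoxic lymphocytes", "relevant"),
    ("genome assembly for soil bacterium using long reads", "irrelevant"),
    ("novel tumor antigen epitope presented on HLA for T-cell clones", "relevant"),
    ("protein folding kinetics measured with fluorescence spectroscopy", "irrelevant"),
    ("melanoma tumor antigen peptide epitope recognized by T-cell", "relevant"),
    ("soil bacterium genome sequencing using short reads", "irrelevant"),
    ("identification of T-cell epitope in tumor antigen from carcinoma", "relevant"),
    ("fluorescence spectroscopy reveals protein folding intermediates", "irrelevant"),
    ("HLA restricted epitope from tumor antigen activates T-cell response", "relevant"),
    ("long reads improve genome assembly for plant species", "irrelevant"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3):
    """Label a query vector by majority vote among its k most similar training vectors."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validate(examples, folds=5, k=3):
    """Mean accuracy over `folds` folds; fold i holds out items i, i+folds, i+2*folds, ..."""
    data = [(Counter(tokenize(text)), label) for text, label in examples]
    accuracies = []
    for fold in range(folds):
        test = data[fold::folds]
        train = [item for i, item in enumerate(data) if i % folds != fold]
        correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds


accuracy = cross_validate(TOY_ABSTRACTS, folds=5, k=3)
print(f"mean accuracy over 5 folds: {accuracy:.2f}")
```

On real data the held-out folds would normally be shuffled and stratified by class, and term counts would often be replaced by a weighting such as TF-IDF; both are left out here to keep the sketch short.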

Original language	English
Article number	S14
Journal	BMC Genomics
Volume	14
DOIs	https://doi.org/10.1186/1471-2164-14-S5-S14
Publication status	Published - 2013
Externally published	Yes

ASJC Scopus subject areas

Biotechnology
Genetics

Access to Document

10.1186/1471-2164-14-S5-S14

Cite this

@article{f9a3fce70bb74703a942bbc54fb10835,

title = "Literature classification for semi-automated updating of biological knowledgebases",

abstract = "Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either {"}relevant{"} or {"}irrelevant{"} for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.",

author = "Olsen, {Lars R{\o}nn} and Kudahl, {Ulrich Johan} and Winther, Ole and Brusic, Vladimir",

note = "Publisher Copyright: {\textcopyright} 2013 Olsen et al.",

year = "2013",

doi = "10.1186/1471-2164-14-S5-S14",

language = "English",

volume = "14",

journal = "BMC Genomics",

issn = "1471-2164",

publisher = "BioMed Central Ltd.",

}

TY - JOUR

T1 - Literature classification for semi-automated updating of biological knowledgebases

AU - Olsen, Lars Rønn

AU - Kudahl, Ulrich Johan

AU - Winther, Ole

AU - Brusic, Vladimir

PY - 2013

Y1 - 2013

N2 - Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.

AB - Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.

UR - http://www.scopus.com/inward/record.url?scp=84904206044&partnerID=8YFLogxK

U2 - 10.1186/1471-2164-14-S5-S14

DO - 10.1186/1471-2164-14-S5-S14

M3 - Article

C2 - 24564403

AN - SCOPUS:84904206044

SN - 1471-2164

VL - 14

JO - BMC Genomics

JF - BMC Genomics

M1 - S14

ER -
