Log-based anomaly detection for distributed systems: State of the art, industry experience, and open issues

Xinjie Wei; Jie Wang; Chang ai Sun; Dave Towey; Shoufeng Zhang; Wanqing Zuo; Yiming Yu; Ruoyi Ruan; Guyang Song

doi:10.1002/smr.2650

School of Computer Science

Research output: Journal Publication › Article › peer-review

1 Citation (Scopus)

Abstract

Distributed systems have been widely used in many safety-critical areas. Any abnormalities (e.g., service interruption or service quality degradation) could lead to application crashes or decrease user satisfaction. These things may cause serious economic losses. Among the various quality assurance approaches for distributed systems, log-based anomaly detection (LAD) has become a popular research topic. Its popularity relates to system logs being able to record and reveal important run-time information. This paper presents a general LAD framework for distributed systems. Log grouping and feature-pattern mining are two crucial LAD components that impact on the anomaly-detection effectiveness. We also present a systematic survey of techniques in these two directions; propose classification frameworks for log grouping and feature patterns; and summarize four log-grouping techniques and five feature patterns (which refer to invariant relationships among logs that can be used for anomaly detection). To evaluate their applicability, we report on the findings when applying existing techniques to Ray, a popular industrial distributed system. Based on these findings, several open issues are identified, which provide potential guidance for future research and development.

Original language	English
Article number	e2650
Journal	Journal of Software: Evolution and Process
Volume	36
Issue number	8
DOIs	https://doi.org/10.1002/smr.2650
Publication status	Accepted/In press - 2024

Keywords

distributed systems
industry experience
log-based anomaly detection
quality assurance

ASJC Scopus subject areas

Software

Cite this

@article{fb7c84bef0494e6aa895bf748eae1ad6,

title = "Log-based anomaly detection for distributed systems: State of the art, industry experience, and open issues",

abstract = "Distributed systems have been widely used in many safety-critical areas. Any abnormalities (e.g., service interruption or service quality degradation) could lead to application crashes or decrease user satisfaction. These things may cause serious economic losses. Among the various quality assurance approaches for distributed systems, log-based anomaly detection (LAD) has become a popular research topic. Its popularity relates to system logs being able to record and reveal important run-time information. This paper presents a general LAD framework for distributed systems. Log grouping and feature-pattern mining are two crucial LAD components that impact on the anomaly-detection effectiveness. We also present a systematic survey of techniques in these two directions; propose classification frameworks for log grouping and feature patterns; and summarize four log-grouping techniques and five feature patterns (which refer to invariant relationships among logs that can be used for anomaly detection). To evaluate their applicability, we report on the findings when applying existing techniques to Ray, a popular industrial distributed system. Based on these findings, several open issues are identified, which provide potential guidance for future research and development.",

keywords = "distributed systems, industry experience, log-based anomaly detection, quality assurance",

author = "Xinjie Wei and Jie Wang and Sun, {Chang ai} and Dave Towey and Shoufeng Zhang and Wanqing Zuo and Yiming Yu and Ruoyi Ruan and Guyang Song",

note = "Publisher Copyright: {\textcopyright} 2024 John Wiley \& Sons Ltd.",

year = "2024",

doi = "10.1002/smr.2650",

language = "English",

volume = "36",

journal = "Journal of Software: Evolution and Process",

issn = "1532-060X",

publisher = "John Wiley and Sons Ltd",

number = "8",

}

TY - JOUR

T1 - Log-based anomaly detection for distributed systems

T2 - State of the art, industry experience, and open issues

AU - Wei, Xinjie

AU - Wang, Jie

AU - Sun, Chang ai

AU - Towey, Dave

AU - Zhang, Shoufeng

AU - Zuo, Wanqing

AU - Yu, Yiming

AU - Ruan, Ruoyi

AU - Song, Guyang

PY - 2024

Y1 - 2024

N2 - Distributed systems have been widely used in many safety-critical areas. Any abnormalities (e.g., service interruption or service quality degradation) could lead to application crashes or decrease user satisfaction. These things may cause serious economic losses. Among the various quality assurance approaches for distributed systems, log-based anomaly detection (LAD) has become a popular research topic. Its popularity relates to system logs being able to record and reveal important run-time information. This paper presents a general LAD framework for distributed systems. Log grouping and feature-pattern mining are two crucial LAD components that impact on the anomaly-detection effectiveness. We also present a systematic survey of techniques in these two directions; propose classification frameworks for log grouping and feature patterns; and summarize four log-grouping techniques and five feature patterns (which refer to invariant relationships among logs that can be used for anomaly detection). To evaluate their applicability, we report on the findings when applying existing techniques to Ray, a popular industrial distributed system. Based on these findings, several open issues are identified, which provide potential guidance for future research and development.

AB - Distributed systems have been widely used in many safety-critical areas. Any abnormalities (e.g., service interruption or service quality degradation) could lead to application crashes or decrease user satisfaction. These things may cause serious economic losses. Among the various quality assurance approaches for distributed systems, log-based anomaly detection (LAD) has become a popular research topic. Its popularity relates to system logs being able to record and reveal important run-time information. This paper presents a general LAD framework for distributed systems. Log grouping and feature-pattern mining are two crucial LAD components that impact on the anomaly-detection effectiveness. We also present a systematic survey of techniques in these two directions; propose classification frameworks for log grouping and feature patterns; and summarize four log-grouping techniques and five feature patterns (which refer to invariant relationships among logs that can be used for anomaly detection). To evaluate their applicability, we report on the findings when applying existing techniques to Ray, a popular industrial distributed system. Based on these findings, several open issues are identified, which provide potential guidance for future research and development.

KW - distributed systems

KW - industry experience

KW - log-based anomaly detection

KW - quality assurance

UR - http://www.scopus.com/inward/record.url?scp=85184441830&partnerID=8YFLogxK

U2 - 10.1002/smr.2650

DO - 10.1002/smr.2650

M3 - Article

AN - SCOPUS:85184441830

SN - 1532-060X

VL - 36

JO - Journal of Software: Evolution and Process

JF - Journal of Software: Evolution and Process

IS - 8

M1 - e2650

ER -
