drain-java


Introduction

drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java will identify similar tokens and update the matching cluster with a new template, or simply create a new token cluster. Each time a cluster is matched, a counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, but here the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced.
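
To give an intuition of the structure, here is a conceptual sketch of such a fixed-depth tree. It is not drain-java's actual internals; the node layout and level keys (token count first, then leading tokens) are assumptions loosely based on the Drain paper.

Fixed-depth tree sketch
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Node {
    final Map<String, Node> children = new HashMap<>();
    // leaves hold the candidate clusters (token templates)
    final List<List<String>> clusters = new ArrayList<>();
}

class FixedDepthTree {
    private final Node root = new Node();
    private final int maxDepth;

    FixedDepthTree(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // Descend at most maxDepth levels, first keyed by token count, then by the
    // leading tokens, creating nodes as needed; the similarity search is then
    // performed over the clusters stored in the returned leaf.
    Node leafFor(List<String> tokens) {
        Node node = root.children.computeIfAbsent(
                String.valueOf(tokens.size()), k -> new Node());
        int depth = Math.min(maxDepth, tokens.size());
        for (int i = 0; i < depth; i++) {
            node = node.children.computeIfAbsent(tokens.get(i), k -> new Node());
        }
        return node;
    }
}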

Usage

First, Java 11 is required to run drain-java.

As a dependency

You can consume drain-java as a dependency in your project via io.github.bric3.drain:drain-java-core. Currently only snapshots are available; they can be consumed by adding this repository.

repositories {
    maven {
        url("https://oss.sonatype.org/content/repositories/snapshots/")
    }
}
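
Then declare the dependency; the snapshot version below is an assumption (based on the current 0.1.0-SNAPSHOT builds), check the snapshot repository for the actual latest version:

dependencies {
    // the snapshot version is an assumption, check the repository for the latest one
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}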

From command line

Since this tool is not yet released, it needs to be built locally. The built jar is not yet very user-friendly either, and since it's not a finished product, anything could change.

Example usage
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h

tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE
...
      FILE          log file
  -d, --drain       use DRAIN to extract log patterns
  -f, --follow      output appended data as the file grows
  -h, --help        Show this help message and exit.
  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use
                      -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                    when using DRAIN remove the left part of a log line up to
                      after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                    when using DRAIN remove the left part of a log line up to
                      COLUMN
  -V, --version     Print version information and exit.
      --verbose     Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64

By default, the tool acts similarly to tail and outputs the file to stdout. It can follow a file if the --follow option is passed. However, when run with --drain, the tool classifies log lines using DRAIN and outputs the identified clusters. Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).

On the SSH log data set we can use it this way.

$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar \
  -d \ (1)
  -n 0 \ (2)
  --parse-after-str "]: " \ (3)
  build/resources/test/SSH.log (4)

  1. Identify patterns in the log

  2. Start from the beginning of the file (otherwise it starts from the last 10 lines)

  3. Remove the left part of each log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively ignoring some variable elements like the time; see the sketch after this list

  4. The log file
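
Conceptually, the --parse-after-str stripping amounts to something like the following sketch; it is an illustration, not the tool's actual code:

// Illustration only, not the tool's actual code: keep what follows the first
// occurrence of the fixed separator, or the whole line when it is absent.
static String stripAfter(String line, String separator) {
    int idx = line.indexOf(separator);
    return idx >= 0 ? line.substring(idx + separator.length()) : line;
}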

Log pattern clusters and their occurrences
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters (1)
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 (2)
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
  1. 51 types of logs were identified from 655147 lines in 1.588s

  2. There were 140768 similar log messages with this pattern, with 3 positions where a token was identified as a parameter <*>.

On the same dataset, the Java implementation performed roughly 10 times faster. As my implementation does not yet support masking, the mask configuration was removed in the Drain3 implementation for the comparison.

From Java

This tool is not yet intended to be used as a library, but for the curious, the DRAIN algorithm can be used this way:

Minimal DRAIN example
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();
Files.lines(Paths.get("build/resources/test/SSH.log"),
            StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with the clusters
drain.clusters();
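
To inspect the result, the mined clusters can simply be printed. A minimal sketch, assuming clusters() returns a collection and relying only on the clusters' toString() rendering:

// Print each mined cluster; relies on toString() only, since the cluster
// accessor API isn't shown here.
drain.clusters().forEach(System.out::println);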

Status

Pieces of the puzzle are coming together in no particular order. I first bootstrapped the code from a simple Java file, then wrote a Java implementation of Drain. Here's what I would like to do next.

Todo
  • ❏ More unit tests

  • ✓ Wire things together

  • ❏ More documentation

  • ✓ Implement tail follow mode (currently in drain mode the whole file is read and the tool stops once finished)

  • ❏ In follow drain mode, dump clusters on forced exit (e.g. when hitting Ctrl+C)

  • ✓ Start reading from the last x lines (like tail -n 30)

  • ❏ Implement log masking (e.g. a log contains an email or an IP address, which may be considered private data)

For later
  • ❏ JSON message field extraction

  • ❏ How to handle prefixes: dates, log level, etc.; possibly using masking

  • ❏ Investigate markers with specific behavior, e.g. log level severity

  • ❏ Investigate logs with stacktraces (likely multiline)

  • ❏ Improve handling of very long lines

  • ❏ Logback appender with micrometer counter

Motivation

I was inspired by a blog article from one of my colleagues on LogMine (many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log pattern extraction of Datadog's Log Explorer, and his blog post sparked my interest.

After some discussion together, we saw that Drain was a bit superior to LogMine. Googling for Drain in Java didn't yield any results; although I certainly didn't search exhaustively, this triggered the idea of implementing the algorithm in Java.

References

This project is mostly a port of Drain3, done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the original work done by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.

I didn't follow up on other contributors of these projects; reach out if you think you have been omitted.

For reference, here are the links I looked at: