Logging subsystem

Logging subsystem provides facilities for capturing, recording and reporting events which occur within the OpenDaylight system. These events are the primary means of system diagnostics and troubleshooting, serving a wide audience including automated monitoring systems, operators, administrators, support personnel and development engineers.

In order to provide a ‘single system’ experience, all software components should follow same basic rules for interfacing with the logging system. While it is not practical to force these rules on the various third parties, they should to be followed by all newly-developed components.

Logging in Java

Java provides a diverse set of logging APIs and implementations. OpenDaylight has selected SLF4J API, with the implementation being provided by the Karaf container.

The primary reasons for this decision are:

  • proper split between API and implementation, allowing us to change the implementation without impacting code

  • ability to provide legacy APIs (JUL, JCL, etc.) for third-party code

Message levels

SLF4J defines five levels of logging messages, ranging from TRACE to ERROR. The guidelines for their use are very vague and in the grand UNIX tradition mix severity of the message with its granularity. The following summary defines the basic rules on when to use a specific level, who the target audience is and how any message of the specified level will be interpreted.

Note that all events with level INFO or higher present an API-like contract of the system from the integration point-of-view: if they change, third-party systems such as monitoring may need to be updated to work correctly with the new system release. The message text on these levels should be understandable to people with networking and systems administration background, so any language assuming knowledge of programming in general, or Java in particular, should be avoided if at all possible.

Messages on DEBUG and higher present are part of interface contract with support entities, e.g. if they are changed operator and troubleshooting manuals, as well as knowledge-base systems may need to be updated to correctly interpret the information conveyed. On the DEBUG level, messages may assume a slight level of familiarity with general programming concepts. Terminology specific to any programming language should be avoided if possible.

ERROR

This level serves as the general error facility. It should be used whenever the software encounters an unexpected error which prevents further processing. The reporting component may attempt recovery which may affect system availability or performance for a period of time, or it may forcefully terminate if it is unable to recover. This level is always reported for all software components.

There are three examples when this level should be used:

  1. Incorrigible internal state inconsistency, such as when a JVM reports an OutOfMemoryError. The top-level handler would log this condition and force a JVM exit, as it is not capable of continuing execution.

  2. Internal state inconsistency, which is correctable by flushing and re-creating the state. In this case the component would log an event, which would indicate what assertion has been violated and that the state has been flushed to recover.

  3. Request-level processing error, e.g. the application encounters an error which is preventing a particular request from completing, but there is no indication of systematic failure which would prevent other requests from being successfully processed.

The primary audience are monitoring systems and system operators, as the events reported have impact on operational performance of the system.

WARN

This level serves for events which indicate irregular circumstances, from which we have a clearly-defined recovery strategy which has no impact on system availability or performance as seen by the reporting component. Events on this level are usually reported.

A typical example for a candidate event is when a software component detects inconsistency within an external data feed and it performs a corrective action to compensate for it. Let’s say we process a list of key/value pairs and encounter a duplicate key: we can either overwrite the old occurance, ignore the new occurance or abort. If we take any of the first two actions, we should report a WARN event. If we take the third, we should report an ERROR event.

The primary audience of these events are automated systems, operators and administrators, as this level of messages indicate non-optimal system operation (e.g. data feeds could use normalization) or may forebode a future failure.

INFO

This level serves for events which constitute major state changes within a software component – such as initialization, shutdown, persistent resource allocation, etc. – which are part of normal operations. Events on this level are typically reported for non-library components.

Each software component should log at least four events on this level:

  1. when it starts its initialization

  2. when it becomes operational,

  3. when it starts orderly shutdown,

  4. just before it terminates normally

The primary audience of these events are operators and administrators, who use them to confirm major interactions (such as restarting components) have occurred within the system.

DEBUG

This is the coarse diagnostic level, serving for events which indicate internal state transitions and detail what processing occurs. Events on this level are usually not reported, but are enabled when a subsystem code flows need to be examined for troubleshooting purposes.

Placement and amount of events generated at this level are at the discretion of the development engineers, as both relate directly to the component logic. The amount of detail in these events has to be limited to a single line of information useful for pinning down a misbehaving software component in a seemingly normal system and should be directly cross-referenceable to either previous DEBUG events or component configuration. A hard requirement is that there has to be at least one flow control statement (such as if, for, while) or a call to a different software component between two DEBUG event triggers.

Primary audience of these events are administrators and support personnel diagnosing operational issues, mainly in real-time as they occur on the system.

TRACE

This is the fine-grained diagnostic level, serving for events which indicate internal state transitions in full detail. Events on this level are not reported, but have to be explicitly enabled and may be collected for support purposes.

Placement, amount and contents of these events is completely at the description of development engineers. These events are completely release-specific, may change even between minor releases. Examples of events reported at this level would be method entry and exit, possibly including detailed input arguments, and dumps of internal data as it is being modified.

Primary audience of these events are senior support personnel and development engineers diagnosing operational irregularities which relate directly to code structure, mainly offline after being captured on a live system.

Logger instances

Each class containing calls to the logging subsystem MUST have its own logger, which is not shared with any other class. The variable holding reference to this logger MUST me named ‘LOG’. The easiest way to ensure you have the proper logger declaration is to use the following pattern:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);
    ...
}

Use parameterized logging

Using dynamically-constructed message strings contributes to major overhead as the message string has to be constructed before the call to logging method is performed, thus forcing overhead even if the constructed string is not used (for example DEBUG level is not enabled).

Another issue with dynamically-constructed message strings is that they cannot be easily extracted by static source code analysis – a process critical for creating message catalogue of a particular software release, which in turn is needed for things like support knowledge bases, internationalization, etc.

While the former concern is addressed by Logger classes exposing methods such as LOG.isDebugEnabled(), the second concern can only be alleviated by using explicit String literals when calling the Logger methods. The correct way to address both concerns is to use parameterized logging as described at http://www.slf4j.org/faq.html#logging_performance. The basic pattern to follow is this:

class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);

    // GOOD: string literal, no dynamic objects
    public void good_method(Object arg) {
        LOG.debug("Method called with arg {}", arg);
    }

    // BAD: string varies with argument
    public bad_method1(Object arg) {
        LOG.debug("Method called with arg " + arg);
    }

    // BAD: code clutter
    public void bad_method2(Object arg) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("Method called with arg {}", arg);
        }
    }

    // BAD: wrong level of language, this would be okay on TRACE
    public bad_method3(Object arg) {
        LOG.debug("arg is {}", arg);
    }
}

There is one thing that needs to be noted in this style, which is that logging an exception is properly supported if you supply it as the last argument, but you have to MAKE SURE IT IS NOT HINTED TO IN THE MESSAGE STRING:

class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);

    // GOOD: note how there is no "{}" for ex
    public void good_method(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn("Failed to do something with {}, continuing", arg, ex);
        }
    }

    // BAD:
    // - exception is interpreted as an object
    // - exception chaining cause is lost
    // - stack trace is lost
    public void bad_method(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn("Failed to do something with {} because {}, continuing", arg, ex);
        }
    }
}

Avoid calls to the methods is{Trace|Debug|Info|Warn|Error}Enabled()

While it is true that methods such as isDebugEnabled() & Co. eliminate the minor overhead associated with the variadic method call, the burden on the developer is not acceptable simply because there are much better methods of automatic control of this overhead, without having any impact on the source code (or even the class files). One of them is JIT-level optimizations stemming from the ability to inline calls to LOG.debug(). The other is the set of interfaces from java.lang.instrument package, which can be used to completely eliminate the call overhead by removing all calls to LOG.debug() from the class bytecode based on the logger configuration.

The exception to this rule can be if you log something that has some cost to calculate - and if that log statement runs a lot. Something (made up) like:

for (int i = 0; i < 100000; i++) {
    if (LOG.isDebugEnabled()) {
        LOG.debug("The size is: {}", expensiveMethodToCalculateSize());
    }
}

Note that you can and always should pass Object and thus never toString() your objects passed to a Logger. For example, this is WRONG:

List<Interface> interfaces;
if (LOG.isDebugEnabled()) {
    LOG.info("Interfaces: {}", interfaces.toString());
}

and instead you can simply do:

LOG.info("Interfaces: {}", interfaces); // no need to guard this with isDebugEnabled!

Provide useful event context

Each logging call should provide useful context in which it occurred. This is not usually the case with a lot of Java-based software, notably even with some JVM implementations. Here are some typical anti-patterns which contribute to mitigated ability to diagnose problems when they happen:

class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);

    // VERY BAD:
    // - no context provided
    // - non-constant message string
    // - assumes useful toString()
    public bad_method1(Object arg) {
        LOG.debug(arg.toString());
    }

    // VERY BAD:
    // - no context provided
    public bad_method2(Object arg) {
        LOG.debug("{}", arg);
    }

    // COMPLETELY BAD:
    // - silently ignoring errors!!!
    public bad_method3(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
        }
    }

    // EXTREMELY BAD:
    // - message is not constant
    // - no context is provided
    // - ex.getCause() is lost
    // - call stack is lost
    public void bad_method4(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn(ex.getMessage());
        }
    }

    // EXTREMELY BAD:
    // - message is not constant
    // - no context is provided
    // - ex.getCause() is probably lost
    // - call stack is probably lost
    // - assumes useful toString()
    public void bad_method5(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn(ex.toString());
        }
    }

    // VERY BAD:
    // - no useful context is provided
    // - ex.getCause() is probably lost
    // - call stack is probably lost
    // - administrators don't know what an Exception is!
    public void bad_method6(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn("Exception {}", ex);
        }
    }
}

The proper fix for these anti-patterns is to always provide key information in the logging event:

  • what went wrong

  • how badly it went wrong

  • in case we recover, shortly describe how (especially on WARN level)

class Foo {
    private static final Logger LOG = LoggerFactory.getLogger(Foo.class);

    // GOOD:
    // - string literal
    // - we explain what we tried to do
    // - we pass along information we have about the failure
    // - we explain that we recovered from the failure
    public void good_method1(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.warn("Failed to do something with {}, ignoring it", arg, ex);
        }
    }

    // GOOD:
    // - string literal
    // - we explain what we tried to do
    // - we pass along information we have about the failure
    // - we escalate the failure to our caller
    // - we also 'chain' the exception so it is not lost and can be
    // correlated
    public void good_method2(Object arg) {
        try {
            doSomething(arg);
            ...
        } catch (SomeException ex) {
            LOG.error("Failed to do something with {}", arg, ex);
            throw new RuntimeException("Failed to do something", ex);
        }
    }
}