Status And Diagnostics Framework

https://git.opendaylight.org/gerrit/#/q/topic:s-n-d

Status reporting is an important part of any system. This document explores and describes various implementation options for achieving the feature.

Problem description

Today ODL does not have a centralized mechanism to do status and diagnostics of the various service modules, and have predictable system initialization. This leads to a lot of confusions on when a particular service should start acting upon the various incoming system events, because in many cases(like restarts) services end up doing premature service handling.

The feature aims at developing a status and diagnostics framework for ODL, which can :

  • Orchestrate predictable system initialization, by enabling external interfaces, including northbound and southbound interfaces depending on a set of selected services declaring their availability. This in turn can prevent the system from processing northbound (eg: OpenStack), or southbound (eg: OVSDB or OpenFlow) events prematurely before all services are ready.
  • Perform continuous monitoring of registered modules or internal services to ensure overall health of the system. This can additionally trigger alarms, or node reboots when individual services fail.

Use Cases

This feature will support following use cases:

  • Use case 1: diagstatus module uses apis to create an application which can declare the system as UP.
  • Use case 2: Core services can include existing netvirt and genius services like ELAN, L3VPN, ITM, interface-manager, DataStore and additional services may be ACL, QoS etc as needed. Applications can take necessary actions based on the aggregate system status, for eg: OpenFlow port open, OVSDB port open, and S&D status update(for consumption by other NBs such as ODL Mechanism Driver)
  • Use case 3: Registered Service Modules should expose their status to diagstatus module which inturn will use this information to expose the service status to others.
  • Use case 4: All southbound plugins should leverage the status provided by diagstatus module, as well as config file settings, to block or unblock the southbound interface
  • Use case 5: diagstatus module should monitor the health of all dependant services on a regular basis using JMX.
  • Use case 6: diagstatus module should raise traps whenever health check on a module fails.
  • Use case 7 : diagstatus module should develop the capability to do a node/cluster reboot in future for scenarios mentioned in usecase 6.
  • Use case 8 : diagstatus module should leverage on the counters support provided by infrautils to expose some debug and diagnostics counters.

Proposed change

The proposed feature adds a new module in infrautils called “diagstatus”, which allows CLI or alternative suitable interface to query the status of the services running in context of the controller (interface like Openflow, OVSDB, ELAN,ITM, IFM, Datastore etc.). This also allows individual services to push status-changes to this centralized module via suitable API-based notification. There shall be a generic set of events which application can report to the central monitoring module/service which shall be used by the service to update the latest/current status of the services.

Service startup Requirements

  • Since the statuses are stored local to the node and represents the states of individual services instances of the node, there is no data-sync-up requirements for this service
  • When the service starts-up, required local map for managing service-wise status entries shall be initialized
  • It must be ensured that the status-monitoring-service starts-up fast as service whenever is started/re-booted.

Service API requirements

Status model object encapsulating the metadata of status such as:

  • Node-name – may be this could be populated internally by framework if the node-name is available from within the framework with lesser / no external dependencies
  • Module-name – populated by status-reporting module
  • Service-name – populated by status-reporting module
  • Service-status – populated by status-reporting module
  • Current timestamp – internally populated
  • Status Description – Any specific textual content which service can add to aid better troubleshooting of reported status

Service Internal Functionality Requirements

  • Data for current status of the changes alone must be maintained. Later we can improve it to maintain history of statuses for a given service
  • Since the statuses of services are dynamic there is no persistence requirement to store the statuses
  • Status entry of given service shall be updated based on the metadata of provided by services
  • Entries for service statuses shall be created lazily - if they are not already present, as and when first API invocation is made by the application-module towards the status/health monitoring service
  • Read APIs of Monitoring-Service expose the service statuses on per cluster-node basis only. A separate module shall be developed as part of “cluster-services” user-story which can combine cross-cluster status collation
  • All output of the read-APIs shall return results as Map with URI as key and current service-status and last-update timestamp combined as value
  • In order to check the status of registered services, Status-Monitoring Service shall use standard scheduled timer service to invoke status-check callback on registered services
  • Scheduled probe timer interval shall be configurable in config.ini. Any changes to this configuration shall require the system restart

Service Shutdown Requirements

  • Currently no specific requirements around this area as restarting or node moving to quiescent state results in loss of all local data

Instrumentation Requirements

Applications must invoke status-reporting APIs as required across the lifecycle of the services in start-up, operational and graceful shutdown phases In order to emulate a simpler state-machine, we can have services report following statuses * STARTING – at the start of onSessionInitiated() on instrumented service * OPERATIONAL – at the end of onSessionInitiated() on instrumented service * ERROR – if any exception is caught during the service bring-up of if the service goes into an ERROR state dynamically * REGISTER – on successful registration of instrumented service * UNREGISTER – when a service does unregister from diagstatus on its own

Workflow

Register Service

Whenever the new service comes up, the service provider should register new service in service registry.

Report Status

Application can report their status using diagstatus APIs

Read Service Status

Whenever applications/CLI try to fetch the service status, diagstatus module will query the status through the respective OsgiService implementations exposed by each service,and an aggregated result is provided as response.

Clustering considerations

  • The CLIs/APIs provided by diagstatus module will be cluster wide.
  • Every node shall expose a Status Check MBean for querying the current status which is local to the node being queried.
  • Every node shall also expose a Clusterwide Status Check MBean for querying the clusterwide Status of services.
  • For local status CLI shall query local MBean.
  • For clusterwide status CLI shall query local MBean AS WELL AS and remote MBean instances across all current members of the cluster by accessing respective PlatformMBeanServer locally and remotely.
  • It is assumed that IP Addresses of the current nodes of cluster and standard JMX Port details are available for clusterwide MBeans
  • CLI local to any of the cluster members shall invoke clusterwide MBean on ANY ONE of current set of cluster nodes
  • Every node of cluster shall query all peer nodes using the JMX interface and consolidate the statuses reported by each node of cluster and return combined node-wise statuses across the cluster

Scale and Performance Impact

N/A as it is a new feature which does not impact any current functionality.

Known Limitations

The initial feature will not have the health check functionality. The initial feature will not have integration to infrautils counter framework for dispalying diag-counters.

Usage

Features to Install

This feature adds a new karaf feature, which is odl-infrautils-diagstatus.

JAVA API

Following are the service APIs which must be supported by the Framework :

  • Accept Service-status from services which invoke the framework
  • Get the current statuses of all services of a given cluster-node
  • A registration API to allow monitored service to register the callback
  • An interface which is to be implemented by monitored module which could be periodically invoked by Status-Monitoring framework on each target module to check status
  • Each service implements their own logic to check the local-health status using the interface and report the status

CLI

Following CLIs will be supported as part of this feature:

  • showSvcStatus - get remote service status

OSGI Services

Following osgi services will be supported as part of this feature:

  • DiagStatusService - provides APIs for application to register and unregister services and report service status
  • ServiceStatusProvider - provide information of registered service from ServiceDescriptor

Implementation

Assignee(s)

Primary assignee:
<Faseela K>
Other contributors:
<Nidhi Adhvaryu>

Work Items

  1. spec review
  2. diagstatus module bring-up
  3. API definitions
  4. Aggregate the status of services from each node
  5. Migrate All Application to Diagstatus
  6. Integrate all application
  7. Add CLI.
  8. Add UTs.
  9. Add Documentation

Dependencies

This is a new module and requires the below libraries:

  • org.apache.maven.plugins
  • com.google.code.gson
  • com.google.guava

This change is backwards compatible, so no impact on dependent projects. Projects can choose to start using this when they want.

Following projects currently depend on InfraUtils:

  • Netvirt
  • Genius

Testing

Unit Tests

Appropriate UTs will be added for the new code coming in once framework is in place.

Integration Tests

Since Component Style unit tests will be added for the feature, no need for ITs

CSIT

N/A

Documentation Impact

This will require changes to User Guide and Developer Guide.

User Guide will need to add information on how to use status-and-diag APIs and CLIs

Developer Guide will need to capture how to use the APIs of status-and-diag module to derive service specific actions. Also, the documentation needs to capture how services can expose their status via Mbean and integrate the same to status-and-diag module