Monitoring Akka & Play

The how and the what

Jan Macháček @honzam399  |   Alex Lashford @alexlashford

www.eigengo.com/monitoring-akka-play

So, tell me again

 

 

 

 

 

 

how everything's fine in production!

Worked fine in dev—ops problem now!

We create actor systems in Akka (perhaps with REST APIs in Spray); modern web applications in Play, and we all like to think that these applications will attract thousands billions of users.

What to Monitor

You want to see inside the Akka, Play Framework or Spray and know what is happening in the system's components.

 

  • Number / rates of messages
  • Sizes of queues
  • Duration of messages in queue
  • Time it takes to execute the receive PF
  • Number / rates of failures
  • The threads in the thread pools

Typesafe Console!

Typesafe Console!

Brilliant, looks great, under low load its really impressive. We loved it, but ...

 

  • Production—unusable: too much data, memory usage, I/O
  • Development—great: gives in–depth view of the actor system

Reactive Monitor

We wanted to record just enough information. Too much slows down the monitored system, too little lets events go unnoticed.

And we wanted to have a flexible mechanism to deliver the collected monitoring information.

 

  • Actor creation & destruction.
  • Message types, message rates, failures and performance at the actor level
  • Queue size at the (local) actor level
  • The number of available and running threads in the ThreadPools

Reactive Monitor

We monitor


val system = ActorSystem()
val x = system.actorOf(Props[XActor])
val y = system.actorOf(Props[YActor])
x ! "foo"
y ! "foo"
y ! 42
          

We monitor


x ! "foo"; y ! "foo"; y ! 42
          

We monitor


val system = ActorSystem()
val x = system.actorOf(Props[XActor])
val y = system.actorOf(Props[YActor])
x ! "foo"
y ! "foo"
y ! 42
          

AspectJ

It allows us to instrument Akka's and Play's bytecode as it is loaded by the ClassLoaders. We can do anything we like, as long as the instrumented bytecode verifies.

AspectJ

To turn on LTW, set the -javaagent JVM parameter, and add /META-INF/aop.xml.


<aspectj>
    <aspects>
        <aspect name="ActorCellMonitoringAspect"/>
        <aspect name="DispatcherMonitoringAspect"/>
    </aspects>
    <weaver options="-verbose -showWeaveInfo">
        <include within="akka.actor.*"/>
        <include within="akka.dispatch.*"/>
        <include within="scala.concurrent.*"/>
        <include within="java.util.concurrent.*"/>
    </weaver>
</aspectj>
          

AspectJ


public aspect ActorCellMonitoringAspect {

  Object around(ActorCell actorCell, Object msg) : 
    call(* a.a.ActorCell.receiveMessage(..)) 
    && args(msg) && target(actorCell) {
      // check configuration to see if we're interested
      // publish 'delivered' events
      // measuring performance
      Object result = proceed(actorCell, msg);
      // return null would do, but we are _proper_.
      return result;
  }
}
          

AspectJ


public aspect DispatcherMonitoringAspect {

  before(ExecutorService es) : 
    call(* j.u.c.ExecutorService+.execute(..)) 
    && target(es) {
      // publish 'active threads' count
      // publish 'running threads' count
    }

}
          

Agents

  • Akka - measuring actor instances, their performance, failures, messages, message queues and threads
  • Spray - measuring the HTTP requests, and bytes transferred
  • Play - measuring the HTTP requests, controller performance and failures

Akka Agent

We have some specific instrumenation around some of the useful intractions in the Akka framework, including support for both the Java and Scala APIs

 

  • Actor instances
  • Time in receive / onRecieve(Object)
  • Mailbox size
  • Message time in mailbox
  • Exceptions
  • Executor / Thread data

Play Framework Agent

Instrumenation around some of the useful intractions in the Play Framework, including support for both the Java and Scala APIs

 

  • HTTP requests (grouped by route)
  • Exceptions

Spray Agent

Provides instrumenation for the Spray libraries

 

  • HTTP requests
  • Overall request processing time
  • Marshaller / unmarshaller performance
  • Bytes transferred

Output modules

The aspects load configurable output modules. We scratched our own itch by implementing the statsd / Datadog output module, but we have two more!

 

  • statsd (with Datadog extensions)
  • DTrace (on Solaris, dummy elsewhere)
  • Codahale Metrics

Statsd

Really simple transport: strings up to 508 bytes long over UDP.


aspect ':' value '|' type '|#' tag1 ',' tag2 ',' ...
          

For example, in our application


akka.actor.delivered:50|c|#↩
  akka://default/user/x,akka.type:default.org...XActor
akka.actor.delivered.Integer:50|c|#↩
  akka://default/user/x,akka.type:default.org...XActor
akka.actor.duration:99|ms|#↩
  akka://default/user/x,akka.type:default.org...XActor
          

Statsd

Most statsd tools can show very pretty charts.

Statsd


class StatsdActor(remote: InetSocketAddress, 
                  prefix: String) extends Actor {
  IO(Udp) ! Udp.SimpleSender
  def receive: Receive = {
    case Udp.SimpleSenderReady =>
      context.become(ready(sender))
  }
  def ready(send: ActorRef): Receive = {
    case stat: StatsdStatistic =>
      val payload = toByteString(stat, prefix)
      send ! Udp.Send(payload, remote)
  }
}
          

DTrace

DTrace is kernel-level, production-ready, minimal-impact tracing mechanism.


$ sudo dtrace -l
ID   PROVIDER            MODULE     NAME
 1     dtrace                       BEGIN
 2     dtrace                       END
 3     dtrace                       ERROR
 4 nfsmapid804          nfsmapid    daemon-domain
 5 kerberos800    mech_krb5.so.1    krb_ap_rep-make
   ...
33 akka2083         java_tracing    all-gauges
34 akka2083         java_tracing    execution-time
35 akka2083         java_tracing    all-counters          
          

DTrace

The actual interface relies on the internal tracing API that the Oracle JDK and OpenJDK provides.


@ProviderName("akka")
public interface DtraceCounterProvider 
    extends com.sun.tracing.Provider {

    @FunctionName("Receive execution time")
    @ProbeName("execution-time")
    void executionTime(String name, int length, 
                       int duration);

    ...
}
          

DTrace

The D program needs to copy in data from kernel space to the user space where the program runs.


akka$1:::exeucution-time {
  printf("execution time: %s -> %d", 
    stringof(copyin(arg0, arg1 + 1)), arg2);
}

akka$1:::all-counters {
  printf("counter: %s -> %d", 
    stringof(copyin(arg0, arg1 + 1)), arg2);
}
          

$ sudo dtrace -s script.d `pgrep java`
          

Codahale Metrics

A metrics library in Java (and Scala) that allows us to deliver measurement types similar to what statsd can do.

Unlike statsd, Metrics is the entire monitoing framework. It processes the information it receives, performs the calculations we want, and exposes the measurements.

Codahale Metrics


class MetricsCounterInterface 
  extends CounterInterface with MetricsHandler {

  def registry: MetricRegistry = ...
  def marshaller: NameMarshaller = ...

  def recordExecutionTime(aspect: String, 
                          duration: Int, 
                          tags: String*): Unit =
    updateExecutionTime(aspect, duration, tags)
  
  ...
}
          

Configuration

The aspects need configuration for the agents and outputs.

 

  • Decide which actors to monitor
  • How to aggregate the monitoring data
  • How to deliver the monitoring data

Configuration

Typesafe Config-style files at well-known locations configure the agent and output modules.

 

  • /META-INF/monitor/agent.conf for the agent
  • /META-INF/monitor/output.conf for the output

Configuration


org.eigengo.monitor.agent {
  output.class: "org...StatsdCounterInterface"

  akka {
    included: [
      "akka:default.org...SimpleActor",
      "akka://default/user/x"
    ]
    sampling: [
      { rate: 5, for: [ "akka://default/user/*" ] }
    ]
  }
}          
          

Configuration


org.eigengo.monitor.output.statsd {
    prefix: ""
    remoteAddress: "localhost"
    remotePort: 8125
    refresh: 5
    initialDelay: 5
    constantTags: []
}
          

org.eigengo.monitor.output.codahalemetrics {
    registry-class: "org...DefaultRegistryProvider"
    naming-class:   "org...DefaultNameMarshaller"
    prefix: ""
    refresh: 5
}
          

Into the battle!

  • Dependencies for the agents and output modules you require
    
    "org.eigengo.monitor" % "agent-akka"    % "0.4",
    "org.eigengo.monitor" % "output-statsd" % "0.4"
                  
  • Agent configuration in /META-INF/monitor/agent.conf
    and—if required—
    Output configuration in /META-INF/monitor/output.conf
  • /META-INF/aop.xml to give configuration to the weaver

Demo

Let's add monitoring to statsd & Datadog to an Akka application

War stories?



What have we learnt in building the Reactive Monitor?

 

What have we learnt in building a large Akka system?

Analysed the architecture

CPU-intensive (XML parsing)

Non-reactive / blocking third-party libraries (DB drivers, ....)

Reactive libraries (Async HTTP)

Bulkhead the operations

Isolated the actor system into functional areas:

 

  • Legacy blocking code - PinnedDispatcher
  • CPU intensive code - BalancingDispatcher
  • Remaining reactive code - fork-join-executor / executor


Creating different dispatchers for the unruly code

Viz tuning-dispatchers-in-akka-applications on letitcrash.com.

Nearly over!

Remember...

  • Visibility is king
  • Monitor only what you need
  • Don't wait for the big boom!
  • Bulkhead your application

Thank you!


Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.