Metrics

Colossus uses the (currently nameless) metrics library.

Introduction

High-throughput Colossus services can serve hundreds of thousands of requests per second, which can easily translate to millions of recordable events per second. The Metrics library provides a way to work with metrics with as little overhead as possible. It is configurable via a typesafe config. See reference config The Colossus server and clients define several metrics be default.

Collection Intervals

Collection intervals define how often the raw data from all event collectors are snapshotted and merged together into a single database of values.

The provided config creates a metric system with both 1 second and 1 minute collection intervals. The 1 second interval is used to show real-time metrics where rates and histograms are all showing values reflective of the last second of activity, whereas the 1 minute intervals are used for reporting values to an external database in which case the values reported are all reflective of the last minute of activity.

Default Metrics in Colossus

Running the Colossus Client and Colossus Server makes several metrics available in a default service, the server is prefixed with the service name, the client is prefixed with the service name AND the client name

Server

metric name	Description	Tags
system/fd_count	the open file descriptor counts	N/A
system/gc/cycles	number of garbage collection cycles	type : ConcurrentMarkSweep, ParNew
system/gc/msec	the milliseconds it takes for garbage collection	type : ConcurrentMarkSweep, ParNew
system/memory	the total memory availble to the JVM	type: free, max, allocated
requests	the amount of request to the service	status_code: http code(example: 201, 400)
concurrent_requests	the amount of request to the service in a time interval	N/A
requests_per_connection	the amount of different async requests made in a single request	label: max, mean, etc
errors	the amount of exceptionst that have bubbled up in the service	type: exception name
latency	the time (in ms) it takes to complete a request	label:max,mean, etc status_code:200, 404, etc
worker/event_loops	how many event loops were selected in a given interval	worker: the worker Id for a given event loop
worker/connections	the connections registrered for a new or reconnecting client	worker: the worker Id for a given event loop
worker/rejected_connections	the number of connections that weren’t able to be created by the worker	server: servername, worker: worker id
connections	the amount of connections that are used for work	N/A
refused_connections	the amount of connections that were rejected due to the connection limiter	N/A
connects	the amount of connections attempted, this includes both connections and refused_connections	N/A
closed	the number of connections that are closed	cause: string for the connection closing
highwaters	the amount threads that have changed connection state (normal or highwater_	N/A

Client

metric name	Description	Tags
requests	the amount of requests the client made	client_port: the port in use
errors	the exceptions that bubbled up to the client	type: the exception
dropped_requests	the number of requests that cannot be sent back to the client	client_port: the port in use
connection_failures	if a connection failed to an external host	client_port: the port in use
disconnects	if a connection is lost due to being closed, that isn’t a connection failure (example: connection closed)	client_port
latency	the time (in ms) it takes to complete a request, this includes time in the queue	label:max,mean, etc client_port: the port in use
transit_time	the time (in ms) it takes to complete a request, this does NOT include queue time	label:max,mean, etc client_port: the port in use
queue_time	the time between a request is made to when the call is made	label:max,mean, etc client_port: the port in use

Metric Filters

Metric Addresses and Tags

Every metric has a url-like address used to identify it.

One of the most important aspects of metrics is that values can be tagged. For example, a rate can track hits to API endpoints, using a “endpoint” tag to break down the usage by endpoint.

val rate = Rate("my-rate")
rate.hit(Map("endpoint" -> "foo"))
rate.hit(Map("endpoint" -> "bar"))
rate.hit(Map("endpoint" -> "bar"))

This will in turn lead to

endpoint	value
foo	1
bar	2

Getting Started

If you are using colossus, it depends on the metrics library and pulls it in. Otherwise you must add the following to your build.sbt/Build.scala

libraryDependencies += "com.tumblr" %% "colossus-metrics" % "LATEST_VERSION"

From there, the only required step is to spin up a MetricSystem.

implicit val actorSystem  = ActorSystem()
implicit val metricSystem = MetricSystem("name")

val rate = Rate("my-rate")
rate.hit()

Namespaces

When creating a new collector, you always need an implicit MetricNamespace in scope. The MetricSystem itself acts as the root namespace, but you can use it to create sub-namespaces:

implicit val actorSystem = ActorSystem()
val metricSystem         = MetricSystem("name")

implicit val namespace = metricSystem / "foo" / "bar"

// this rate will have the address "/foo/bar/baz"
val rate = Rate("baz")

Metric Tick Complete example

The example below illustrates creating a service with a url called badUrl and doing a metric tick when an endpoint is hit.

implicit val actorSystem = ActorSystem()

val metricSystem = MetricSystem("badurl")

implicit val ioSystem = IOSystem("badurl", None, metricSystem)

implicit val metricNamespace = ioSystem.metrics

val badUrl = Rate("badurl")

HttpServer.start("example-server", 9000) { initContext =>
  new Initializer(initContext) {
    override def onConnect: RequestHandlerFactory =
      serverContext =>
        new RequestHandler(serverContext) {
          override def handle: PartialHandler[Http] = {
            case request @ Get on Root / "url" =>
              //bad url, metric tick
              if (!request.head.parameters.contains("myurl")) {
                badUrl.hit()
              }
              Callback.successful(request.ok("received response"))
          }
      }
  }
}

Available Collectors

All collectors are thread-safe.

Counter

A counter simply allows you to set, increment, and decrement values:

val counter = Counter("my-counter")
counter.set(Map("foo"       -> "bar"), 2)
counter.increment(Map("foo" -> "bar"))

Rate

A rate is like a counter, but resets at the beginning of each collection interval. Rates are also tracked for every interval.

val rate = Rate("my-rate")
rate.hit()

Histogram

Histograms can gather statistical data about values added to them. By default, histograms are setup with various percentiles defined, the values are sorted and placed in the appropriate percentiles.

val hist = Histogram("my-histogram", percentiles = List(0.5, 0.99, 0.999))
hist.add(12)
hist.add(1)
hist.add(98765)

Metric Reporting

Metric reporters are used to take the periodically generated snapshots of metrics and report them to an external system. A MetricSender is the interface to the remote system and is responsible for properly formatting metrics and handling all communication. Colossus currently has native support for OpenTSDB.

implicit val actorSystem  = ActorSystem()
implicit val metricSystem = MetricSystem("name")

val reporterConfig = MetricReporterConfig(
  metricSenders = Seq(OpenTsdbSender("host", 123)),
  filters = MetricReporterFilter.All
)

// reporter must be attached to a specific collection interval.
metricSystem.collectionIntervals.get(1.minute).foreach(_.report(reporterConfig))

// now once per minute the value of this rate and any other metrics will be reported
val rate = Rate("myrate")

The source code for this page can be found here.

Next: Tasks