Metrics
Colossus uses the (currently nameless) metrics library.
Introduction
High-throughput Colossus services can serve hundreds of thousands of requests per second, which can easily translate to millions of recordable events per second. The Metrics library provides a way to work with metrics with as little overhead as possible. It is configurable via a Typesafe config; see the reference config. The Colossus server and clients define several metrics by default.
Collection Intervals
Collection intervals define how often the raw data from all event collectors are snapshotted and merged together into a single database of values.
The provided config creates a metric system with both 1 second and 1 minute collection intervals. The 1 second interval is used to show real-time metrics, where rates and histograms reflect the last second of activity. The 1 minute interval is used for reporting values to an external database, in which case the reported values reflect the last minute of activity.
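To make the idea of per-interval snapshots concrete, here is a minimal sketch in plain Scala. It is not the Colossus implementation: `IntervalRate` and its methods are hypothetical names invented for illustration, and the sketch only assumes that each interval keeps its own count, which is read and reset when that interval is snapshotted.

```scala
import java.util.concurrent.atomic.LongAdder
import scala.concurrent.duration._

// Hypothetical collector (not a Colossus class): keeps one counter per
// collection interval so each interval can be snapshotted independently.
final class IntervalRate(intervals: Seq[FiniteDuration]) {
  private val counts: Map[FiniteDuration, LongAdder] =
    intervals.map(i => i -> new LongAdder).toMap

  // Record one event against every interval.
  def hit(): Unit = counts.values.foreach(_.increment())

  // Return the count accumulated since the last snapshot of this interval
  // and reset it. Unknown intervals report 0.
  def snapshot(interval: FiniteDuration): Long =
    counts.get(interval).map(_.sumThenReset()).getOrElse(0L)
}
```

In this sketch, snapshotting the 1 second interval does not disturb the 1 minute count, so a real-time view and a once-per-minute report can both be served by the same collector.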
Default Metrics in Colossus
Running the Colossus client and Colossus server makes several metrics available in a default service. Server metrics are prefixed with the service name; client metrics are prefixed with the service name and the client name.
Server
Metric name | Description | Tags |
---|---|---|
system/fd_count | the number of open file descriptors | N/A |
system/gc/cycles | the number of garbage collection cycles | type: ConcurrentMarkSweep, ParNew |
system/gc/msec | the time (in ms) taken by garbage collection | type: ConcurrentMarkSweep, ParNew |
system/memory | the total memory available to the JVM | type: free, max, allocated |
requests | the number of requests to the service | status_code: HTTP code (example: 201, 400) |
concurrent_requests | the number of requests to the service in a time interval | N/A |
requests_per_connection | the number of different async requests made in a single request | label: max, mean, etc. |
errors | the number of exceptions that have bubbled up in the service | type: exception name |
latency | the time (in ms) it takes to complete a request | label: max, mean, etc.; status_code: 200, 404, etc. |
worker/event_loops | how many event loops were selected in a given interval | worker: the worker ID for a given event loop |
worker/connections | the connections registered for a new or reconnecting client | worker: the worker ID for a given event loop |
worker/rejected_connections | the number of connections that weren’t able to be created by the worker | server: server name, worker: worker ID |
connections | the number of connections that are used for work | N/A |
refused_connections | the number of connections that were rejected due to the connection limiter | N/A |
connects | the number of connections attempted; this includes both connections and refused_connections | N/A |
closed | the number of connections that are closed | cause: string describing why the connection closed |
highwaters | the number of threads that have changed connection state (normal or highwater) | N/A |
Client
Metric name | Description | Tags |
---|---|---|
requests | the number of requests the client made | client_port: the port in use |
errors | the exceptions that bubbled up to the client | type: the exception |
dropped_requests | the number of requests that cannot be sent back to the client | client_port: the port in use |
connection_failures | the number of failed connections to an external host | client_port: the port in use |
disconnects | the number of connections lost because they were closed, as opposed to a connection failure (example: connection closed) | client_port: the port in use |
latency | the time (in ms) it takes to complete a request; this includes time in the queue | label: max, mean, etc.; client_port: the port in use |
transit_time | the time (in ms) it takes to complete a request; this does NOT include queue time | label: max, mean, etc.; client_port: the port in use |
queue_time | the time between when a request is made and when the call is made | label: max, mean, etc.; client_port: the port in use |
Metric Filters
Metric Addresses and Tags
Every metric has a url-like address used to identify it.
One of the most important aspects of metrics is that values can be tagged. For example, a rate can track hits to API endpoints, using an “endpoint” tag to break down the usage by endpoint.
val rate = Rate("my-rate")
rate.hit(Map("endpoint" -> "foo"))
rate.hit(Map("endpoint" -> "bar"))
rate.hit(Map("endpoint" -> "bar"))
This in turn produces the following values:
endpoint | value |
---|---|
foo | 1 |
bar | 2 |
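The breakdown in the table above can be reproduced with a few lines of plain Scala. This is only a sketch of the idea that each distinct tag map gets its own value; `TaggedRateSketch` and `countByTags` are hypothetical names and do not reflect how the library stores tagged values internally.

```scala
object TaggedRateSketch {
  type TagMap = Map[String, String]

  // Count hits per distinct tag map: every unique combination of tags
  // is tracked as its own value.
  def countByTags(hits: Seq[TagMap]): Map[TagMap, Long] =
    hits.groupBy(identity).map { case (tags, group) => tags -> group.size.toLong }
}
```

Passing the three tag maps from the example (`foo` once, `bar` twice) yields a count of 1 for `endpoint -> foo` and 2 for `endpoint -> bar`.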
Getting Started
If you are using Colossus, it already depends on the metrics library and pulls it in. Otherwise, you must add the following to your build.sbt/Build.scala:
libraryDependencies += "com.tumblr" %% "colossus-metrics" % "LATEST_VERSION"
From there, the only required step is to spin up a MetricSystem.
implicit val actorSystem = ActorSystem()
implicit val metricSystem = MetricSystem("name")
val rate = Rate("my-rate")
rate.hit()
Namespaces
When creating a new collector, you always need an implicit MetricNamespace in scope. The MetricSystem itself acts as the root namespace, but you can use it to create sub-namespaces:
implicit val actorSystem = ActorSystem()
val metricSystem = MetricSystem("name")
implicit val namespace = metricSystem / "foo" / "bar"
// this rate will have the address "/foo/bar/baz"
val rate = Rate("baz")
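The way nested namespaces build up an address can be illustrated with a small stand-alone model. `SketchAddress` below is a hypothetical type written for this illustration, not the library's address class; it only assumes that each `/` appends one path component and that the root renders as `/`.

```scala
// Hypothetical model of a url-like metric address (not a Colossus type).
final case class SketchAddress(components: Vector[String]) {
  // Appending a child yields a new, longer address; the parent is unchanged.
  def /(child: String): SketchAddress = SketchAddress(components :+ child)

  // Render as a url-like path; the empty root renders as "/".
  override def toString: String = components.mkString("/", "/", "")
}

object SketchAddress {
  val Root: SketchAddress = SketchAddress(Vector.empty)
}
```

With this model, `SketchAddress.Root / "foo" / "bar" / "baz"` renders as `/foo/bar/baz`, matching the address in the comment above.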
Metric Tick Complete example
The example below creates a service with a rate called badUrl and ticks that rate whenever the url endpoint is hit without a myurl parameter.
implicit val actorSystem = ActorSystem()
val metricSystem = MetricSystem("badurl")
implicit val ioSystem = IOSystem("badurl", None, metricSystem)
implicit val metricNamespace = ioSystem.metrics
val badUrl = Rate("badurl")
HttpServer.start("example-server", 9000) { initContext =>
  new Initializer(initContext) {
    override def onConnect: RequestHandlerFactory =
      serverContext =>
        new RequestHandler(serverContext) {
          override def handle: PartialHandler[Http] = {
            case request @ Get on Root / "url" =>
              // bad url: tick the metric when the "myurl" parameter is missing
              if (!request.head.parameters.contains("myurl")) {
                badUrl.hit()
              }
              Callback.successful(request.ok("received response"))
          }
        }
  }
}
Available Collectors
All collectors are thread-safe.
Counter
A counter simply allows you to set, increment, and decrement values:
val counter = Counter("my-counter")
counter.set(Map("foo" -> "bar"), 2)
counter.increment(Map("foo" -> "bar"))
Rate
A rate is like a counter, but resets at the beginning of each collection interval. Rates are also tracked for every interval.
val rate = Rate("my-rate")
rate.hit()
Histogram
Histograms gather statistical data about the values added to them. By default, histograms are set up with various percentiles defined; the values are sorted and placed in the appropriate percentiles.
val hist = Histogram("my-histogram", percentiles = List(0.5, 0.99, 0.999))
hist.add(12)
hist.add(1)
hist.add(98765)
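To show what a percentile over these three values means, here is a sketch using the exact nearest-rank definition on a sorted list. It is an assumption made for illustration: the library's histogram may use a different (for example bucketed or approximate) method, and `PercentileSketch` is a hypothetical name.

```scala
object PercentileSketch {
  // Nearest-rank percentile: the smallest sample such that at least
  // p * n of the samples are less than or equal to it.
  def percentile(values: Seq[Long], p: Double): Long = {
    require(values.nonEmpty, "at least one sample is required")
    require(p > 0.0 && p <= 1.0, "p must be in (0, 1]")
    val sorted = values.sorted
    val rank = math.ceil(p * sorted.size).toInt // 1-based rank
    sorted(rank - 1)
  }
}
```

For the samples 12, 1, and 98765, this definition gives 12 for the 0.5 percentile and 98765 for both 0.99 and 0.999, which illustrates how a single outlier dominates the high percentiles when there are few samples.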
Metric Reporting
Metric reporters are used to take the periodically generated snapshots of metrics and report them to an external system. A MetricSender is the interface to the remote system and is responsible for properly formatting metrics and handling all communication. Colossus currently has native support for OpenTSDB.
import scala.concurrent.duration._ // provides the 1.minute syntax used below
implicit val actorSystem = ActorSystem()
implicit val metricSystem = MetricSystem("name")
val reporterConfig = MetricReporterConfig(
metricSenders = Seq(OpenTsdbSender("host", 123)),
filters = MetricReporterFilter.All
)
// reporter must be attached to a specific collection interval.
metricSystem.collectionIntervals.get(1.minute).foreach(_.report(reporterConfig))
// now once per minute the value of this rate and any other metrics will be reported
val rate = Rate("myrate")
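As an illustration of the formatting work a sender has to do, the sketch below renders one tagged value as a line in OpenTSDB's telnet-style `put` format (`put <metric> <timestamp> <value> <tagk=tagv> ...`). It is not the code of OpenTsdbSender: `OpenTsdbLineSketch` and `putLine` are hypothetical names, and whether the library emits exactly this form is an assumption.

```scala
object OpenTsdbLineSketch {
  // Format a single data point as an OpenTSDB telnet-style "put" line.
  // Tags are sorted so the output is deterministic. OpenTSDB expects at
  // least one tag per data point; this sketch does not enforce that.
  def putLine(metric: String, epochSeconds: Long, value: Long, tags: Map[String, String]): String = {
    val tagPart = tags.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(" ")
    s"put $metric $epochSeconds $value $tagPart".trim
  }
}
```

A real sender would also handle connection management, batching, and retries, none of which this sketch attempts.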