Stuff to know before getting started
The Collins data model
Collins models your infrastructure through the use of "assets". An asset can be anything you want, but the pre-configured asset types are Server, Chassis, Switch, Router, and Configuration. By composing these objects with a rich tagging/attribute system, IPAM, status and state transitions, and a logging system, Collins allows you to dynamically model and automate your infrastructure, no matter the size of your organization.
An asset describes a piece of your infrastructure
Assets are simple representations of a part of your infrastructure topology. They can be things like Routers and Switches, or Servers, Chassis, Configuration, Datacenter, or even user-defined types for your specific needs. Assets primarily consist of:
New asset types can be created via the Asset Type API. The preconfigured types that Collins ships with are:
Status and state describe the current phase of the lifecycle of an asset.
The lifecycle (from birth to death) of an asset is described in terms of its status. The possible status values are fixed and cannot be managed via the API. While all available status values are listed below, the descriptions given are primarily indicative of their meanings for a server. A non-server asset type such as a configuration may only ever be allocated or decommissioned, for instance. The status values described below also give some specific insight into the Tumblr intake process for hardware.
Incomplete: Host not yet ready for use. It has been powered on and entered in Collins but the automated induction process is still being performed.
New: Host has completed the automated induction process and is waiting for an onsite tech to complete physical intake.
Unallocated: Host has completed the intake process and is ready for use (e.g. an available resource for provisioning into a role).
Provisioning: Host has started the provisioning process but has not yet completed it.
Provisioned: Host has finished provisioning and is awaiting final automated verification.
Allocated: This asset is in what should likely be considered a production state.
Cancelled: Asset is no longer needed and is awaiting decommissioning (typically only associated with SoftLayer and other transient cloud-provided assets).
Decommissioned: Asset has completed the outtake process and can no longer be managed.
Maintenance: Asset is undergoing some kind of maintenance and should not be considered for production use.
Status transitions should generally not happen by hand. Automated processes should drive status changes, not people. In fact, the Collins UI only allows you to change the status by taking an action (e.g. cancelling an asset or putting it into maintenance).
While the status of an asset describes where it is in a discrete lifecycle, the state describes a lifecycle specific to a status. For example, a server that is in maintenance may have a state of HARDWARE_PROBLEM or HARDWARE_UPGRADE. Also note that those states are not appropriate for healthy (non-maintenance) assets, and so these states are restricted to assets with a status of Maintenance. New states may be defined with this API.
A state can be either a system state or a non-system state. System states cannot be modified or destroyed. Non-system states can be modified and destroyed. Via the API you can only create non-system states, although support for adding system states may be added in the future. A state can be bound to a status (as in the case of HARDWARE_PROBLEM), or can be used with any status (as in the case of RUNNING). The states available out of the box are described below.
Status | State Label | State Name | State Description |
Any | Failed | FAILED | A service in this state has encountered a problem and may not be operational. It cannot be started nor stopped. |
Any | New | NEW | A service in this state is inactive. It does minimal work and consumes minimal resources. |
Any | Running | RUNNING | A service in this state is operational. |
Any | Starting | STARTING | A service in this state is transitioning to Running. |
Any | Stopping | STOPPING | A service in this state is transitioning to Terminated. |
Any | Terminated | TERMINATED | A service in this state has completed execution normally. It does minimal work and consumes minimal resources. |
Maintenance | Hardware Problem | HARDWARE_PROBLEM | An asset is experiencing a non-IPMI issue and needs to be examined. It needs investigation. |
Maintenance | Hardware Testing | HW_TESTING | Performing some testing that requires putting the asset into a maintenance state. |
Maintenance | Hardware Upgrade | HARDWARE_UPGRADE | An asset is in need or in process of having hardware upgraded. |
Maintenance | IPMI Problem | IPMI_PROBLEM | An asset is experiencing IPMI issues and needs to be examined. It needs investigation. |
Maintenance | Maintenance NOOP | MAINT_NOOP | Doing nothing, bouncing this through maintenance for my own selfish reasons. |
Maintenance | Network Problem | NETWORK_PROBLEM | An asset is experiencing a network problem that may or may not be hardware related. It needs investigation. |
Maintenance | Relocation | RELOCATION | An asset is being physically relocated. |
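The binding between states and statuses shown in the table above can be sketched as a small data model. This is an illustrative sketch only, not the Collins implementation: the `State` class and its `allowed_for` method are hypothetical names, and a `status` of `None` is an assumed way to represent the "Any" column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    """Hypothetical model of a Collins state (not the real Collins classes)."""
    name: str
    status: Optional[str] = None  # None stands for "Any" in the table above
    system: bool = True           # system states cannot be modified or destroyed

    def allowed_for(self, asset_status: str) -> bool:
        # A state bound to a status is only valid for assets in that status;
        # an unbound state can be used with any status.
        return self.status is None or self.status == asset_status


RUNNING = State("RUNNING")
HARDWARE_PROBLEM = State("HARDWARE_PROBLEM", status="Maintenance")
```

With this model, `HARDWARE_PROBLEM.allowed_for("Allocated")` is false, which mirrors the rule that maintenance-only states are restricted to assets with a status of Maintenance.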
A few useful examples of user-defined states are as follows:
Provisioned:PENDING_ALLOCATION could represent the state an asset enters when it has finished its automated provisioning (i.e. base OS imaging, configuration management application, and application deployments), but has not yet been validated to be healthy in external monitoring systems. For example, an "Allocator" script could collect all assets that are Provisioned:PENDING_ALLOCATION and perform validation checks on them (Are they in DNS? Are they pingable? Is SSH reachable? Are metrics being collected in Prometheus? Are all alerts in Prometheus green?). Once health is verified, the Allocator can move the asset into Allocated, where it will be automatically pulled into service discovery and load balancers.
Another useful state could be Allocated:DRAINING. This state could reflect that an asset is currently taking production traffic and should continue to be monitored, but signals to load balancers and resource managers to stop routing new jobs/requests to this asset. This is especially useful for Hadoop Datanodes. Once the asset is fully drained of jobs, an automated process could move it from Allocated:DRAINING into a Maintenance state to pull it out of rotation fully. At Tumblr, we use these states to programmatically generate HDFS dfs.hosts and dfs.exclude lists to dynamically manage the participating members of the Hadoop cluster, and automatically Decommission any asset in Allocated:DRAINING.
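As a rough illustration of the HDFS use case, the sketch below derives include and exclude host lists from asset status and state. It is not Tumblr's actual tooling: the `hdfs_host_lists` function and the `(hostname, status, state)` tuple shape are assumptions made for this example, and a real script would fetch the assets from Collins rather than from an in-memory list. It also assumes that a draining node stays in the include list while being added to the exclude list, which is how HDFS is usually told to decommission a datanode.

```python
from typing import Iterable, List, Tuple

# Hypothetical asset shape for this sketch: (hostname, status, state)
Asset = Tuple[str, str, str]


def hdfs_host_lists(assets: Iterable[Asset]) -> Tuple[List[str], List[str]]:
    """Return (include, exclude) host lists for dfs.hosts and dfs.exclude."""
    include: List[str] = []
    exclude: List[str] = []
    for hostname, status, state in assets:
        if status != "Allocated":
            continue  # only Allocated assets participate in the cluster
        include.append(hostname)
        if state == "DRAINING":
            exclude.append(hostname)  # ask HDFS to decommission this node
    return sorted(include), sorted(exclude)


assets = [
    ("dn1", "Allocated", "RUNNING"),
    ("dn2", "Allocated", "DRAINING"),
    ("dn3", "Maintenance", "HARDWARE_PROBLEM"),
]
include, exclude = hdfs_host_lists(assets)
```

Here `include` contains `dn1` and `dn2`, and `exclude` contains only `dn2`; the asset in Maintenance appears in neither list.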
An audit trail with an API
Every modification or lifecycle event that occurs with an asset is logged, along with who made the change and the time of the change. If a tag is modified (and not encrypted), both the previous and new values are stored. Logs can be searched via the API and can be viewed on the web as well. Logs are immutable but can be created via the API. Below is a list of log levels (based on syslog) and descriptions.
Level | Description |
EMERGENCY | A "panic" condition - notify all tech staff on call? (earthquake? tornado?) - affects multiple apps/servers/sites... |
ALERT | Should be corrected immediately - notify staff who can fix the problem - example is loss of backup ISP connection |
CRITICAL | Should be corrected immediately, but indicates failure in a primary system - fix CRITICAL problems before ALERT - example is loss of primary ISP connection |
ERROR | Non-urgent failures - these should be relayed to developers or admins; each item must be resolved within a given time |
WARNING | Warning messages - not an error, but indication that an error will occur if action is not taken, e.g. file system 85% full - each item must be resolved within a given time |
NOTICE | Events that are unusual but not error conditions - might be summarized in an email to developers or admins to spot potential problems - no immediate action required |
INFORMATIONAL | Normal operational messages - may be harvested for reporting, measuring throughput, etc - no action required |
DEBUG | Info useful to developers for debugging the application, not useful during operations |
NOTE | Created by users via the web UI, can be any kind of message |
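Because the levels above are based on syslog, they can be ordered by the standard syslog numeric severities (0 is most severe). The sketch below filters log entries by a severity threshold. The numeric values come from syslog itself; whether Collins uses the same numbers internally is an assumption, and the `at_least` helper and the `(level, message)` entry shape are hypothetical. NOTE has no syslog equivalent, so this sketch leaves it out of the ordering.

```python
from typing import Iterable, List, Tuple

# Standard syslog severities; lower numbers are more severe.
SYSLOG_SEVERITY = {
    "EMERGENCY": 0,
    "ALERT": 1,
    "CRITICAL": 2,
    "ERROR": 3,
    "WARNING": 4,
    "NOTICE": 5,
    "INFORMATIONAL": 6,
    "DEBUG": 7,
}

LogEntry = Tuple[str, str]  # hypothetical shape: (level, message)


def at_least(entries: Iterable[LogEntry], threshold: str) -> List[LogEntry]:
    """Keep entries whose level is at least as severe as the threshold."""
    limit = SYSLOG_SEVERITY[threshold]
    return [
        (level, message)
        for level, message in entries
        # Levels outside the syslog scale (such as NOTE) are skipped.
        if level in SYSLOG_SEVERITY and SYSLOG_SEVERITY[level] <= limit
    ]


entries = [
    ("DEBUG", "cache miss"),
    ("ERROR", "provisioning failed"),
    ("NOTE", "swapped a disk"),
    ("ALERT", "backup ISP connection lost"),
]
urgent = at_least(entries, "ERROR")
```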
System logs (messages that aren't specific to any particular kind of asset) can only be logged internally by Collins. Collins of course uses an asset to log these kinds of messages against. By default the system asset is the multicollins.thisInstance value, or tumblrtag1. You can specify the system asset via the features.syslogAsset configuration.
IPAM for engineers, API included
Collins has an IP Address Management (IPAM) system built into it. The IPAM system is used for allocating both IPMI addresses and typical addresses. Addresses are configured in pools (which typically correspond to a VLAN), but can also be configured to be pool-less in the case where you don't manage your own IP Address space.
Collins will prevent duplicate IP address allocation, and will almost always use the smallest available address in a range. It is possible to allocate an address against any kind of asset. This is sometimes useful, for instance, when managing a VIP (virtual or floating IP address). You can create a configuration asset that holds the VIP for a service, then link that asset to others that will actually share the address.
In addition to address allocation, Collins provides the ability to do other things you would expect from a typical IPAM system, such as querying used addresses, understanding what an IP space looks like, and finding assets in a pool or by address.
At Tumblr we combine the IPAM functionality of Collins with the per-asset LLDP data to automatically manage switch provisioning. We also use this data for generating kickstart files with the correct address information.
The fundamental idea with Collins IPAM is that of a pool. A pool is a named group of addresses. A pool definition specifies the network address range (in CIDR notation), an optional start address (i.e. the IP to start allocating from in the specified range), an optional gateway (if it's not the one you would infer from the CIDR range), and a name. Once a pool is configured it is possible to allocate addresses in that pool. If you don't manage your own address space, no worries. You can operate in a 'pool-less' mode where you can specify any address.
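To make the pool idea concrete, here is a minimal in-memory sketch using Python's standard `ipaddress` module. It is not how Collins is implemented or configured: the `Pool` class and its `allocate` method are hypothetical, the inferred gateway is assumed to be the first usable host address in the range, and the sketch always hands out the smallest free address, whereas Collins only does so "almost always".

```python
import ipaddress
from typing import Optional


class Pool:
    """Hypothetical in-memory address pool (illustration only)."""

    def __init__(self, name: str, cidr: str,
                 start: Optional[str] = None,
                 gateway: Optional[str] = None) -> None:
        self.name = name
        self.network = ipaddress.ip_network(cidr)
        first_host = next(iter(self.network.hosts()))
        # Assumption: the inferred gateway is the first usable address.
        self.gateway = ipaddress.ip_address(gateway) if gateway else first_host
        self.start = ipaddress.ip_address(start) if start else first_host
        self.used = set()

    def allocate(self):
        """Return the smallest free address at or after the start address."""
        for addr in self.network.hosts():
            if addr < self.start or addr == self.gateway or addr in self.used:
                continue
            self.used.add(addr)  # tracking used addresses prevents duplicates
            return addr
        raise RuntimeError(f"pool {self.name} is exhausted")


pool = Pool("PROD", "10.0.0.0/29", start="10.0.0.3")
first = pool.allocate()
second = pool.allocate()
```

In this example `first` is `10.0.0.3` and `second` is `10.0.0.4`: the start address skips the low end of the range, and the second call cannot return the same address again.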
More information is available in the API section, as well as in the configuration section, which covers how to configure address pools.