Saturday, March 12, 2016

Conflict in Engineering Organizations

My best days at work are the ones full of conflict. My worst days at work are also the ones full of conflict. That is not a contradiction; I am referring to two completely distinct types of conflict. On the good side, I love cognitive conflict. It challenges me to wrestle with new ideas and think through different perspectives. Brainstorming, whiteboarding and out-of-the-box problem solving are all exercises in cognitive conflict. It is the type of conflict known to drive innovation. It is what we should strive for at work.

I hate affective conflict. When the team’s energy turns from solving a problem to “solving a person,” we have affective conflict. At work this type of conflict is often driven by ambiguity in roles and responsibilities. It can also be driven by individuals who believe the corporate economy is a zero-sum game and look to consolidate power through strong-arming or manipulation. Mental energy is no longer focused on the mission. We now have ‘Game of Thrones’ style corporate politics.

To break out of the cycle we should start with role clarity, knowing that it drives employee morale. There is nothing more important for productivity than employee morale. Most companies known for great morale and culture also have incredibly productive employees. Keep a casual tally of the interesting open source projects or great technology blogs you read. Where do the authors work? Notice any patterns?

Furthermore, what fascinates me is how corporate culture impacts technology choices. The obvious example is Conway’s Law: "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” In my experience working with engineering organizations, simple and elegant systems are produced by teams who care more about the mission (cognitive conflict) and less about job security and pet projects (affective conflict). An engineer who solves a problem in a way that is “stupid simple” cares about the mission, as demonstrated by a system design that is easy to integrate with. Building complex solutions, by contrast, is a great way to stake out corporate real estate, because few others will understand the Frankenstein system.

Morale and culture are force multipliers. Done right, we have energized employees who innovate. The systems they produce are elegant and simple; their architecture is easily communicated to other teams and can be leveraged in other projects. Done wrong, we have a cycle of protective technology decisions, or decisions made with little input from other teams. These systems produce technical debt and are a burden to the organization. Therefore, maximize cognitive conflict at work and minimize affective conflict. Strive for clarity in roles and manage morale. Nothing is more important for engineering teams to succeed.

Saturday, February 6, 2016

Top Five SRE Architecture Principles

When interacting with software developers at work, our site reliability engineering (SRE) team found common themes in the scalability issues surrounding the applications we support, independent of the service under consideration. Therefore, we discussed ways we could concisely communicate our expectations for application architecture when operating in a cloud environment. What we came up with is a distilled list of principles (only five) that we refer back to when consulting on new projects or evaluating technical specifications.

  1. Stateless - The state of a service should be determined by a shared database and not depend on data local to the application instance. Storage should be treated as a service in itself, and the antiquated habit of thinking of storage as a device should be avoided.
  2. Scale linearly - An application should run as a single process on as small a footprint as possible. This enables the SRE team to scale services linearly in a granular fashion. Code logic should not necessitate a specific number of instances but should be capable of scaling up or down as load changes. Discrete functionality is preferred, such that there is a single and obvious metric to scale on.
  3. Minimal configuration - Services should require little to no configuration. We have found configuration management to be a ripe source of human error, so services should ship with sane defaults and infer as much as possible on startup from environment variables or service discovery (see the sketch after this list). Thread and memory footprints should configure themselves automatically to maximize the resource usage on an instance. A side benefit is that the fewer configuration options a service exposes, the fewer permutations need to be tested.
  4. Robust communication - A great quote from Release It! is, "Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk." Therefore, we ask that all integration points of a service be enumerated and have a proper harness to torture test data input and output. Communication should be asynchronous whenever possible (and it almost always is). Adequate controls should exist at integration points, including circuit breakers, time-outs, bulkheads and protocol hand-shaking.
  5. Application visibility - At a minimum, all inbound and outbound transactions should have telemetry that provides visibility into the number and size of transactions, their type and the time each transaction took to execute. Services should know whether they are functioning correctly and make this health status available to the SRE team through some type of API (usually REST). Logging is also a critical component of application visibility: it should be obvious from reading the log files whether the service is working.
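As a purely illustrative sketch of principles 3 and 5, the snippet below shows a tiny service that infers its configuration from environment variables with sane defaults and exposes its health and basic telemetry over REST. Flask and every name in it (PORT, DB_URL, /health) are assumptions for the example, not anything the principles prescribe.

import os
import time

from flask import Flask, jsonify

# Principle 3: sane defaults, overridable from the environment -- no config file.
PORT = int(os.getenv("PORT", "8080"))
DB_URL = os.getenv("DB_URL", "postgres://localhost:5432/app")

app = Flask(__name__)
started_at = time.time()
requests_served = 0

@app.before_request
def count_request():
    # Principle 5: keep simple telemetry on inbound transactions.
    global requests_served
    requests_served += 1

@app.route("/health")
def health():
    # Principle 5: the service reports its own health status over REST.
    return jsonify({
        "status": "ok",
        "uptime_seconds": int(time.time() - started_at),
        "requests_served": requests_served,
        "db_url": DB_URL,
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=PORT)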
There were three sources that we relied heavily on for inspiration. I want to explicitly call these out so we can give credit where credit is due and encourage people to look up these fantastic resources:

Finally, a word on the importance of principles. Modern systems are complex. Distributed cloud systems are particularly complex because of the number of integration points. The allure of intuitively understanding systems with thousands of nodes is toxic. When designing these distributed systems we prefer a pattern/anti-pattern approach, relying on principles rather than attempting to enumerate every possible failure scenario. You will hit failure edge cases you never thought of. Therefore, leaning on principles instead of our own capability to exhaustively understand a system is a practical dose of humility.

Saturday, July 18, 2015

Anomaly Detection with Holt-Winters in Graphite

My final post in this series on anomaly detection in Graphite deals with the Holt-Winters functions. Graphite has a few functions that are based on Holt-Winters predictions. I will look at the use of some of them and end up showing a simple way to alert on anomalies, similar to the timeShift() and coefficient of variation approaches.

Before we get to Holt-Winters it is probably a good idea to explain a concept called smoothing and how it can be helpful when trying to understand data. For a variety of reasons, time-series data can show large variations in a single data point, or a small group of data points, that are not interesting from an analysis standpoint. The non-technical term for this is a "fluke". If you have an alerting system wired for pure threshold-based monitoring, i.e. if the latency of a transaction exceeds 500ms send an alert, you could get a large number of false positives due to the occasional fluke. Quirks in how the application records metrics, or even in your metric collection system itself, can contribute to this phenomenon.

Therefore, it can be beneficial to smooth the data out before performing any action on it, such as alerting. One simple way to do this is through the use of window functions. In a time-series data set they simply take the last K data points and perform a function on them. That function could be a sum, a minimum, a maximum, a 90th percentile, etc. Graphite provides the summarize() function to do just that. Sticking with our latency example, you could plot the average latency over the past five minutes.
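For example, a render target along these lines averages a (hypothetical) latency metric into five-minute buckets:

summarize(stats.timers.app.latency.mean, "5min", "avg")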

Window functions are useful insofar as you can assume the most recent data is the most relevant. Variations on window functions even allow for things such as assigning weights, where more recent data "counts more" and older data "counts less". The exponential smoothing function is one such weighted average: as you get further away from the current time, the data points count less and less towards your smoothing operation.
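For reference, simple exponential smoothing with a smoothing factor α between 0 and 1 can be written as:

s_t = α · x_t + (1 − α) · s_(t−1)

so an observation's weight decays geometrically the further it sits from the current time.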

The functionality discussed above doesn't really account for data being seasonal, or, to put it another way, data that trends. In the last post I used the example where transaction volume increased during business hours. Techniques called double and triple exponential smoothing were invented to account for both the relevant timeliness of data and its seasonal nature; Holt-Winters is one of them. Check out the Wikipedia page for a breakdown of all the statistical equations.

Graphite has four functions that can help plot Holt-Winters series. One of them is holtWintersConfidenceBands(). It plots an upper-bound and a lower-bound series, seeded with data from the previous week. Below is an example on some transaction data. The blue line is the lower bound, the green line is the upper bound and the red line is the actual data set. You can see that Holt-Winters did a pretty good job predicting where the real data would fall.
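With $METRIC standing in for the transaction series, the render targets for that kind of graph look something like:

holtWintersConfidenceBands($METRIC)
$METRIC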

In addition to the confidence bands, Graphite has another useful function called holtWintersAberration(). It takes a series and plots the delta between what Holt-Winters predicted and the actual value. Similar to what we did in my last two posts, we can take this value and create a type of dimensionless heuristic by relating the aberration back to the original metric.
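A sketch of such a function, with $METRIC standing in for the series of interest:

absolute(divideSeries(holtWintersAberration($METRIC), $METRIC))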

The Holt-Winters aberration can be either positive or negative depending on its relationship to the original metric. In order to simplify alerting, I take the absolute value so that my alerting function in Seyren can be a single vector. In order to "weaponize" this for production, pick a metric of interest and look at the historical values of the above function. Cross-reference those values with known trouble times and select an alerting threshold based on that comparison. I will mention again that I have packaged a lot of what is covered in the past three posts into a simple shell script that is available in my Github repo.

Saturday, July 11, 2015

Anomaly Detection with timeShift() in Graphite

In my last post I discussed combining Graphite, Seyren and a little math equation called the coefficient of variation to come up with a statistical way to detect anomalies in your time-series data. As promised in that post, I will cover two more capabilities native to Graphite for finding anomalies. The first and simplest of these functions, which I will review here, is timeShift(). It operates on a series of data and shifts it by a user-specified amount of time. For example, the function below shifts the target metric back seven days.
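With $METRIC standing in for the series of interest, that looks something like:

timeShift($METRIC, "7d")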

Viewing the timeShift() of a metric would be an interesting thing to spot check on a dashboard, for example, the average number of 404s my site gets per hour on this day at this time. (Relatedly, there is another Graphite function called timeStack() that graphs the data for a specified interval; it would likely be even more useful for a dashboard.)
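A rough example of the timeStack() variant, drawing the same placeholder metric for each of the last seven days on one graph:

timeStack($METRIC, "1d", 0, 7)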

However, our purpose in this blog post is not simply to draw a helpful dashboard but to do some basic anomaly detection by creating a data series that is a dimensionless heuristic, similar to the coefficient of variation. To do this we can relate the timeShift() to the metric it operates on: take the original metric, subtract the time-shifted series, divide by the original metric, and then take the absolute value. (Taking the absolute value does lose some information, but it makes it easier to produce single-dimension alerts.) Represented in mathematical notation:
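| (metric − timeShift(metric)) / metric |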

This calculation can be applied to nearly any time-series metric and produce a meaningful result. In Graphite it is represented by the function below, where $METRIC is the interesting time-series data you care about and $SHIFT is the amount of time you want to shift back (e.g. 7 days).
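One way to express it in Graphite:

absolute(divideSeries(diffSeries($METRIC, timeShift($METRIC, "$SHIFT")), $METRIC))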

You now have a metric that can be monitored in Seyren. By plotting it historically and comparing its value to times of known system trouble, it is possible to create meaningful warning and error alerts.

In closing, we should lay out the assumptions implicit in detecting anomalies with the coefficient of variation and the time-shift approach outlined above. The coefficient of variation assumes that recent data is most relevant to predicting anomalies. It essentially asks: is this data very different from a recently calculated mean? Alerting on it requires you to decide how much this data can change in a given time window before it matters. The time-shift approach assumes something entirely different: that your data is seasonal, i.e. it has predictable patterns such as business-hours traffic. When you alert on this type of data you compare it to a previous time period and check whether it varies within an acceptable level. Both coefficient-of-variation and time-shift checks have their place. However, what if you could have your cake and eat it too? That is essentially what Holt-Winters is all about, and in my next post I will cover how to write Seyren checks based on those calculations.

Friday, July 3, 2015

Anomaly Detection with the Coefficient of Variation in Graphite

Whether for operational or security metrics, anomaly detection can be a tricky thing. Trying to nail down a reliable heuristic that works across a varied set of time-series data is not easy. Static checks, on the other hand, can be much more straightforward: put a message in Slack when disk usage exceeds 95%, or send an alert to PagerDuty if average user CPU exceeds 90%. Given the simplicity of direct threshold checks, why even attempt anomaly detection? Answer: automation.

At work we monitor multiple environments with a wide range of application-level time-series transactional metrics. These metrics can vary greatly between environments. Instead of trying to predict transaction rates ahead of time, it would be beneficial to automate the creation of a check that is independent of the environment yet high-fidelity enough to be worth paying attention to. In our shop we send application-level metrics to StatsD, which forwards them to a Graphite cluster. We alert on this Graphite cluster using Seyren.

Graphite has some pretty cool functions built in to help with anomaly detection, such as Holt-Winters aberrations. Perhaps in a later blog post I can comment on how to make use of them. For this post I want to concentrate on a common statistical equation called the coefficient of variation. From Wikipedia:
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean. 

The CV is valuable because it relates the standard deviation to the mean. It is especially useful here because we needed a generic calculation that would be independent of the transactional metric being checked. It is also important to note that the CV is dimensionless, so the calculation yields a kind of heuristic rather than a value tied to any particular unit.

Thankfully, Graphite has a standard deviation function built in as well as a divide series function. In Graphite you can build a function that looks something like this to calculate the CV:
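One way to express it, with $METRIC as the series of interest and the same window fed to both functions (note that stock Graphite's stdev() takes its window as a number of data points):

divideSeries(stdev($METRIC, $CV_WINDOW), movingAverage($METRIC, $CV_WINDOW))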

In the above, $METRIC is the Graphite metric you are calculating the CV for. An example namespace would be something like server1.application_metrics.transactions_rate. The $CV_WINDOW is how far back you want to go for the calculation. I think Seyren has trouble going back more than ten to fifteen minutes, but I could be wrong.

To make it actionable, first put the formula into Graphite and see what the historical data looks like. Cross-reference any known trouble spots with the CV value and you can start to formulate an alerting threshold. Perhaps you want to warn when the heuristic gets to 3.0 and alert when it gets to 4.0. Input those values into Seyren and you will get a graph like the one below.

At the beginning of this post I said one of the driving reasons for going through all of this trouble was automation. Seyren has an API that can be used to set up alerts. I have hosted on my Github account a shell script that can automate CV checks as well as the time-shift and Holt-Winters ones (I will try to cover my approach to those two in a later post):

The idea is that you can wire up the script to fire when a new environment or application server is provisioned.

Saturday, May 31, 2014

Vulnerability Data into Elasticsearch

My day job has me focusing on Elasticsearch more these days. A while back I did a post on getting vulnerability data into ELSA. As a follow-up, I have been meaning to write a brief post on how to do the same with Elasticsearch. If you are not familiar with Elasticsearch, go check it out here. Their website classifies it as "distributed restful search and analytics". It is often combined with Logstash and Kibana to form the "ELK" stack. The same reasons that made having vulnerability data alongside your event logs a good idea in ELSA also apply if you are using the ELK stack. I modified my existing script to take an input file from one of several vulnerability scanners and index the results in Elasticsearch.

Before we begin, note that my Python script makes use of the Elasticsearch Python client, which I installed via pip:

# pip install elasticsearch
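
As a rough sketch (not the actual script), indexing a single parsed finding with that client might look like the following; the field names here are illustrative and would really come from the scanner's XML:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# One parsed finding from a scanner report; field names are illustrative.
doc = {
    "scanner": "nessus",
    "host": "192.168.1.10",
    "plugin_id": "19506",
    "severity": "medium",
    "time": "2014/05/31 10:00:00",
}

es.index(index="vulns", doc_type="vuln", body=doc)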

I assume an index called vulns exists. You can create it by hitting the Elasticsearch API like this:
$ curl -XPUT http://localhost:9200/vulns
Different vulnerability scanners present time formats slightly differently, so it is a good idea to map the time field appropriately. For more information, check the Elasticsearch docs. This is a sample API call you could make:
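Something along these lines maps a time field on the index; the document type (vuln), field name and format string are assumptions you would adjust to your scanner's output:

$ curl -XPUT http://localhost:9200/vulns/_mapping/vuln -d '{
    "vuln": {
      "properties": {
        "time": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss" }
      }
    }
  }'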

After the index is created and mapped, you can run the script with XML output from a vulnerability scanner as input.
python -i nessus_report_test_home.nessus -e -r nessus

I have created a very simple dashboard in Kibana to visualize some of the vulnerabilities.

The script and dashboard can be found at my Github page:

Thursday, November 28, 2013

Vulnerability Data into ELSA

At Security BSides Augusta I released a script that takes a variety of vulnerability scanner data and imports it into ELSA. I have been meaning to write a blog post about its usage but just haven't gotten around to it. With a couple of days off for the holiday, here it is.

First, the script can be found at my Github account. I have created Nessus-to-ELSA and OpenVAS-to-ELSA scripts in the past. This script combines both of those and adds support for Nmap and Nikto, all in one place.

The script is very straightforward to use. Simply give it a Nessus, OpenVAS, Nmap, or Nikto output report in XML format and an ELSA IP address, and you should be off to the races.

$ python -i report.nessus -r nessus -e elsa_ip

Before running the script for the first time you will want to create the XML and SQL files that allow ELSA to recognize the syslog output the script provides. The -x and -s options will automatically create these for you and write them out to files.

"Usage: [-i input_file | input_file=input_file] [-e elsa_ip | elsa_ip=elsa_ip_address] [-r report_type | --report_type=type] [-s | --create-sql-file] [-x | --create-xml-file][-h | --help]"

As always, I welcome feedback and would be happy to add more vulnerability assessment tools if you have recommendations. I would just ask that you send me a sanitized output report file, since I might have limited access to the tool.