Building operational excellence

For our latest MSP audit, AWS raised the bar on Next-Generation Monitoring Capabilities. The topic discussed in this post is an excellent example of how the MSP programme should operate, continually challenging its partners and driving customer quality through implementing its latest services and best practices.

Published at

16 September 2021

This behaviour is how the Nasstar AWS practice has evolved through technical innovation since onboarding into the programme in 2015.

The Challenge

AWS Managed Service Provider (MSP) partners will know that this control is centred around deep monitoring of a workload from an application and infrastructure perspective. Additional requirements around machine learning (ML) have been included in the scope to filter the signal from the noise without extensive manual intervention.

This time around, AWS had raised the bar on this requirement by introducing anomaly detection, which must be applied across multiple metrics and layers.

Nasstar caught this control early enough to design and implement a suitable solution in time to capture the required data for evidence.

The Solution

Historically, Nasstar fulfilled this requirement by using third-party tools. While this is an acceptable approach, it can be cost-prohibitive for some customers and time-consuming to configure, making it difficult to adopt as a standard for all customer workloads.

In 2021, following recent improvements to the Amazon CloudWatch service, Nasstar evaluated the art-of-the-possible using AWS native monitoring services to see if these could directly address the requirements for a standard next-generation cloud monitoring solution.

Solution Architecture

Amazon CloudWatch Anomaly Detection is used on appropriate alarms where a static threshold cannot be defined, such as those on metrics that follow variable traffic patterns.
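
As a minimal sketch of what such an alarm can look like, the Python below builds the parameter set for the CloudWatch PutMetricAlarm API, using an ANOMALY_DETECTION_BAND expression in place of a static threshold. The alarm name, the load balancer dimension value and the band width of 2 standard deviations are illustrative assumptions, not values taken from the Nasstar solution.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions,
                         stat="Average", band_width=2, period=300,
                         evaluation_periods=3):
    """Build PutMetricAlarm parameters for an anomaly detection alarm.

    Every value is supplied by the caller; the defaults are assumptions
    chosen for illustration only.
    """
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric leaves the expected band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        # Refers to the band expression below instead of a fixed Threshold.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": name, "Value": value}
                            for name, value in dimensions.items()
                        ],
                    },
                    "Period": period,
                    "Stat": stat,
                },
                "ReturnData": True,
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected range)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical example: request count on a load balancer with daily peaks.
params = anomaly_alarm_params(
    alarm_name="example-alb-requests-anomaly",
    namespace="AWS/ApplicationELB",
    metric_name="RequestCount",
    dimensions={"LoadBalancer": "app/example-alb/0123456789abcdef"},
    stat="Sum",
)
print(params["Metrics"][1]["Expression"])
```

With boto3 installed and credentials configured, the dictionary could be passed to the CloudWatch client as put_metric_alarm(**params).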

Most alarms are children of a composite alarm, which helps to reduce the noise whilst also resulting in a cleaner, more intuitive dashboard experience, particularly for complex workloads with large numbers of instances.
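
To illustrate the parent/child relationship, the sketch below builds the rule expression for a composite alarm that goes into ALARM when any of its children does. The alarm names are hypothetical, and the OR-only rule is an assumption; the actual rules used in the solution are not described in this post.

```python
def composite_alarm_rule(child_alarm_names):
    """Return an AlarmRule that is in ALARM if any child alarm is in ALARM."""
    if not child_alarm_names:
        raise ValueError("a composite alarm needs at least one child alarm")
    return " OR ".join(f'ALARM("{name}")' for name in child_alarm_names)


def composite_alarm_params(parent_name, child_alarm_names):
    """Build parameters for the CloudWatch PutCompositeAlarm API."""
    return {
        "AlarmName": parent_name,
        "AlarmRule": composite_alarm_rule(child_alarm_names),
    }


# Hypothetical per-instance CPU alarms rolled up behind one parent alarm.
children = [f"web-cpu-high-{suffix}" for suffix in ("a", "b", "c")]
print(composite_alarm_params("web-tier-cpu", children)["AlarmRule"])
# prints: ALARM("web-cpu-high-a") OR ALARM("web-cpu-high-b") OR ALARM("web-cpu-high-c")
```

Only the parent needs to appear on the dashboard or notify the support team, which is where the noise reduction comes from.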

To collect default and custom metrics and relevant OS/application logs, the CloudWatch Unified Agent is deployed through AWS Systems Manager (SSM). Each workload uses a configuration file centralised in the SSM Parameter Store for reduced management. Metric filters are used on specific logs to identify an increasing number of errors or security failures.
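
A per-workload agent configuration of this kind might be generated as in the sketch below. It assumes Linux instances, and the workload name, log path, log group naming scheme and parameter naming scheme are all hypothetical; only the overall JSON layout follows the CloudWatch agent configuration file format.

```python
import json


def agent_config(workload, log_files, namespace="CWAgent"):
    """Build a CloudWatch agent configuration for one workload (Linux)."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "namespace": namespace,
            # Tag every custom metric with the instance it came from.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": path,
                            # Hypothetical naming: one log group per file.
                            "log_group_name": f"/{workload}{path}",
                            "log_stream_name": "{instance_id}",
                        }
                        for path in log_files
                    ]
                }
            }
        },
    }


def parameter_name(workload):
    """Assumed naming convention for the central Parameter Store entry."""
    return f"AmazonCloudWatch-{workload}"


config = agent_config("payments", ["/var/log/secure"])
print(parameter_name("payments"))
print(json.dumps(config, indent=2))
```

The JSON string could then be stored once with the SSM put_parameter call (Name, Value, Type="String") and referenced by every instance in the workload, so a change to the monitored logs is made in one place.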

With every workload being different, in addition to the baseline monitoring, a pre-defined set of metrics is available for internal teams to pick from. Each metric has a default threshold and configuration so that teams can quickly deploy monitoring.
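
One way to model such a pick list is a small catalogue of templates with per-workload overrides, sketched below. The catalogue keys, thresholds and periods are invented for illustration and are not the defaults used by Nasstar.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetricTemplate:
    namespace: str
    metric_name: str
    statistic: str
    comparison: str
    threshold: float
    period: int = 300
    evaluation_periods: int = 3


# Hypothetical catalogue; every threshold here is an assumed default.
CATALOGUE = {
    "cpu-high": MetricTemplate(
        "AWS/EC2", "CPUUtilization", "Average", "GreaterThanThreshold", 80.0),
    "memory-high": MetricTemplate(
        "CWAgent", "mem_used_percent", "Average", "GreaterThanThreshold", 85.0),
    "disk-high": MetricTemplate(
        "CWAgent", "disk_used_percent", "Average", "GreaterThanThreshold", 90.0),
}


def select_metrics(picks, overrides=None):
    """Return the chosen templates, applying any per-workload overrides."""
    overrides = overrides or {}
    unknown = sorted(set(picks) - set(CATALOGUE))
    if unknown:
        raise KeyError(f"not in catalogue: {unknown}")
    return {
        key: replace(CATALOGUE[key], **overrides.get(key, {}))
        for key in picks
    }


# A team picks two metrics and tightens one threshold for its workload.
chosen = select_metrics(
    ["cpu-high", "disk-high"],
    overrides={"cpu-high": {"threshold": 70.0}},
)
for key, template in chosen.items():
    print(key, template.threshold)
```

Because the templates are frozen dataclasses, an override produces a new object and the shared defaults stay untouched for the next workload.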

Amazon CloudWatch Application Insights is used on Microsoft SQL Server and Internet Information Services (IIS) workloads to gain deeper monitoring at the application level.

An individual dashboard per workload/environment is then created following guidelines that keep each dashboard similar in content and appearance, allowing support teams to quickly understand a workload’s status without learning a new dashboard. Customer dashboards can also be created to give customers, and any third parties they employ, a transparent view of their environment, reducing ticket requests about system status.
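
Consistency of this kind is easiest to enforce when the dashboard body is generated rather than hand-built. The sketch below lays metric widgets out on the 24-column CloudWatch dashboard grid in a fixed pattern; the workload, region, panel titles and three-column layout are assumptions made for illustration.

```python
import json

GRID_WIDTH = 24  # CloudWatch dashboards are laid out on a 24-column grid


def dashboard_body(workload, environment, region, panels,
                   columns=3, height=6):
    """Return a dashboard body (JSON string) with a uniform layout.

    panels is a list of (title, metric) pairs, where metric is a list such
    as [namespace, metric_name, dimension_name, dimension_value].
    """
    width = GRID_WIDTH // columns
    widgets = [{
        "type": "text",
        "x": 0, "y": 0, "width": GRID_WIDTH, "height": 1,
        "properties": {"markdown": f"# {workload} ({environment})"},
    }]
    for index, (title, metric) in enumerate(panels):
        widgets.append({
            "type": "metric",
            "x": (index % columns) * width,
            "y": 1 + (index // columns) * height,
            "width": width, "height": height,
            "properties": {
                "title": title,
                "metrics": [metric],
                "stat": "Average",
                "period": 300,
                "region": region,
            },
        })
    return json.dumps({"widgets": widgets})


# Hypothetical workload with two panels.
body = dashboard_body(
    "payments", "prod", "eu-west-2",
    [
        ("CPU", ["AWS/EC2", "CPUUtilization",
                 "AutoScalingGroupName", "payments-asg"]),
        ("Memory", ["CWAgent", "mem_used_percent",
                    "AutoScalingGroupName", "payments-asg"]),
    ],
)
print(len(json.loads(body)["widgets"]))
```

The resulting string is the kind of value accepted as DashboardBody by the CloudWatch put_dashboard call, so every workload gets the same header and panel positions.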

This solution is templated and provisioned using AWS CloudFormation to make it a standard cloud monitoring solution that could be quickly deployed across multiple workloads and environments.
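
As a rough sketch of the templating idea, the Python below emits a small CloudFormation template containing one CPU alarm per instance plus a composite alarm over them. The logical IDs, instance IDs and threshold are hypothetical, and generating the template from Python is a choice made for this example rather than a description of how Nasstar produces its templates.

```python
import json


def monitoring_template(workload, instance_ids, cpu_threshold=80):
    """Return a CloudFormation template (as a dict) for CPU monitoring."""
    resources = {}
    child_ids = []
    for number, instance_id in enumerate(instance_ids):
        logical_id = f"CpuAlarm{number}"
        child_ids.append(logical_id)
        resources[logical_id] = {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": f"High CPU on {instance_id}",
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Statistic": "Average",
                "Period": 300,
                "EvaluationPeriods": 3,
                "Threshold": cpu_threshold,
                "ComparisonOperator": "GreaterThanThreshold",
            },
        }
    # Fn::Sub resolves ${LogicalId} to each child alarm's name at deploy time.
    rule = " OR ".join(f'ALARM("${{{cid}}}")' for cid in child_ids)
    resources["CpuComposite"] = {
        "Type": "AWS::CloudWatch::CompositeAlarm",
        "Properties": {
            "AlarmName": f"{workload}-cpu",
            "AlarmRule": {"Fn::Sub": rule},
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Monitoring resources for {workload}",
        "Resources": resources,
    }


template = monitoring_template("payments", ["i-0aaa", "i-0bbb"])
print(json.dumps(template, indent=2))
```

Note that the resource count grows with every instance and metric monitored, so templates for real workloads become long quickly.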

Future Enhancements

The list of possibilities for additional monitoring is endless. However, we are prioritising our future efforts based on customer requests: creating a cross-account dashboard that provides a high-level view of workloads running across multiple AWS accounts, and implementing Synthetics, Container Insights and Lambda Insights.

Nasstar also plans to expand the solution to fulfil the optional MSP requirement covered under control 9.6.3 (Predictive Models).

Benefits

Nasstar observed the following benefits after deploying the monitoring solution.

Improved Observability. The process of following the MSP criteria helped Nasstar strengthen our application and infrastructure observability. In addition to the technical solution, the practice has completely modernised its design process to include application and infrastructure monitoring from the outset of any new engagement. This allows application developers to improve their understanding of critical telemetry that can be used for optimal workload observability, resulting in improved customer satisfaction.
Integration. The solution makes for a simplified CI/CD pipeline as all Infrastructure as Code (IaC) can be deployed using CloudFormation.
Cost reduction. We observed an AWS run cost increase of less than $50 a month on a typical workload with minimal prior monitoring configuration. This compares favourably with workloads that have third-party tooling deployed.
Engagement. With oversight of the dashboard, we found the customer and application team were more engaged with operational excellence, ultimately driving the quality of the managed workload.

Observations

We observed that ingesting large log files rapidly increased the AWS run costs for the solution, so careful consideration of the log files is needed. Ideally, pick smaller, well-targeted, high-value log files.
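
The effect is simple arithmetic, as the sketch below shows. The price per GB and the daily volumes are placeholder assumptions chosen only to show the calculation; check the current CloudWatch Logs pricing for your region before relying on any figure.

```python
def monthly_ingestion_cost(gb_per_day, price_per_gb, days=30):
    """Estimate the monthly log ingestion cost for one log file."""
    if gb_per_day < 0 or price_per_gb < 0:
        raise ValueError("volume and price must be non-negative")
    return gb_per_day * days * price_per_gb


# Placeholder price, NOT a quoted AWS rate.
ASSUMED_PRICE_PER_GB = 0.50

# Hypothetical daily volumes for a targeted log and a verbose one.
candidates = {
    "/var/log/secure": 0.05,
    "/var/log/app/debug.log": 4.0,
}
for path, gb_per_day in candidates.items():
    cost = monthly_ingestion_cost(gb_per_day, ASSUMED_PRICE_PER_GB)
    print(f"{path}: {cost:.2f} per month")
```

Under these assumed numbers, the verbose debug log costs 80 times as much to ingest as the targeted auth log, which is why file selection matters more than any other tuning.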

The Infrastructure as Code (IaC) can also become large and complex: CloudFormation templates easily exceed 10,000 lines just to describe the monitoring resources for a workload.

Composite alarms worked well as roll-ups on a dashboard, abstracting the complexities of deeper monitoring. For example, the CPU metrics for all EC2 instances in an Auto Scaling group can be aggregated behind a single alarm; if issues occur, you can drill into it to identify which instance(s) are in the ALARM state. This makes for a more intuitive user experience.
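
The drill-down step can be sketched as a small filter over the child alarms of a composite alarm. The response below is hand-written sample data shaped like the output of the CloudWatch DescribeAlarms API; the alarm names are hypothetical.

```python
def alarming_children(describe_alarms_response):
    """Return the names of child metric alarms currently in ALARM."""
    return sorted(
        alarm["AlarmName"]
        for alarm in describe_alarms_response.get("MetricAlarms", [])
        if alarm.get("StateValue") == "ALARM"
    )


# Sample data only; a real response would come from describe_alarms.
sample_response = {
    "MetricAlarms": [
        {"AlarmName": "web-cpu-high-a", "StateValue": "OK"},
        {"AlarmName": "web-cpu-high-b", "StateValue": "ALARM"},
        {"AlarmName": "web-cpu-high-c", "StateValue": "INSUFFICIENT_DATA"},
    ],
    "CompositeAlarms": [],
}
print(alarming_children(sample_response))
# prints: ['web-cpu-high-b']
```

With boto3, a response like this could be fetched by calling describe_alarms with the ChildrenOfAlarmName parameter set to the composite alarm's name.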

A standard log group filtering pattern highlighted that one of our internal monitoring tools failed to authenticate to all EC2 instances, so we were quick to react and resolve this, thanks to the intelligence presented by the new solution.

When we shared the solution, our service managers were instantly receptive and provided universal praise. They realised the value and were keen to propose it to their customers, who approved of the solution.

The Nasstar service management team saw an instant return on investment (ROI) through the following use cases:

Through exposing Linux instance auth logs, we discovered that a legacy monitoring tool had not had its password updated, so we could see it effectively brute-forcing and attacking a group of EC2 instances.
By exposing S3 storage capacity, we could quickly tell when a significant amount of data had expired through an incorrectly configured lifecycle management rule.
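
The first use case relies on a metric filter over the auth logs. A minimal sketch of such a filter is shown below: it builds parameters for the CloudWatch Logs PutMetricFilter API and approximates the match locally. The filter name, metric name, namespace, log group and the sample log lines are all invented for illustration.

```python
FILTER_PATTERN = '"Failed password"'  # quoted phrase: match this exact text


def auth_failure_filter_params(log_group_name, namespace):
    """Build PutMetricFilter parameters that count failed SSH logins."""
    return {
        "logGroupName": log_group_name,
        "filterName": "ssh-failed-password",
        "filterPattern": FILTER_PATTERN,
        "metricTransformations": [{
            "metricName": "SshFailedPassword",
            "metricNamespace": namespace,
            "metricValue": "1",   # each matching log event adds 1
            "defaultValue": 0,    # emit 0 when nothing matches
        }],
    }


def count_matches(lines, phrase="Failed password"):
    """Local approximation of the filter: count lines containing the phrase."""
    return sum(1 for line in lines if phrase in line)


# Synthetic sample lines, not real log output.
sample_lines = [
    "Sep 16 10:00:01 host-1 sshd[1201]: Failed password for monitor from 10.0.0.9 port 51234 ssh2",
    "Sep 16 10:00:03 host-1 sshd[1202]: Accepted publickey for admin from 10.0.0.5 port 40022 ssh2",
    "Sep 16 10:00:04 host-1 sshd[1203]: Failed password for monitor from 10.0.0.9 port 51240 ssh2",
]
print(count_matches(sample_lines))
# prints: 2
```

An alarm on the resulting metric then surfaces a steady stream of failures from one source, such as a tool retrying with a stale password.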

Summary

In this post, we highlighted a new requirement for the latest AWS MSP audit, which raised the bar on Next-Generation Monitoring Capabilities. We looked at the art-of-the-possible for current AWS native services and shared observations from the Nasstar AWS practice, which leveraged them to build a standard offering in breakneck time to fulfil the recently updated MSP control.