Meltdown, Spectre and “Cloud Myths”

With the news, impact analysis and resolution of the Meltdown and Spectre fun still ongoing, it’s fair to say there have been a few difficult discussions worldwide about how to resolve the issue. There has also been a fair share of #fakenews making its way around, which could be considered “Cloud Myths”.

Without going into too much detail on Meltdown and Spectre (because there are far more knowledgeable security teams publishing details about that), let’s address some of the Cloud Myths:

“These updates are going to cause me unacceptable downtime”

AWS, Azure, GCP and others have all been completing patching and rolling reboots of their infrastructure (or will do so soon). Depending on your platform, additional patching of your instance-level operating systems may also be required, with additional reboots. Unfortunately, there have been multiple complaints that those reboots mean downtime for users.

As with a similar discussion I had when there was an S3 outage in a single AWS region, I’m putting this firmly in the “failing to architect” camp. If a single hypervisor or instance reboot means that your environment goes down, then you’re not going to have a good time on Public Cloud environments. Architectures need to be designed to reflect the characteristics of Public Cloud, where individual components fail but, through appropriate design, environments don’t. Horizontal scaling, elimination of single points of failure (both at the system level and at the Cloud level with Availability Zones/Regions), automation and cloud native tooling, and potentially multiple Cloud providers all mean that this wouldn’t be an issue for you.
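To make the point concrete, here is a minimal sketch (not any provider's actual reboot procedure) that simulates a rolling reboot across a pool of identical replicas and reports the lowest number left serving traffic at any moment. The function names and the batch model are assumptions made purely for illustration.

```python
def rolling_reboot_min_healthy(replicas: int, batch_size: int = 1) -> int:
    """Return the lowest count of healthy replicas seen during a rolling reboot.

    Hypothetical model: replicas are rebooted in batches, and each batch
    is back in service before the next batch is taken down.
    """
    if replicas < 1 or batch_size < 1:
        raise ValueError("replicas and batch_size must be >= 1")

    healthy = [True] * replicas
    min_healthy = replicas
    for start in range(0, replicas, batch_size):
        batch = range(start, min(start + batch_size, replicas))
        for i in batch:
            healthy[i] = False  # replica goes down for its reboot
        min_healthy = min(min_healthy, sum(healthy))
        for i in batch:
            healthy[i] = True  # replica rejoins before the next batch starts
    return min_healthy


def survives_rolling_reboot(replicas: int, needed_for_load: int, batch_size: int = 1) -> bool:
    """True if enough replicas stay up throughout the reboot to carry the load."""
    return rolling_reboot_min_healthy(replicas, batch_size) >= needed_for_load


if __name__ == "__main__":
    print(survives_rolling_reboot(replicas=1, needed_for_load=1))  # single instance
    print(survives_rolling_reboot(replicas=3, needed_for_load=2))  # N+1 style pool
```

A single instance always drops to zero healthy replicas during its reboot, while a pool sized with one spare keeps serving throughout, which is the whole argument for horizontal scaling in this scenario.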

“I’m going to take a 30% performance hit, which will impact my users”

Performance metrics graph (Credit: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html)

Initial reports of the impact of the necessary fixes/patches cited “a ballpark figure of five to 30 per cent slow down, depending on the task and the processor model.”  People seemed to ignore the second half of that quote, namely the “depending on the task” part.  Without considering your individual workload, it’s impossible to know what the impact will be.  This is also why any remediation should, ideally, be done in a phased approach with monitoring of the impact.  You should be able to track the impact both at the infrastructure level through monitoring and (more importantly) at the user experience level using Application Performance Monitoring, to see whether the reported slowdown actually applies to you.
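As a rough sketch of what monitoring a phased rollout could look like, the snippet below compares 95th percentile response times from an unpatched baseline group against a patched canary group and decides whether to keep going. The sample data, the 10% threshold and the function names are all assumptions for illustration; in practice the samples would come from your own APM tooling.

```python
import statistics


def p95(samples):
    """95th percentile: quantiles(n=20) returns 19 cut points, the last is p95."""
    return statistics.quantiles(samples, n=20)[-1]


def percent_change(baseline, patched):
    """Percentage change in p95 latency from baseline to patched."""
    base = p95(baseline)
    return (p95(patched) - base) / base * 100.0


def should_continue_rollout(baseline, patched, max_regression_pct=10.0):
    """Continue only if the patched group regressed by no more than the threshold."""
    return percent_change(baseline, patched) <= max_regression_pct


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, not real measurements.
    baseline_ms = [100.0 + i for i in range(50)]
    patched_ms = [x * 1.2 for x in baseline_ms]
    print(f"p95 change: {percent_change(baseline_ms, patched_ms):.1f}%")
    print("continue rollout:", should_continue_rollout(baseline_ms, patched_ms))
```

Measuring at the user-facing percentile rather than at CPU utilisation keeps the decision tied to what your users actually experience for your particular workload.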

The second half of this myth is the assumption that a change in resource utilisation will impact your users.  If you’ve designed the application and architecture correctly, your users shouldn’t notice any impact.  Public Cloud gives you the ability to automatically scale your resources and, with planning, to do so without any practical limit.  Your users shouldn’t notice and you won’t need to wake up in the middle of the night to worry about it.  The only difference between this patch and the peaks you’ll see on your application from end user usage is that the CPU patches may permanently increase your baseline resource requirement.  At the end of the day, this is an increase in resource usage your environment should be able to handle.
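The baseline increase is simple arithmetic. As a back-of-the-envelope sketch, assuming the workload scales linearly across identical instances (an assumption that will not hold for every application), the number of instances needed to deliver the same throughput after a given slowdown is:

```python
import math


def instances_needed(current_instances: int, slowdown_pct: float) -> int:
    """Instances required to match pre-patch throughput after a slowdown.

    Assumes linear horizontal scaling: each instance keeps
    (100 - slowdown_pct)% of its original capacity.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be >= 1")
    if not 0 <= slowdown_pct < 100:
        raise ValueError("slowdown_pct must be in the range [0, 100)")
    remaining_capacity = 1 - slowdown_pct / 100
    # The tiny epsilon guards against float noise pushing exact results up by one.
    return math.ceil(current_instances / remaining_capacity - 1e-9)


if __name__ == "__main__":
    print(instances_needed(10, 5))   # 11
    print(instances_needed(10, 30))  # 15
```

Even at the worst-case headline figure, a pool of 10 grows to 15, which is the kind of change an autoscaling group handles routinely during a traffic peak.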

“I’ll lose my 3-year Reserved Instances”

Depending on your Cloud platform and commercial choices, you may have chosen to commit to a particular amount of usage over time for a discounted rate.  This isn’t uncommon and is recommended, but it must be considered pragmatically, across more than just the commercial impact.

A lot changes in 3 years, with your Cloud provider announcing new products/services, pricing or instance sizes, so committing to anything longer than a year to 18 months dramatically increases the likelihood of you being stuck when you could otherwise move.  Yes, the discounts improve for a longer-term commit, but as people are seeing now, if they want to move from one instance to another to reduce the impact of a CPU patch (or change to a newly announced product alternative etc.), they could see a significant commercial impact they weren’t predicting.
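One way to reason about that trade-off is a break-even calculation: how many months do you need to actually use the committed resource before the full commitment costs less than paying on-demand for the same period? The sketch below uses a deliberately simplified model (the whole term is paid for regardless of use, with no resale or conversion options), and every price in it is a made-up figure, not a real provider rate.

```python
import math


def break_even_months(on_demand_monthly: float, reserved_monthly: float, term_months: int) -> int:
    """Months of real use needed before the full commitment beats on-demand.

    Simplified, hypothetical model: the commitment is paid for the whole
    term whether or not the resource is still in use.
    """
    if on_demand_monthly <= 0 or reserved_monthly < 0 or term_months < 1:
        raise ValueError("prices must be positive and term_months >= 1")
    total_commitment = reserved_monthly * term_months
    return math.ceil(total_commitment / on_demand_monthly)


def commit_pays_off(on_demand_monthly, reserved_monthly, term_months, months_used) -> bool:
    """True if using the resource for months_used makes the commitment worthwhile."""
    return months_used >= break_even_months(on_demand_monthly, reserved_monthly, term_months)


if __name__ == "__main__":
    # Hypothetical prices: 100/month on-demand, 60/month on a 36-month commit.
    print(break_even_months(100, 60, 36))     # 22
    print(commit_pays_off(100, 60, 36, 18))   # False: moved off after 18 months
```

With these illustrative numbers, a deep 3-year discount only pays off if you stay on the same instance for nearly two years, which is exactly the flexibility risk described above.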

“I don’t need Reserved Instances”

So, don’t get me wrong here, you may have read the above and sworn off reserved instances/capacity commits, but that too is a myth.  Reservations can be a terrific way to optimise the cost for your static/baseline workload resources, but they also can be useful to get a guarantee of available resources from your Cloud provider.

Even at the scale the “Public Hyper-clouds” operate at, if the “30%” figure were correct, even they would notice the hit of additional capacity requirements arriving basically overnight.  From the statements they’ve made, it appears they’ve not noticed a discernible impact (and this appears to correlate with a quick scan of AWS spot pricing history, for example).  Outside of this event, though, reservations can help reduce the risk of capacity limitations at your Cloud provider during high traffic periods (for example Black Friday sale periods), as you’ll always have the baseline resource availability you’ve committed to.

“It’s my Public Cloud provider’s responsibility to secure my environment”

AWS Security Shared Responsibility (Credit: https://aws.amazon.com/blogs/security/new-whitepaper-aws-cloud-security-best-practices/)

Partially true, partially myth.  Most Public Cloud providers will be responsible for the security of the Cloud, for example securing their Data Centres, Networking infrastructure, Servers etc.  Customers are responsible for security in the Cloud, for example patching operating systems, configuring network security rules etc.

The issue at hand needs to be resolved by both your Public Cloud provider and you.  Cloud providers have patched, or are going to patch, the underlying hardware and equipment, which prevents compute instances on the same hypervisor from accessing data outside their own via this bug.  Every customer still needs to patch and secure their individual instances as well, following the specific guidance of the relevant vendors.
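For the customer side of that responsibility, here is a small sketch of how you might check a Linux instance after patching. It assumes a kernel recent enough to expose per-vulnerability status files under /sys/devices/system/cpu/vulnerabilities (older or non-Linux systems will simply return nothing), and the helper names are my own. Treat it as a quick sanity check alongside your vendor's guidance, not a replacement for it.

```python
from pathlib import Path

# Status directory exposed by patched Linux kernels; absent on older
# kernels and other operating systems (assumption: Linux guest).
DEFAULT_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory=DEFAULT_DIR):
    """Return {vulnerability_name: status_text}, or {} if the directory is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


def still_vulnerable(status):
    """Names whose status text starts with 'Vulnerable', as the kernel reports it."""
    return sorted(name for name, text in status.items() if text.startswith("Vulnerable"))


if __name__ == "__main__":
    report = read_mitigation_status()
    if not report:
        print("No vulnerability status exposed by this kernel.")
    for name, text in report.items():
        print(f"{name}: {text}")
    print("Needs attention:", still_vulnerable(report))
```

Running something like this across your fleet after the patch window gives you evidence that your half of the shared responsibility has actually been completed.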

“Intel/my Public Cloud provider owe me a refund for the lost performance”

It’s an interesting one, but I’m expecting this one to remain a myth.  Cloud providers generally price their services based on the provision of vCPU and/or RAM etc., and there will be no change in the amount of vCPU/RAM etc. they provide.  Yes, your workload may get less performance out of those vCPUs, but they’re still providing you the same resource counts.  In fact, it’d be concerning if they didn’t patch the environment.

In the longer term, we may see a change in pricing which could go either way (cost down to reflect a potential decrease in performance against cost, or up based on the higher resource utilisation of the environment to achieve the same workload, impacting resource commits on hypervisors for the Cloud provider etc.).  Too early to tell on this one yet, but I don’t expect this to directly correlate with any pricing changes.

End of the day learnings

Public Cloud comes with many benefits in technical capabilities, products, services, commercials and scale which power multi-billion companies, but like everything it needs to be used effectively.  Throughout the above, and in the response to the CPU bugs, the multiple benefits of Public Cloud have been glossed over, not only during this event (such as the automatic patching of hundreds of thousands of hypervisors with minimal failures, securing your data from others in the environment) but also in the longer term.

Don’t discount the benefits simply because of ineffective use of the Cloud at this time.  Instead, plan and architect appropriately for these and other scenarios likely in your environment in the future to realise the benefits.
