Leveraging Cloud for Disaster Recovery Services

Did you ever wonder why there is so much excitement around the phrase “DR Service from cloud”? What does that actually mean? What are the expectations?

Well, then you might want to read the article on the subject at Meghvaani, Newsletter of the Cloud Computing Innovation Council of India (CCICI), Volume I, Issue 1.

Let us leverage the cloud in ways that make it efficient at handling IT requirements and fulfilling the business objectives of an enterprise.

It is important to evaluate cloud services before purchasing

Intended Audience: CxOs, Other IT decision makers

Coffee with Steve

It was a Friday afternoon in late 2013 when we reached the CEO’s office of ABC corporation (name changed) to share what we had been championing — Cloud Evaluation as a Service, and to learn from their experience, as they had recently started using the cloud to increase the capacity of their data center.

Steve (name changed) is the Group CEO of ABC corporation, which operates across many verticals. They had recently signed a big contract to deliver a year-long corporate training program, which required a significantly expanded data center to deliver the program efficiently. Instead of adding new physical capacity to their existing in-premise data center, Steve decided to use public cloud services from a major vendor.

Alex, an IT manager on Steve’s team, had prepared a detailed proposal to help Steve decide on a public cloud vendor. The proposal was based on several important requirements to be considered while selecting a public Cloud Service Provider (CSP). Let’s look at some of those requirements.

As many training modules were developed in .NET, they initially thought that one of the major cloud vendors supporting .NET would be the natural home for them. After some research they discovered that cloud services from this vendor were still maturing (remember, this was December 2013); moreover, this vendor’s cloud support could not respond convincingly to many of their queries. This prompted Alex to consider another major CSP with a huge market share and to understand the services it offered. This second vendor, being one of the leaders in cloud service offerings, had solutions for all of ABC corporation’s technical requirements. The overall cost of computing resources also looked reasonable to Alex. Finally, they considered two other vendors, reviewed their capabilities and attempted to match them to their requirements and business priorities before making a final purchasing decision.

Based on the internal review and the proposal from Alex, Steve decided to start their journey into the cloud with one of the biggest CSPs in the business. What Steve didn’t realize is that while Alex did a reasonable job comparing CSPs, he also relied on an old adage that says ‘no one gets fired for recommending IBM’s servers’. Alex adapted that adage to the cloud: ‘no one will get fired for recommending one of the leaders of Gartner’s Magic Quadrant for public cloud services’. ABC corporation expanded its capability through the cloud; delivery of training modules started for the major new contract, and Alex’s recommendation and efforts were very much appreciated.

Then came a bolt from the blue! The monthly bill for compute resources, which had been estimated at around $1500, was almost touching $2000 — 33% more than expected! This was a major crisis considering upcoming expansions, and Steve started wondering whether he had done the right thing by relying only on Alex’s internal recommendations while selecting the CSP.

Steve and Alex got into a detailed analysis of the billed items to see what had caused the surge and how they could manage these costs better.

Lessons Learnt

  1. Their analysis showed that the unplanned excess in billing was caused largely by their applications exceeding the allotted IOPS (I/O operations per second) thresholds. Their chosen CSP billed its customers for IOPS, even though this is something that is generally not in a customer’s control. Even plain vanilla instances with just the OS installed on them can generate significantly high IOPS (possibly triggered by OS updates, etc.), thereby causing billing surges.
  2. Since the in-house IT team did not understand the cloud billing model well, they did not shut down the ‘not-in-use’ instances, incorrectly assuming that ‘not-in-use’ also implies ‘not-to-be-billed’ (a minimal sketch of the kind of guardrail that could have helped follows this list).
  3. Again, due to lack of experience with cloud storage, they were using the most expensive form of online storage available when most of their needs could have been met by less expensive archival storage.
  4. Steve concluded that he should have used a professional, unbiased evaluation agency to help him select the best CSP for his requirements – one that felt no need to stick to any adage!
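
To make the second lesson concrete, here is a minimal sketch of the kind of nightly guardrail that could have reduced the surprise: a small script that stops instances tagged as not-in-use so that they stop accruing compute charges. It assumes an AWS-style environment and the boto3 SDK purely for illustration; the region and the tag key/value (“environment” = “not-in-use”) are hypothetical and would have to match the team’s own conventions.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

def stop_idle_instances():
    """Stop running instances that carry the (hypothetical) not-in-use tag."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["not-in-use"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        # A stopped instance no longer accrues compute charges,
        # although attached storage volumes are still billed.
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

if __name__ == "__main__":
    print("Stopped instances:", stop_idle_instances())

Scheduled via cron outside working hours, a guardrail of this sort – combined with billing alerts on items such as provisioned IOPS – would have surfaced both problems well before the monthly invoice arrived.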


An SOS from Adrian

It was a Sunday morning in the summer of 2014. I got an intriguing call from Adrian, the CEO of a successful enterprise. “How can you say that these cloud services are elastic? And here, we have lost two days of data!” — Adrian wanted an explanation for something I had no context for.

After calming him down and talking to him for a few minutes, I got the complete picture. Adrian was very upset that business-critical data captured by the thousands of sensors across different geographical regions of his business was not being recorded consistently by his application servers. This data was absolutely critical for his business analytics service to be useful to his customers.

The main issue was that their instances were exhausting the allotted quota on their Solid State Drive (SSD) storage, and the data received on Friday and Saturday evenings could not be saved. In fact, one particular week, the SSD limit was reached on Friday evening itself, and the issue was discovered only on Saturday night when the automated weekly summary of activities showed several error messages!

On top of that, when they contacted their cloud vendor for additional SSD storage space, they were told to move to a new instance with more SSD space in order to get contiguous space; the process took extra time, making their downtime even longer.

They had decided to go with this vendor based on the suggestion of their IT Manager, Samuel, who had recommended this particular vendor ONLY for its very simple and easy-to-understand pricing model. However, ignoring many other factors that are critical to cloud purchasing decisions was now causing them to run into unexpected issues.

Lessons Learnt

  1. Only one factor was considered before selecting the cloud vendor, when other important factors like supportability, elasticity and ease of use should also have been considered.
  2. Notifications and messaging were not set up correctly for exceptional situations (a minimal sketch of such a check follows this list).
  3. They had incorrectly assumed that sufficient resources had been provisioned for the development and production environments, and both were deployed within the same cloud configuration.
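
As a concrete illustration of the second lesson, the sketch below shows the kind of simple storage check that, run every few minutes from cron, could have raised an alert on Friday evening instead of the problem surfacing in a weekly summary on Saturday night. It is a vendor-neutral Python example using only the standard library; the mount point, the 80% threshold and the webhook endpoint are placeholders, not Adrian’s actual configuration.

import json
import shutil
import urllib.request

DATA_MOUNT = "/data"                             # volume receiving the sensor data
ALERT_THRESHOLD = 0.80                           # alert once 80% of the space is used
WEBHOOK_URL = "https://alerts.example.com/hook"  # placeholder notification endpoint

def check_disk_and_alert():
    """Check how full the data volume is and notify the team if it crosses the threshold."""
    usage = shutil.disk_usage(DATA_MOUNT)
    used_fraction = (usage.total - usage.free) / usage.total
    if used_fraction >= ALERT_THRESHOLD:
        payload = json.dumps({
            "text": f"{DATA_MOUNT} is {used_fraction:.0%} full; "
                    "provision more storage before incoming data is dropped."
        }).encode("utf-8")
        request = urllib.request.Request(
            WEBHOOK_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=10)
    return used_fraction

if __name__ == "__main__":
    print(f"Data volume usage: {check_disk_and_alert():.0%}")

A check of this sort is also a natural place to separate environments: development and production each get their own volume, threshold and notification channel, so an alert points unambiguously to the deployment that is running out of space.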

Summary

While most CxOs understand and follow extensive processes – including evaluation – for, say, hardware procurement, the required due diligence is often not done for cloud resource procurement – perhaps because of the belief that it would be easy to switch to another cloud if one does not work out.

In the case of Steve, they should have understood every aspect of the vendor’s billing process instead of being surprised after already committing their investments to that vendor. They should have realized that correcting cloud purchasing decisions can be expensive and that it is better to get a professional evaluation done to minimize risk.

In the case of Adrian, some of their basic assumptions cost them dearly. The amount of storage required is a function of the rate at which data is captured and the number of days that data must be kept in online storage; it cannot be left to assumptions. In this case a cloud consultant, a cloud expert, or even a solution architect from the cloud service provider could have helped scope the requirements better.

Corollary

  • Steve decided to migrate to another public cloud service provider to rein in operating cost and moved out of their first “cloud home” within one year.
  • Adrian separated out their deployment environment and enabled important monitoring and alert features from day one in the new deployment platform.


Notes:
These are real case studies. The names of the persons and the companies have been altered at their request. 

 

Demystifying Hybrid Cloud – Part 1

Intended Audience: Senior Technical Staff, Managers, CxOs

[1] What are hybrid clouds? Why do they exist?

Hybrid clouds are a set of technologies that are being widely adopted as a bridge between existing in-premise data centers and the fast-growing public cloud ecosystem. Companies and large enterprises with massive investments in in-premise data centers are using hybrid cloud technologies to keep deriving value from their existing investments as well as to benefit from the efficiencies that public clouds offer. It has also become clear that while public cloud adoption continues to increase rapidly, public clouds cannot entirely replace existing, proven in-house systems and processes within a short period of time. In such an environment, hybrid clouds provide a convenient mechanism for enterprises to try out public clouds and adopt them in a phased manner if found fit for their application environments.

CTOs and CIOs in large enterprises are continuously looking at ways to increase process efficiency, improve the scalability of their applications and IT resources, and reduce IT capital expenditure for short- to mid-term requirements. They can see that public cloud vendors are fast building all the technology and operational capabilities they need for such requirements. At the same time, they are also concerned about moving their entire application environment to a public cloud, since it ‘seems’ to be outside their immediate areas of control and management. Some of them also believe that if there is a serious problem, they may not have the ability to resolve it since the infrastructure is now ‘outside’ their environment.

Since there is some element of truth to these concerns, hybrid clouds provide an excellent way to integrate public cloud infrastructure into existing environments so that some of these concerns can be mitigated to a large extent. Hybrid clouds enable CxOs to retain control over their existing environments in a way they are familiar with, while at the same time letting them take advantage of the significant benefits that public cloud based infrastructure provides.

In a nutshell, instead of creating disconnected silos of computing infrastructure, the CxOs are using hybrid cloud bridges to seamlessly integrate their in-house data center with the public cloud infrastructure. Public cloud vendors are also providing excellent capabilities to enable such an integration. We will take a look at some of these capabilities in Part 2 of this article. 

[2] A few applications of hybrid cloud technologies

For us to appreciate the applications and usefulness of hybrid cloud strategy, let us review a few known computing challenges and see how these challenges are being handled very well in a hybrid cloud environment.

A common pattern seen in many hybrid cloud applications is that the application workloads start by consuming in-house data center resources (typically compute); when the limits of those computing resources are reached (no free virtual machines, or jobs/requests piling up in the queue), on-demand instances in a pre-configured public cloud are created if there is a perceived loss of business. Perceived loss of business for an online e-commerce store could be losing customers to its competitors; for a product company it could be missing a deadline and thus ceding the market to competitors.

Once the on-demand resources have been created and set up in the cloud, they simply look like an extension of the existing in-premise resources, and computing jobs/requests can be sent there to be serviced.
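
The sketch below illustrates this burst-on-demand pattern in simplified Python. It is not any particular vendor’s API or scheduler: LocalFarm and PublicCloud are hypothetical stand-ins, and the 60-minute queue threshold is an arbitrary proxy for whatever “perceived loss of business” means for a given enterprise.

from dataclasses import dataclass

MAX_QUEUE_WAIT_MINUTES = 60  # beyond this wait, assume a perceived loss of business

@dataclass
class LocalFarm:
    """Simplified stand-in for the in-premise data center."""
    free_slots: int = 2
    queued_minutes: int = 0  # total runtime already waiting in the local queue

    def run_or_queue(self, minutes: int) -> str:
        if self.free_slots > 0:
            self.free_slots -= 1
            return "running in-premise"
        self.queued_minutes += minutes
        return "queued in-premise"

@dataclass
class PublicCloud:
    """Simplified stand-in for a pre-configured public cloud account."""
    instances_started: int = 0

    def burst(self) -> str:
        # In practice this would create an on-demand instance from a
        # pre-configured image so that it behaves like an extension of the farm.
        self.instances_started += 1
        return "running on an on-demand cloud instance"

def dispatch(minutes: int, farm: LocalFarm, cloud: PublicCloud) -> str:
    """Prefer in-premise capacity; burst to the cloud only when waiting would hurt the business."""
    if farm.free_slots > 0:
        return farm.run_or_queue(minutes)
    if farm.queued_minutes + minutes > MAX_QUEUE_WAIT_MINUTES:
        return cloud.burst()
    return farm.run_or_queue(minutes)

if __name__ == "__main__":
    farm, cloud = LocalFarm(), PublicCloud()
    for job_minutes in (30, 30, 20, 45):  # four jobs arriving back-to-back
        print(dispatch(job_minutes, farm, cloud))

With that pattern in mind, let’s look at some real scenarios where hybrid clouds have proven their usefulness.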

  • Running long regression tests before code check-in
    (Cloud Scenario: Test environment, Dev-ops)
    In a typical development cycle of an enterprise-class product, 80% of transactions land (on the code base) in the last few weeks! All developers submit a long list of test cases to the test farm (a cluster of powerful servers on which requested tests are scheduled to run) and wait for the test results, hoping to see a “clean test run” so they can merge their changes into the code base. In many instances, the first run is not “clean”, requiring developers to fix the issue(s) and rerun all required test cases. As a result we often see long queues to get the test run completed, which may result in the developer missing the dreaded “code freeze” deadline!

    A hybrid cloud based test environment is best suited to manage such peak demands that occur with regular frequency, but last for only a short period of time.  If a hybrid environment has been pre-configured, tests can be scheduled on the new short-term instances created on-demand in a public cloud infrastructure — thus significantly reducing the waiting time for developers to complete their test cycles. 
  • Handling seasonal spikes on an e-commerce site
    (Cloud Scenario: Retail application hosting, Online store)
    Several e-commerce stores record almost 80% of their annual turnover around the holiday season or during major festivals and sales. Again, since the demand peaks occur frequently but not continuously, hybrid clouds provide the best mechanism to handle them without investing heavily in infrastructure that lies unused during off-peak periods. With close monitoring of the demand in real time, and the consequent increase in wait time for customers, the e-commerce store can decide to serve a set of the items on sale from newly set-up on-demand instances on a public cloud, thereby distributing the load more efficiently and resulting in happier customers who can complete their purchases faster.
  • Handling unprecedented load during flash sale of airline tickets
    (Cloud Scenario: Online ticket booking)
    To boost the top line and to “acquire” a new set of loyal customers, most budget airlines offer flash sales for a short period. Many times this kind of sale by one airline triggers announcements of sales by competitors — and in some instances, even the market leaders cannot ignore a sale announced by a new entrant! Any airline that introduces such sales expects to see a heavy customer load in a very short period of time, and it is vital for them to be ready to handle the sudden, short-term jump in potential customers. Such flash sales can result in unpredictable loads, and no amount of planning may be enough to ensure enough compute and network resources when needed. Indeed, we have seen instances where flash sale sites become inaccessible to a large customer base, thereby defeating the very purpose of such a sale. In order to avoid potential loss of customers, goodwill and real business, travel companies can adopt a hybrid model where additional resources can be quickly set up to handle peak loads during flash sales and scaled down again when not needed. This allows them to grow their business without incurring a huge cost on compute infrastructure that gets used only intermittently.
  • Providing additional computing power to track a Tsunami
    (Cloud Scenario: Virtual Data Center, Big Data & Analytics)
    While the Met department of the government has very powerful in-house servers, during tsunamis, cyclones, hurricanes, etc. it requires additional computing power to run and re-run its weather modelling algorithms to determine the path of the impact – which can change many times in a few minutes – so affected people can be warned and evacuated. Also, after the impact, people across the world want to access the main site for information about the event, causing the main site to become almost non-responsive because of the unpredictable load. Redistributing this load to a set of public cloud resources using hybrid technologies is a good way to manage modelling as well as user loads during major weather events.

In the next part of this article, we shall highlight some of the hybrid cloud solutions and technologies that are in common use and also review some of the pitfalls to avoid while setting up a hybrid cloud environment.

It makes sense to relook at corporate cloud strategy from time to time!

Intended Audience: CxOs and other IT decision makers

We have seen recent news items where an established cloud-based business switches from one of the leading cloud vendors to one that is not in the top 2-3 of the leader board. This kind of news often creates a lot of noise and confusion that eclipses the opportunity to look at such developments pragmatically and learn from them.

When Hostnet Brazil, a cloud hosting and solution provider, made it public that they had decided to switch over to the infrastructure provided by CloudFlare by the end of 2015 for their entire network in Brazil, many in the industry were surprised and everyone was eager to know the prime reason. Hostnet found that with CloudFlare, web pages of their customers in Brazil loaded, on average, almost 2x faster! (Read more)

In the case of Marks & Spencer, Britain’s largest clothing retailer, the company decided in early February 2014 to create a new platform for its websites. Marks & Spencer’s websites until then had been running on an Amazon AWS-provided platform since 2007. The move drew attention because M&S decided to switch out of the AWS public cloud and set up its own private cloud platform after 7 years of association with AWS. M&S said in its statement that the move was in line with its new “multi-channel” retailer strategy and that it would complement its e-commerce initiatives as a means to reverse the decline in its market share. (Read more)

What becomes obvious from the above cases is that enterprises using public cloud services need not remain forever with the cloud service provider they selected in the first place. In fact, it is important for every business to review its cloud providers at regular intervals and plan to switch if it finds a significantly better vendor offering better services at better costs.

This is not a new concept since companies are used to doing hardware refreshes for their in-premise data centers regularly. There is no reason why the same process should not be applied to resources being used from the cloud.

Clearly, changing a cloud service provider for valid reasons is not a bad thing and, in fact, may be good for business. Let us try to identify some early triggers (other than just elapsed time) when an active re-assessment of the current cloud services may be required:

  1. Changes in business requirements and/or priorities and the inability of the current cloud service provider to fulfill the requirement(s) within a given budget.
  2. Availability of new technology; new/distinctive capabilities that a cloud service provider offers (e.g. Speed – in the case of CloudFlare).
  3. Non availability of a new platform, for a new solution, in the existing offerings of the current Cloud Service Provider (as in the case of Marks & Spencer).
  4. New business/service expansion strategy (as in the case of Marks & Spencer).
  5. Changes in Government policy and regulations, e.g. sensitive data to reside within in-country data centers.
  6. A business deal resulting in reduction in operating cost, or a partnership deal to utilize services from each other’s offerings.
  7. Merger and acquisitions requiring better utilization of the combined cloud resources.
  8. Withdrawal of services, changes in key engagement terms by the current provider, or deterioration of service quality.

Finally, an ongoing review of the cloud strategy is, in fact, a healthy sign that the enterprise is continuously measuring its operational efficiencies and cloud ROI, and taking corrective action as needed.