Hyperscale Hardware: Components for the AI Cloud
Paul Sagawa / Artur Pylak
203.901.1633 /.1634
psagawa@ / email@example.com
SEE LAST PAGE OF THIS REPORT FOR IMPORTANT DISCLOSURES
June 8, 2016
As the TMT world moves resolutely into the AI/cloud era, investment in hyperscale data centers is surging, while spending on private enterprise data centers has begun an inevitable decline. Hyperscale architecture differs from the established enterprise approach in meaningful ways, with considerable implications for technology suppliers. The big cloud operators (e.g. GOOGL, AMZN, MSFT, and FB) typically buy standard parts in bulk for their self-designed modular rack systems, while enterprises rely on configured systems sold by OEMs, which often employ more feature-rich or custom components. With the demands of massive cloud applications and cutting-edge AI, hyperscale players also have some unique technology needs – e.g. very high speed optical interfaces, open switch fabrics, and AI-friendly graphics processors. Suppliers specializing in these areas (e.g. MLNX, ANET, IPHI, LITE, CIEN, MSCC, NVDA) without significant exposure to the traditional data center market could prosper. Other suppliers, generally systems vendors and the makers of costly ASICs designed specifically for their products, will come under increasing pressure. Component vendors that straddle the two markets – e.g. INTC, STX and WDC – are generally feeling the pain of declining spend by their enterprise OEM and device customers more acutely than the benefits of rising hyperscale investment. We believe that an inflection point may be more than a few quarters away.
- Huge growth in hyperscale data center spending. Led by the big four US cloud platforms (AMZN, GOOGL, MSFT and FB), spending on very large capacity, distributed (Hyperscale) data center infrastructure is growing at a 40.2% annual pace. These investments support the huge growth of consumer apps, the exodus of enterprise IT into the public cloud, and the rise of deep learning AI. At the same time, we anticipate spending on private enterprise data center capacity will decline at a -7.7% CAGR, with weak demand for PCs, premium smartphones and other devices a further burden on many component suppliers. The net impact of this generational paradigm shift is likely to be highly deflationary for overall hardware spending.
- Hyperscale commoditizes hardware. Cloud operators eschew the high-margin integrated systems historically favored by enterprise IT, choosing instead to buy mostly standard, off-the-shelf components to be installed by contract manufacturers onto bare-bones modules of their own design. This has already been damaging for the makers of servers, storage systems, and networking gear, who now must compete with inexpensive hyperscale-inspired “white box” alternatives in the private data center market as well. We estimate that 36% of all server capacity sold is now hyperscale or white box. Networking switches are a few years further behind, but on the same path, with white box up to 14% of industry sales.
- High performance networking opportunities. The sheer size of networked hyperscale data centers, along with the AI ambitions of the companies that operate them, creates new opportunities for capabilities unnecessary for the enterprise market. Where private data centers use 10Gbps connections, hyperscale facilities use 40-100Gbps, with strong demand for high performance server interfaces, switches, and optical data center interconnect solutions. We see component vendors MLNX, IPHI, LITE, AMCC and MSCC as well positioned against these opportunities. Generally, hyperscale operators eschew configured systems, preferring bare-bones “white box” solutions built from open standard parts to run their own proprietary networking software, although very high performance switches and optical transmission gear from vendors like ANET, CIEN, NOK, INFN and others are also in demand.
- Deep learning needs GPUs. The architecture of a CPU is designed to queue up multiple different computing tasks to be performed in sequence, following software instructions at each step. In contrast, a GPU is designed to perform the same computing task repeatedly, across many data elements in parallel, as fast as possible. The highly iterative nature of machine learning is much better suited to GPUs than CPUs, prompting the biggest hyperscale operators to accelerate spending on them. GOOGL has designed its own ASIC solution, the Tensor Processing Unit (TPU), but AMZN, MSFT and FB have all opted for an AI-optimized GPU from NVDA. MSFT is also deploying FPGAs from INTC’s ALTR for machine vision AI, favoring them over GPUs in this application for their superior power efficiency.
- Power efficiency. Electricity costs can be 25% or more of the total operating budget of a hyperscale data center. As such, cloud operators place substantial emphasis on power efficiency – note MSFT’s interest in FPGAs for deep learning despite their performance disadvantage. The major cloud operators are also monitoring the progress of ARM-based CPUs from AMCC, QCOM and others, which could show significant power efficiency gains vs. x86 processors, although we believe any meaningful shift in total spending is still a few years away. This efficiency focus is also apparent in the design of the power systems themselves – GOOGL is leading a move to all-DC power components to avoid wasteful AC/DC conversions, a serious negative for makers of analog power components and systems.
- The good with the bad. Most data center component players sell to both hyperscale operators and system OEMs, some with added exposure to the deteriorating markets for PCs, tablets and premium smartphones. Moreover, the parts used in hyperscale infrastructure are typically simpler, cheaper and lower margin than those sold into enterprise systems. For CPU maker INTC and disk drive vendors STX and WDC, growth in demand from the cloud may not be sufficient to offset the erosion of their traditional markets. While we had been optimistic that disk drives and server CPUs might find their bottom, we now believe that the inflection point is likely still several quarters ahead.
- Tough times for most systems vendors. While some cloud operators – notably AAPL and a few others – still rely on traditional systems companies to design and deploy their data centers, the momentum is overwhelmingly toward bare-bones “white box” solutions, often manufactured directly to the customer’s own specifications. This is generally bad news for companies like HPE, DELL, EMC, CSCO, JNPR, IBM and others. However, the need for very high performance, particularly for networking between facilities, leaves room for vendors like ANET, CIEN, NOK, INFN, and others to compete successfully.
SSR TMT Heatmap
Clouds Are Building
Hyperscale data centers are cheaper, faster, more secure, more available, and more flexible to changing needs than private enterprise data centers. They also offer computing and storage resources far, far beyond what is practical in a personal device, enabling new application categories that have already established smartphones as indispensable for most consumers living above the poverty line. We have also noted that the rise of these extraordinary data processing platforms is fundamental to the emergence of the powerful deep learning artificial intelligence (http://www.ssrllc.com/publication/a-deep-learning-primer-the-reality-may-exceed-the-hype/) capabilities that we expect to affect the economy as profoundly as the World Wide Web did two decades ago.
Given these advantages, public cloud hosting by hyperscale operators – i.e. AMZN, MSFT and GOOGL – is taking significant workloads from private data centers, while consumer application franchises (FB, NFLX, GOOGL, Snapchat, etc.) based on the same architecture continue their torrid growth as well. Meanwhile, enterprise investment in data center hardware is flagging, as IT departments decide what and when they can migrate to SaaS applications and IaaS hosting. We believe that these trends will accelerate with time, moving more quickly than most forecasts, with the impact of the shift to the cloud painfully deflationary for hardware suppliers.
We divide data center hardware into three categories: 1. Products that are sold primarily to traditional enterprises, such as configured servers, RAID storage, routers and software-enhanced switches, firewall appliances, other system-level solutions, and AC/DC power converters; 2. Components and systems sold into both hyperscale and private markets – e.g. x86 server chips, disk drives, flash memory, etc.; and 3. Components and systems purchased primarily by hyperscale operators to accommodate their leading-edge scale, speed, cost or AI requirements, for example, high speed interfaces, high performance optical interconnect systems, open switch fabrics, AI-optimized GPUs, etc.
Companies competing in the first category, comprising most traditional IT hardware names like Dell, HPE, IBM, CSCO, EMC, NTAP, CHKP, and others, face deteriorating demand for their products with little chance of competing successfully for hyperscale business. Companies in the second category, straddling the two markets like INTC, STX, WDC, EMR’s Vertiv business, and others, will suffer from the deflationary impact of the paradigm shift to the cloud, as it will take years before hyperscale growth fully offsets the impact of declining enterprise spending. Finally, companies in the third category are few but promising: names like MLNX, NVDA, IPHI, LITE, AMCC, MSCC, ANET, CIEN, and INFN stand out.
We are somewhat less enthusiastic about opportunities in solid state storage arrays. Hyperscale operators use flash storage for very low latency applications, but the all-in costs vs. disk drives remain far apart for the large majority of use cases. Moreover, as solid state becomes more cost effective over time, we would expect self-designed arrays using commodity chips to crowd out value-added solutions from the likes of SNDK (soon to be acquired by WDC), NMBL and PSTG. Similarly, we are bearish on the near-term likelihood of a meaningful shift to ARM-based processing. The benefits of a change are still modest next to the switching costs.
Hyperscale Spending on Hyperscale Data Centers
Following Google’s lead – and the blueprint provided by its huge technical contributions to the open source community – most of the internet now runs on hyperscale data center architecture. The hallmarks of hyperscale data centers are commodity hardware – processors, disk drives, and memory – installed onto barebones, and usually self-designed, server boards that fit into interchangeable slots on interchangeable racks. The data centers, run without power-hungry cooling systems, are massive, containing thousands of these modular racks. The racks are linked by very high speed networking interfaces connected to open standard switching fabrics, again, typically installed onto barebones “white box” switches. The hyperscale data center facilities are interconnected via dedicated optical fiber links fitted with very high speed transmission equipment.
All of it is managed by proprietary systems software that is able to parse large computing tasks into many small ones that can be performed in parallel across those many thousands of server cores, even across multiple data center sites, and store data onto those many thousands of disk drives. In this way, applications requiring massive, unstructured databases, such as indexing the entire web or archiving photos for more than a billion users, can be delivered with exceptional performance. This is the core concept behind “hyperscale” and it differs dramatically from traditional enterprise data centers, most built on the clustered x86 server and structured database paradigm that came to prominence in the 1990s.
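The split-and-merge idea described above can be illustrated with a minimal sketch. It is not any operator's actual systems software: the function names and sample documents are hypothetical, and a local thread pool stands in for the thousands of server cores a real deployment would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(documents, shard_count):
    """Deal documents round-robin into shard_count smaller work units."""
    return [documents[i::shard_count] for i in range(shard_count)]


def count_words(shard):
    """Map step: count words within one shard, independently of the others."""
    counts = Counter()
    for document in shard:
        counts.update(document.lower().split())
    return counts


def parallel_word_count(documents, shard_count=4):
    """Split the job, run the pieces concurrently, then merge the partial results."""
    shards = split_into_shards(documents, shard_count)
    # Assumption: worker threads stand in for separate servers.
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partial_counts = list(pool.map(count_words, shards))
    total = Counter()
    for partial in partial_counts:  # Reduce step: combine per-shard counts.
        total.update(partial)
    return total


# Hypothetical sample input, for illustration only.
documents = [
    "hyperscale data centers use commodity hardware",
    "commodity hardware lowers data center cost",
    "deep learning needs massive data",
]
totals = parallel_word_count(documents, shard_count=2)
print(totals["data"], totals["commodity"])
```

Because each shard is counted without reference to the others, adding workers shortens the job without changing the result, which is the property that lets commodity hardware substitute for a single large machine.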
Hyperscale data centers are dramatically lower in cost than traditional enterprise data centers – we detailed the operating differences at length in previous research (http://www.ssrllc.com/publication/infrastructure-as-a-service-the-race-wont-go-all-the-way-to-the-bottom/) – and commercial IaaS hosting platforms offer substantial and growing advantages vs. privately owned facilities, which we codify into seven factors: 1) Use of commodity components vs. value-added configured systems; 2) Minimal non-productive costs; 3) Much higher utilization; 4) Superior system availability and recovery; 5) Very low personnel costs; 6) Flexibility, scalability, power and convenience; 7) Substantial economies of scale (Exhibits 1-2). These factors enable as much as 90% lower all-in costs relative to the typical enterprise data center based on virtualized client-server architecture, and as much as a 50% advantage over less sophisticated operations applying Google’s paradigm on a smaller scale.
They also offer superior performance on a range of dimensions, including application scalability, available processing power, reliability, system availability, security, and support, enabling applications, such as deep learning-based image recognition and natural language processing, that had been almost impossible in the previous architecture. We estimate that revenues for IaaS providers, led by Amazon Web Services and Microsoft Azure, will grow at a 40.2% CAGR through the end of the decade, to more than $100B in global sales (Exhibit 3).
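As a rough check on the growth arithmetic, the sketch below backs out the starting revenue implied by a constant 40.2% CAGR. The 2014 base year and 2020 end year are taken from the Exhibit 3 title, and the end value of exactly $100B is an assumption made only for illustration, since the text says "more than $100B". The function names are hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate that takes start_value to end_value over `years`."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def implied_start(end_value, rate, years):
    """Starting value implied by an end value and a constant annual growth rate."""
    return end_value / (1.0 + rate) ** years


# Assumptions for illustration only: a 2014 base year, a 2020 end year,
# and an end value of exactly $100B (the text says "more than $100B").
base_2014 = implied_start(end_value=100.0, rate=0.402, years=6)
print(f"Implied 2014 base: ${base_2014:.1f}B")
print(f"Round-trip CAGR: {cagr(base_2014, 100.0, 6):.1%}")
```

Under these assumptions the implied 2014 base is roughly $13B, so six years of compounding at this rate multiplies revenue about 7.6 times.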
IaaS will be a huge market, but hyperscale data centers are also the backbone behind the dominant consumer internet franchises. Alphabet, running Google Search, YouTube and five other billion plus user services from its global network of data centers, has the largest and most sophisticated computing infrastructure on the planet. Facebook, itself supporting multiple services with more than a billion users, has grown its PP&E nearly 10x over the past 5 years. Amazon supports its own e-commerce and Prime media services in addition to its dominant IaaS operation. Likewise, Microsoft delivers Office 365, Bing, Xbox Live, Skype, Outlook and other applications, in addition to Azure. Collectively, we expect these four companies to increase their annual capital spending by $35B by 2020, with an ever larger percentage of spend going toward technology rather than real estate, driving a 20% CAGR in overall hyperscale data center spending (Exhibits 4-6).
Exh 1: Basic On-Premise versus Cloud Cost Comparison
Exh 2: The 7 Advantages of Cloud Infrastructure
Exh 3: Worldwide Cloud Infrastructure Services Forecast, 2014-2020 CAGR: 40.2%
Exh 4: Hyperscalers’ Net Plant Property and Equipment, 2010-2015
Exh 5: Forecast Capex Spending of the Big 4 Hyperscalers, 2012-2020
Exh 6: 2015 Technology versus Facilities Capex for Hyperscalers + Apple
Companies with Enterprise Exposure
Hyperscale data centers are dramatically cheaper to build and operate at much higher capacity utilization than the private enterprise facilities that they are replacing. As a result, we expect the shift to the cloud to be highly deflationary for overall hardware spending. Enterprise spending on servers, storage systems and networking has been decelerating for several years, and we believe forecasts for linear declines in market demand are likely overly optimistic. We forecast enterprise spending on data center hardware to decline at an accelerating pace, averaging a -4.1% CAGR through 2020, with the following five year period likely to be considerably worse (Exhibit 7).
This is obviously bad for the vendors of data center products well suited to the traditional enterprise market, but ill-suited to the needs of hyperscale operators and their preference for barebones hardware solutions. This includes almost all configured systems, including servers, storage, routers, switches, firewalls and other security appliances, and AC/DC power supplies. Hyperscale operators are moving to standardize on 48V DC operations, which would eliminate most on-site power conversions (Exhibit 8). In addition, Google has pioneered installing batteries on each server blade, thus eliminating the need for dedicated uninterruptible power supplies. Sales in all of these categories have already suffered, and we believe hopes for a gentle descent are likely misplaced. Companies that have significant exposure to those product markets, such as Hewlett Packard Enterprise, Dell, EMC, NetApp, IBM, Cisco, Juniper, F5, Check Point, Emerson’s Vertiv spin-out, and others, are likely to see top line disappointment in those lines of business, with little hope of reprieve.
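The reason fewer conversion steps matter is that losses compound: the power delivered is the product of every stage's efficiency. The sketch below shows only that arithmetic. The stage names and efficiency figures are hypothetical placeholders, not measurements and not the contents of Exhibit 8.

```python
from math import prod


def chain_efficiency(stage_efficiencies):
    """Fraction of input power that survives a series of conversion stages."""
    efficiencies = list(stage_efficiencies)
    for efficiency in efficiencies:
        if not 0.0 < efficiency <= 1.0:
            raise ValueError("each stage efficiency must be in (0, 1]")
    return prod(efficiencies)


# Hypothetical per-stage efficiencies, chosen only to show the arithmetic.
dedicated_ups_chain = {
    "UPS AC-to-DC": 0.95,
    "UPS DC-to-AC": 0.95,
    "server supply AC-to-DC": 0.92,
    "board-level DC-to-DC": 0.95,
}
local_battery_chain = {
    "rack rectifier AC-to-48V DC": 0.96,
    "board-level DC-to-DC": 0.95,
}

ups_total = chain_efficiency(dedicated_ups_chain.values())
local_total = chain_efficiency(local_battery_chain.values())
print(f"Dedicated UPS chain: {ups_total:.1%} of input power delivered")
print(f"Local battery chain: {local_total:.1%} of input power delivered")
```

Whatever the real figures are, removing a stage can only raise the product, and with electricity a large share of operating cost, that is the motivation for cutting conversion steps.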
Exh 7: Data Center Hardware Spending Forecasts, 2014-2020
Exh 8: Data Center Power Conversion Steps, Dedicated UPS versus Local Battery
Companies on Both Sides
While hyperscale operators do not buy configured systems, they do buy many of the same components used at the core of the boxes bought by enterprise IT departments (Exhibit 9). The suppliers of these components feel both the opportunity of rising hyperscale investment and the weight of eroding enterprise demand. Intel, which lumps both markets together in its Data Center Group (DCG), is an example. It has touted the strong growth of its “Super 7” customers – Alphabet, Facebook, Amazon, Microsoft, Baidu, Tencent and Alibaba – in forecasting 15% annual sales growth for the group, but stumbled right out of the gate with single-digit growth in the first two quarters after making the forecast (Exhibit 10). The culprit appears to be falling demand for enterprise server capacity, as IT departments scale back on investment in anticipation of a move to SaaS applications or IaaS hosting in the public cloud. We believe that this tradeoff is likely to continue to thwart Intel’s ambitions for the DCG.
Exh 9: Component Cost Comparison, Hyperscale versus Traditional Vendors
Longer term, Intel’s fat DCG profit margins also raise the question of future alternative processing architectures for customers unaccustomed to paying top dollar. As a loyalty incentive, the Super 7 all have first look privileges, getting new chips 6 months earlier than ordinary customers, and 5 of the 7 design custom versions for their own use. Earlier this year, Bloomberg reported that Google was in talks with Qualcomm over the possibility of a shift to ARM-based processors, which might offer lower prices and better power efficiency. We discount this story in the immediate term, given the massive switching costs in shifting huge core processes over to a new architecture, but would not be surprised to see any of the Super 7 deploy ARM-based servers for discrete applications as a warning shot to Intel and a potential stepping stone to a bigger deployment in the future. Given Intel’s 98% market share in server processors, hyperscale operators will be looking for as much leverage as they can find.
Exh 10: Intel Data Center Group Segment Financials, 1Q14-1Q16
Exh 11: Enterprise HDD Shipments, 1Q14-1Q16
A similar dynamic exists for disk drives. Seagate and Western Digital both point to surging demand from hyperscale data center operators, who buy drives directly and install them onto their custom server boards. Still, that demand is not enough to make up for the steady erosion of the bread-and-butter PC market or the declining sales to RAID storage systems makers like EMC. While the industry has consolidated down to just these two suppliers, plus Toshiba – which doesn’t supply the hyperscale market – eroding PC sales, weak enterprise demand and falling prices continue to plague the market (Exhibit 11). Analog semiconductor firms with significant business in power conversion chips, like Analog Devices, Linear, Infineon, Maxim, On, Texas Instruments and others, may also see weakness from the data center paradigm shift.
Companies Levered to Hyperscale
The biggest differences between hyperscale architecture and traditional clustered server enterprise data centers are the largely proprietary software platforms that allow the former to parse impossibly large computational problems onto massively parallel processors and manage unstructured databases of apparently unlimited size, all using commodity hardware components. Still, the size and ambitions of these mammoth collections of interconnected data centers create demand for very high performance components for specific tasks. For example, typical enterprise data centers connect the different servers with 10Gbps InfiniBand or Ethernet interfaces, while hyperscale operations are already moving from 40Gbps Ethernet to 100Gbps – the leading edge of the interface chips employed. Companies that supply these chips – Mellanox, Broadcom, Intel, Inphi, Lumentum, Microsemi, AMCC, Marvell, and a few others – should see growing demand and attractive margins (Exhibit 12).
Exh 12: Addressable Market Sizes For Select Data Center Component Players
These speeds also push demand for similarly fast switching fabrics – Mellanox reports strong demand for its open 100Gbps switching chips designed to power “white box” switches controlled by operator-proprietary SDN networking software solutions (Exhibit 13). Configured switch maker Arista also reports growing sales of its 100Gbps products into cloud operators, an indicator of operator willingness to pay for top performance, even if, in this case, it requires supporting a system vendor’s proprietary software. This is also true for inter-facility optical links. Hyperscale operators still buy Transport SDN enabled long-haul optical ROADMs (reconfigurable optical add-drop multiplexers), transponders and switches from box vendors like Ciena, Nokia, Ericsson and Infinera. The big cloud players are still a small slice of demand for this equipment relative to traditional telecommunications service providers, but one that is injecting an element of consistent growth into what has been a fairly modest growth market underneath its dramatic cyclicality.
Exh 13: Networking Port Market Size and Forecast, 2015-19
Alphabet, Microsoft, Facebook, IBM and Amazon are also at the forefront of the emerging AI cloud revolution, which we believe could be the fundamental driver for the next era of technology (http://www.ssrllc.com/publication/a-deep-learning-primer-the-reality-may-exceed-the-hype/). The nature of deep learning algorithms – highly iterative against massive data sets – favors a processor architecture more akin to the graphics chips used for high performance gaming than the general purpose CPUs of traditional servers. Not surprisingly, these top hyperscale operators have begun to invest in GPUs. Alphabet has designed an ASIC – the Tensor Processing Unit – to fill this need, but most others have adopted Nvidia’s Tesla line of data center-optimized chips. There is also some impetus behind the use of field programmable gate arrays (FPGAs) from Intel’s recently acquired Altera business unit – Microsoft has been trialing them – for some deep learning applications, with the advantage that the basic algorithms can be more easily modified as needed.