AI Data – Food For (Artificial) Thought

Paul Sagawa / Artur Pylak


psagawa@ /

Twitter: @PaulSagawaSSR

October 31, 2016

Data is the fuel for deep learning – the more data, the better the insight that can be gleaned, as systems constructed from many thousands of recursive algorithms, organized into layers of increasing complexity, adjust themselves through millions of iterations. As such, for companies that have it, along with experienced scientific talent and access to AI-tuned hyperscale datacenter capacity, data is an asset with extraordinary potential value, driven by three key factors. 1. Relevance – the potential markets that could be addressed using the data; 2. Quality – the scarcity, freedom from restriction, ability to be integrated with other data sources and the condition of the data (free from errors, gaps and other shortcomings); 3. Size – the number of different records; the detail provided for each record (i.e. the number of attributes and their specificity); and the number of data points associated with each attribute. Using these factors, we ranked 15 different companies based on their data assets, and evaluated potentially valuable data holdings for 5 industry sectors. We believe GOOGL, FB and AMZN – in that order – stand out in the value of their information for deep learning initiatives. We note that AAPL could be a data leader, but chooses not to be, and that TWTR’s unique data assets should be attractive to potential buyers.

  • Data is critical for deep learning. Successful deep learning systems development requires three things – talent, datacenters, and mountains of data. While talent and AI-tuned hyperscale datacenters are the province of a privileged few, data is everywhere. There is an estimated 11 zettabytes of digital data today (one zettabyte is a trillion gigabytes), growing at a trajectory to hit 44 zettabytes in just 4 years. The most valuable data can shed light on high potential markets, is unique to a particular company, is linked to other valuable datasets, is unencumbered by restrictions, and is complete and well formatted. Moreover, for deep learning purposes, the best datasets will have as many individual records, span as long a time period, and contain as much granular detail as possible.
  • Data, three ways. First, the internet and the proliferation of smartphones are driving an explosion of consumer data in five main areas: demographics, interests, activities, connections, and records. Second, enterprises are also building mountains of data from ordinary business operating activities, from specialized industrial areas (e.g. drug discovery, financial trading, energy exploration, etc.), and from electronic activity itself, which creates important data in areas like employee productivity, or system security. Finally, institutions, primarily government and academia, also generate substantial datasets around the economy, scientific research, security/law enforcement, and the community.
  • Relevance – addressable monetization opportunities. Some data is obviously valuable, with implications for better products and/or greater efficiency in large, well-established areas of the economy. The initial beachhead for commercial deep learning began with data on consumer interests and actions used for targeting ads, streamlining retail sales, personalizing devices and cloud services, etc. – markets worth trillions of dollars globally. Other major addressable markets: transportation, health care, workforce productivity, financial services, manufacturing, wholesale distribution, energy, agriculture.
  • Quality – Scarcity, freedom, integration and condition. Data is subject to the laws of supply and demand. Information easily available to anyone is less valuable than a unique and proprietary data source. Likewise, the value of proprietary data can be seriously encumbered by regulation and other barriers that constrain its use. Datasets can also have characteristics that make it easy or difficult to integrate with other data – easily integrated data is obviously more valuable. Finally, information that is riddled with random errors, inconsistent formats, missing records, and other flaws is less valuable than scrubbed, comprehensive datasets.
  • Size – Breadth, length and depth. The amount of data is a key factor in drawing unique insights from deep learning, and has three dimensions. Breadth is the number of records in a dataset – for example individual consumers or specific video files. Depth is the number, detail and interconnectedness of various attributes within each record – e.g. rich, multi-category user profiles or comprehensive medical records. Length refers to the number of related data points in each attribute, often related to the time over which they have been collected. Generally, more is better on all three dimensions.
  • Consumer internet giants are way ahead. The rise of the web forced its early leaders to deal with a tsunami of data, followed by a mobile wave an order of magnitude, or more, larger. These companies also had the foresight to set policies that yielded high quality datasets. Addressing huge opportunities, like advertising (~$550B), retail (~$6T), media distribution (~$632B), and others, information is already a formidable weapon for companies like GOOGL, FB, and AMZN, and is becoming an even bigger differentiator with the accelerating adoption of deep learning systems. Other consumer internet datasets with significant value include Snapchat, TWTR, NFLX, and potentially, AAPL.
  • MSFT leading data-driven enterprise competition. SaaS applications give their hosts an opportunity to analyze the data on their systems for the benefit of their specific clients. Here, MSFT, successfully transitioning to Office 365 and buying LNKD, has powerful assets. To a lesser extent, CRM can leverage its customers’ sales data to provide better service to them. IBM has acquired some specialized data sets (e.g. health care records, weather) and partners with clients to develop deep learning solutions for their data.
  • Data in many traditional companies is of low quality. Many companies in many sectors of the economy tout their data holdings as important assets – e.g. retailers, banks, credit card nets, health care providers, airlines, etc. Often, these data assets are of low quality – incomplete, encumbered by restrictions, poorly linked, and/or undifferentiated – making them much more difficult to monetize. Furthermore, few of these organizations have the talent or datacenter resources to independently develop market-leading AI, and will need outside help.
  • Valuation. There are almost no true data pure plays. IBM’s $2.6B deal for Truven was almost entirely data driven, and the premium for MSFT’s $26B acquisition of LNKD seems based on its data. We believe that share prices for data heavy companies like TWTR, NFLX and others are undervaluing their information assets in the coming AI era.

I Know What You Did Last Summer

In the past weeks, we have written about the scarcity of experienced scientific talent and how a small number of prescient companies have built robust, sustainable communities within their organizations that give them substantial advantage on big opportunities. We have also written about the infrastructure needs of major AI solutions, and how GOOGL, MSFT and AMZN are all working to deliver value-added AI hosting services on their platforms. Data is the third vital ingredient to leading edge AI, one that at first glance seems bountifully present for many would-be AI players. Still, not all data is created equal.

We see three factors that determine the value of a company’s data – relevance, size and quality. First, does it help to address large, attractive market opportunities? Second, how much data does the company have – how many individual records, with how much detail and stretching back how long? Finally, how good is the data – is it rare, is it unencumbered by regulation, is it clearly tied to other relevant data, and is it free from errors and gaps? With companies from all corners of the economy touting the value of their data assets, we thought it appropriate to assess that value for companies and market sectors across those three factors.

The rise of the internet and the subsequent explosion of mobile devices has put the leaders of the digital economy in a power position. Most of them were forward-thinking enough to capture and archive extremely detailed profiles of their users and their activities. Some collected valuable information as a part of the service to their users – web page indexes, maps, content archives, product reviews, etc. This data, across companies, is generally of high quality – unusual, unrestricted, cross-referenced, complete and clean. The consumer data is very relevant to attractive target markets for AI – advertising, and e-commerce chief amongst them. Other companies, with a more tangential relationship with the internet, typically have significantly smaller datasets of much lower quality. For example, brick-and-mortar merchants have a difficult time tying sales data to specific, targetable consumers and banks know where their card holders shop, but not what they buy, and have restrictions on their ability to use their customer data.

Non-consumer business data is also potentially valuable. AI could dramatically improve patient triage, diagnostics, treatment protocols, drug discovery and a host of other areas, IF the data were available, well integrated, and unrestricted. It’s a big IF – shoddy records, fragmentation, rigid privacy restrictions and other data shortfalls are substantial obstacles – but the opportunity could be huge for companies that hold pieces of the puzzle. Transportation, agriculture, manufacturing, and energy are other examples of industries where companies hold data that could be invaluable fodder for deep learning systems. We note that governments and academic institutions are also shepherds of important datasets that could be fuel for commercial ventures.

Not surprisingly then, GOOGL, FB, and AMZN have the most valuable AI data assets. Other sizeable consumer internet franchises, such as AAPL, MSFT, TWTR, VZ/YHOO, NFLX, and PCLN, fall in the next tier, advantaged relative to traditional consumer facing companies – like banks/credit card nets, retailers (and other merchants), etc. – which typically suffer from poor data quality. Poor quality is endemic to most health care datasets, which also suffer from extreme fragmentation, making IBM’s 300M clean patient records intriguing and highly relevant. Datasets held by industrial companies tend to be idiosyncratic – narrowly relevant, but often sizeable and of good quality.

Brains, Brawn and Beauty

The recipe for leading-edge deep learning systems has three ingredients: a crack team of experienced scientists, ample computing capacity in an AI-tuned datacenter, and mountains of useful data (Exhibit 1). Deep learning scientific talent is in short supply, as a field perceived as an esoteric dead end just 15 years ago has rapidly emerged as the hottest area in computer technology. A few companies were prescient enough to see the coming rise of AI, building impressive rosters of leading thinkers and establishing durable advantage looking forward (Exhibit 2). We wrote about this extensively in our recent publication.

Exh 1: Requirements for AI and Neural Networks

Exh 2: Distribution of Highly Cited (5,000+) AI Scientists by Organization

We have also written about the particular processing needs of deep learning systems and the move by the largest IaaS operators – Amazon Web Services, Microsoft Azure, Google Cloud Platform and IBM Watson Cloud – to offer hosting solutions, including proprietary tools and libraries, for 3rd parties. These companies (unlike fellow hyperscale data center operator and AI talent leader, Facebook) have chosen to use their infrastructure advantage to drive demand for commercial hosting and promote their preferred AI development tools as de facto standards rather than reserve them for proprietary use. This democratizes deep learning to a certain extent, at least for one of the three main elements (Exhibit 3).

The last piece of the puzzle is data. The rise of the internet sharply accelerated the creation and capture of digital information, so much so that Google had to invent a new, scalable methodology for storing and indexing data to build its first search engine. That technology, later contributed to the open source community, became the basis of the modern hyperscale data center. When the internet turned mobile, spurred by the introduction of the iPhone, data collection accelerated once again. Today, more than 2 billion smartphones are in the hands of consumers, who are taking photos, sending messages, making purchases, searching for information, watching videos, reading stories, and just moving about. Businesses are generating data as well – documents, communications, transactions, customer files, production records, trouble reports, you name it. These are also multiplying at a furious pace in the mobile-cloud era. Scientists believe that the total sum of the world’s digital data is into the dozens of Zettabytes – one Zettabyte is a trillion gigabytes – with more than 600GB of stored information for every single human being on the planet. This has grown fivefold over the past 4 years and is expected to quadruple in the next 4 (Exhibit 4).
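
The growth multiples quoted above can be restated as implied annual rates. The short sketch below is plain compound-growth arithmetic; the function name is our own, and the only inputs are the fivefold and fourfold multiples over 4-year spans cited in the text.

```python
def implied_annual_growth(multiple: float, years: float) -> float:
    """Compound annual growth rate implied by a total growth multiple."""
    return multiple ** (1.0 / years) - 1.0

past = implied_annual_growth(5.0, 4)    # fivefold over the past 4 years
future = implied_annual_growth(4.0, 4)  # fourfold over the next 4 years

print(f"past 4 years: {past:.1%} per year")
print(f"next 4 years: {future:.1%} per year")
```

On these figures, the stock of data has been compounding at roughly 50% per year and is expected to slow only modestly, to a little over 40% per year.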

Exh 3: Deep Learning Services Offered by the Big 4

Exh 4: Data Generated and Stored Globally, 2012-2020

This massive universe of data is the fuel for the AI era. Deep learning systems work via tiny incremental adjustments to the feedback mechanisms programmed into them. As each piece of data is processed by each algorithm in the system, the outcomes are evaluated and the algorithms are automatically changed with the goal of achieving a better outcome on the next iteration. As the system works through many, many data points in a data set, and iterates through the whole set over and over, it gets better and better at returning outcomes that match the criteria for success. This is how the computer program learns, and the more and better quality data to which it has access, the more it can learn (Exhibits 5-6).
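
The adjust-and-repeat loop described above can be illustrated with a toy example. This is a minimal sketch, not a deep network: it fits a single straight line by stochastic gradient descent, and the data, learning rate and epoch count are all assumptions chosen for illustration. Real deep learning systems apply the same idea across many layers and millions of parameters.

```python
import random

def train(data, epochs=200, lr=0.01):
    """Fit y ~ w*x + b by repeatedly nudging w and b to shrink the error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):            # iterate through the whole set over and over
        for x, y in data:              # process one data point at a time
            error = (w * x + b) - y    # evaluate the outcome against the target
            w -= lr * error * x        # tiny adjustment toward a better outcome
            b -= lr * error
    return w, b

random.seed(0)
# Synthetic data drawn from y = 3x + 1 plus a little noise (illustrative only).
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 3.0 * x + 1.0 + random.gauss(0, 0.1)) for x in xs]

w, b = train(data)
print(f"learned w={w:.2f}, b={b:.2f} (data generated with w=3, b=1)")
```

Each pass makes only a small correction, but after thousands of corrections the learned values settle close to the ones that generated the data.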

Exh 5: The Basic Deep Learning Reinforcement Learning Algorithm

Exh 6: Hierarchy of Deep Neural Networks

Exh 7: A Typology of Digital Data

Fantastic Data and Where to Find It

We see three major buckets of digital data available to companies hoping to build deep learning systems (Exhibit 7). The first is data generated by consumers. Most of this is generated by on-line activity, although data collected through more traditional means – customer records, transactions, loyalty programs, etc. – remains an important subset. This information can be categorized into five areas: Demographics, Interests, Activities, Connections and Records. Demographic data are descriptors about the users themselves – names, addresses, birthdays, employers, academic history, families, group identifications, etc. Interests are typically revealed by actions – purchases, searches, follows, likes, clicks, and such. Activities are things you’ve done and plan to do – schedules, location trails, transactions, login records, site visits, and other things. Connections are your interactions with your community – contact lists, emails, messages, club memberships, friend lists and personal networks. Records are files created by or for each user – photos, videos, and written posts, but also health care and financial records as well.

Businesses generate their own data. In the course of business, enterprises will build datasets around their customers, their suppliers, their internal processes, their financials, their employees, etc. Organizations in specific verticals may have special needs – doctors, hospitals and health insurers keep detailed patient records, financial institutions keep copious and well-regulated records of their customers’ activities, pharmaceutical companies have data on drug discovery, energy companies rely on geological surveys, and agricultural companies track weather and soil conditions. In addition, all enterprises that rely on electronic systems have data about those systems – usage data that could be used to unlock productivity, improve security, better serve customers, and many other purposes. Finally, governments and other non-commercial institutions have data too – censuses, economic data, legal records, scientific surveys, and other collections. This data is often available to commercial businesses and can be of substantial value to companies.

Exh 8: Factors Determining the Value of AI Data

What is it Good For?

Three factors determine the value of data for AI (Exhibit 8). First is relevance – what market opportunities could the data help a company address? The second is quality – is the data a scarce resource? Is it free from restrictions (regulation, policy, etc.)? Is it in a format that can be easily integrated with other data? Is it in good condition, scrubbed of errors and without gaps? Finally, the third factor is size – are there a lot of records in the dataset, with substantial detail, going back a long time?
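
To make the three-factor framing concrete, here is a hypothetical sketch of how such factors could be combined into a single score. It is not the scoring method behind the exhibits in this report: the 0-5 scale, the equal weighting, and the placeholder datasets and ratings are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    relevance: float  # addressable market opportunity (hypothetical 0-5 scale)
    quality: float    # scarcity, freedom, integration, condition (0-5)
    size: float       # breadth, depth, length (0-5)

    def score(self) -> float:
        # Assumption: the three factors are weighted equally.
        return (self.relevance + self.quality + self.size) / 3.0

# Placeholder assets with made-up ratings, not real companies.
assets = [
    DataAsset("dataset_a", relevance=5, quality=4, size=5),
    DataAsset("dataset_b", relevance=3, quality=2, size=4),
    DataAsset("dataset_c", relevance=4, quality=5, size=2),
]

ranked = sorted(assets, key=lambda a: a.score(), reverse=True)
for asset in ranked:
    print(f"{asset.name}: {asset.score():.2f}")
```

Any real ranking would depend heavily on how the factors are weighted; an equal-weight average is simply the most neutral starting point.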

Of these, relevance is the most obvious – quality and size aside, what is the TAM that systems built using the data might address? Collectively, the answer might be the entire world economy, as nearly any transaction could conceivably be enhanced with greater information, but several areas stand out (Exhibit 9). The first arena to have seen the impact of data-driven deep learning systems may have been advertising, where AI models built on consumer demographic, interest and activity data have been guiding digital ad placement for several years. Globally, measured media advertising is a nearly $600B industry, with a similar amount of spending on non-media advertising and promotion also addressable. Retail is an even larger sandbox, with more than $20T in spending worldwide. For both purposes, data that helps to predict consumer intentions is immensely valuable. Consumer data can also be used to directly improve the functionality and personalization of devices – smartphones, tablets, computers, automobile user interfaces, smart home equipment, etc. – together, a $540B+ global market. Video, image, sound and text files can help AI systems better recommend media content, better classify user generated content, and enable new forms of content, such as augmented reality. Infotainment is a $20B+ worldwide market.

With autonomous vehicles now ever present in technology commentary, transportation is an obviously germane market, with both consumer and commercial opportunities addressing more than $6T in annual spending. Mountains of test driving data are obviously crucial, but detailed map data, traffic data, weather data, and other inputs will also be important. Health care is another massive opportunity, with annual spending of nearly 20% of US GDP – AI could allow much better treatment with fewer resources through systems to predict the spread of disease, to triage patients, to diagnose and even predict illness, to interpret medical imaging and tests, to evaluate treatment protocols, and to discover new drugs. Agriculture firms could make use of weather, soil, crop performance, and other data. Energy companies could make better use of geologic information. Governments are beginning to use crime data for AI systems that help solve cases and anticipate problems. Financial firms, and other enterprises, can use transaction and network activity data to root out fraud and anticipate security threats.

Companies could use the data generated by their operations to improve those operations and make them more effective and efficient – more insightful sales support, better order processing, sharper inventory management and logistics, intuitive customer service, more useful financial tools, etc. Cloud based hosts working with those companies could help. AI could also catalyze more powerful user interfaces, tapping operational data to anticipate work needs, and employees could be better matched to tasks and teams that suit their skills and relationships.

Exh 9: Select Global Market Opportunities Addressable by AI

Data Quality is Job One

Not all data is created equal. The most valuable data is unique, or at least scarce, offering unusual insights to differentiate the deep learning systems built from it. Scarcity expands the value of those near monopoly consumer internet franchises – Google in search, Facebook in social posting, Amazon in e-tail, etc. – with AI fueled by the flood of data enabling personalization and functionality unavailable to would-be competitors. Even services that reach smaller audiences – like Twitter or Microsoft’s LinkedIn – may have substantial value in the uniqueness of their data assets.

Freedom from encumbrances – be they government regulation, legal restrictions, contractual agreements, company policy, or anything else – is another key aspect of data quality. For example, health care records, which could be invaluable toward building lifesaving and cost reducing deep learning based solutions, are locked up by stringent privacy regulations. The same is true for most financial data as well. Meanwhile, most internet businesses have openly collected as much data as they could from their users with few promises for privacy. Some jurisdictions, in particular Europe, have looked to restrict this use, but for the most part, companies like Google, Facebook, and the like are free to do as they like. Apple is notable here for its self-restricting policies around data collection and use.

Some datasets have characteristics that make it easy to tie them to other datasets and to use multiple layers of detail within the dataset itself. For example, data records that contain verified markers – real names, social security numbers, device IDs, etc. – can be matched to other records with the same markers. We call this attribute integration, and it can be a significant issue for data owners. For example, brick-and-mortar retailers have comprehensive datasets containing every transaction executed in their stores, but lack the means to track those sales to specific customers – unless those customers have joined a loyalty program and identified themselves at the point of purchase. Similarly, cable operators know when the individual TVs in a household are on and to what channel they are tuned, but don’t know exactly who is watching and can’t link to other information about a particular viewer. In contrast, logged in users on YouTube can be linked to their search history, their Gmail activity, their location, and other user-specific data points tied to the login information. This is very valuable.
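
The integration point can be shown with a small sketch. The records, field names and the `link` helper below are all hypothetical; the idea is only that records sharing a verified marker (here a `login`) can be joined, while records without one cannot.

```python
# Hypothetical records; field names and values are made up for illustration.
video_views = [
    {"login": "user1", "video": "cooking basics"},
    {"login": "user2", "video": "car reviews"},
]
search_history = [
    {"login": "user1", "query": "cast iron pan"},
    {"login": "user3", "query": "flight deals"},
]
store_sales = [
    {"store": 12, "item": "cast iron pan"},  # anonymous sale: no customer marker
]

def link(left, right, key):
    """Join two lists of records on a shared key, skipping records that lack it."""
    index = {}
    for row in right:
        if key in row:
            index.setdefault(row[key], []).append(row)
    return [
        {**a, **b}
        for a in left if key in a
        for b in index.get(a[key], [])
    ]

linked_online = link(video_views, search_history, "login")
linked_store = link(store_sales, search_history, "login")

print(linked_online)      # the one user present in both logged-in datasets
print(len(linked_store))  # anonymous store sales cannot be tied to anyone
```

The store's transaction data is just as detailed as the online data, but without a common marker it stays an island.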

Finally, data condition is an important aspect of quality. Obviously, a dataset that contains a lot of errors is less valuable than one that has been well scrubbed and verified. Errors are often the product of collection – careless employees, poorly designed forms, etc. – but may also be endemic to the particular sort of information being collected. Condition may also be compromised by gaps in the data – incomplete/biased population samples and/or interrupted time frames – that reduce its value for deep learning.

Size Matters

In deep learning, more data is better than less data. As the system methodically works through data records, adjusting its algorithms to hone its results ever closer to optimum, each data point offers additional insight, which in aggregate, gives the technology its power (Exhibit 10). Of course, the size of a data set can be measured on numerous dimensions. The most obvious of these is breadth. How many records are there? Facebook has more than 1.6B registered users, and it can tie almost all of its collected data back to those individual users, each one constituting a record. Units do not have to be people – if the data pertains to devices, places or widgets, a data set may be organized on those things as the primary record. Again, more is better from an AI perspective.

Data can also have length. Typically, this is a dimension of time – how long does the information go back? Simple data sets, like a population census, may stretch back many, many decades. For example, British land records date to the Magna Carta in 1215. Scientific data may stretch back millions of years based on modern collection techniques like radiometric dating. Modern electronic data typically has a shorter audit trail – Google’s archive of indexed webpages dates back to 1997 and its record of individual search histories goes back to 2005.

Finally, the most valuable datasets deliver depth. How much detail does each record contain? When Tesla brags about the 2 million miles of driving data that it is collecting each day and compares it to the 2 million miles of autonomous driving logged by Google’s autonomous car initiative, it is emphasizing breadth while completely ignoring depth. While Tesla is collecting simple telemetry data at less than 1 Mbps, Google is collecting a 360 degree 3D map of the conditions surrounding each car at nearly 1Gbps. Amazon knows where its Prime members live, for what they’ve shopped, what they’ve bought from Amazon and from merchants in its marketplace, what they like to watch on television, to whom they’ve sent gifts, and a host of other tidbits that can be gleaned from the range of services that it offers them. By comparison, traditional merchant competitors know what their loyalty program customers buy in their own stores IF they remember to use their card.
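
As a rough illustration of the three size dimensions, the sketch below measures breadth, depth and length on a toy dataset. The data and the specific definitions (average attributes per record for depth, average data points per attribute for length) are our simplifying assumptions, not a standard metric.

```python
# Hypothetical dataset: one record per user, each attribute holding a list of data points.
dataset = {
    "user1": {"purchases": [12.5, 40.0, 7.2], "locations": ["home", "work"]},
    "user2": {"purchases": [99.0], "locations": ["home"], "searches": ["shoes", "boots"]},
}

def size_profile(records):
    """Return (breadth, depth, length) for a dict of record -> attribute -> data points."""
    breadth = len(records)                                            # number of records
    depth = sum(len(attrs) for attrs in records.values()) / breadth   # avg attributes per record
    points = [len(values) for attrs in records.values() for values in attrs.values()]
    length = sum(points) / len(points)                                # avg data points per attribute
    return breadth, depth, length

breadth, depth, length = size_profile(dataset)
print(f"breadth={breadth}, depth={depth:.1f}, length={length:.1f}")
```

A dataset can be large on one dimension and thin on another, which is why comparing raw record counts or raw mileage alone can mislead.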

Exh 10: Number of Monthly Active Users by Property

Google – The King of Data

Google has 7 different product franchises, each with more than a billion users – Search, Android, YouTube, Chrome, Maps, The Play Store, and Gmail (Exhibit 11). These services – and several smaller ones, like Photos, Calendar, and the new Allo messenger – generate enormous quantities of data, most of it linked by a single Google sign in. This data already powers AIs that help to target digital ads, which comprise the large majority of the company’s nearly $90B in annual revenues. We believe that this opportunity has considerably more runway – the global measured media ad market is just below $600B, but Google can target other marketing spending (promotions, non-media advertising, etc.) which could more than double the total addressable opportunity. Google’s consumer information trove – including comprehensive demographics, and considerable detail on both interests and activities – is also potentially very valuable for e-commerce, opening more than $6T in TAM as the company pushes more directly into retail.

Through YouTube, Google also has the world’s largest video library (~1B hours with 400K hours uploaded daily), and, with the success of Photos, is also building a massive archive of still images. It is already using these assets to develop deep learning systems to automatically classify the contents of these files, technology that will improve the usefulness of the services to consumers (entertainment is a $632B TAM), but also enhance solutions for opportunities that are further afield, like autonomous driving and other applications of robotics. The self-driving car initiative has also built its own 100 Petabyte set of detailed driving data, and can leverage its 20+ Petabyte maps data base as well. This is in pursuit of the $7.7T global market for road transportation.

Exh 11: SSR Data Value Scorecard – Alphabet

The quality of Google’s data assets matches their size and relevance. Most of Google’s data is completely proprietary and unique. While the company is getting some pressure from regulators over consumer privacy, particularly in Europe, most of its users exhibit no expectations of privacy beyond what Google gives them. In this context, this mother lode of information is essentially unencumbered. Google has also been careful to make sure that its data can be easily cross-referenced, back to the user’s login, or to an IP address. Finally, the nature of the data makes it unlikely that there are any systematic problems with errors or gaps.

Facebook – 1.6B and Counting

Facebook has 1.6B monthly active users and 1.1B daily active users. Both of its messaging platforms, Messenger and WhatsApp, have more than 1B users. 500M people use Instagram at least once a month. All of these users are tied together with a common real name log in. Facebook uses the data gleaned from the activity of its users to target ads, addressing the same $1T+ in measured media and other marketing spending targeted by Google. Unlike its larger rival, Facebook is not actively pursuing AI driven opportunities farther afield, like e-commerce, media distribution, or transportation (Exhibit 12).

Facebook’s users contribute substantial amounts of data. In addition to the detailed demographics collected from every registered user, the friending process yields a comprehensive map of social connections within the database that the company uses as its primary basis for delivering value to its users. Those users also upload huge quantities of videos and photos, typically with descriptors to help with classification. Facebook claims to have over 3B hours of video viewed monthly and nearly 1 trillion photos on file.

Exh 12: SSR Data Value Scorecard – Facebook

We believe Facebook’s data assets are the second most valuable, very slightly behind Google in relevance for having a less ambitious target market and in size, for having less varied detail (location, maps, transactions, etc.). The quality is excellent, having been built from the start with integration in mind and from historically aggressive privacy policies.

Other Online Consumer Franchises

Apple is notable for its strong policy on protecting the data privacy of its users, but this does not entirely prevent the company from gaining insights from their activities (Exhibit 13). While all communications are encrypted end to end, and much of the user activity data is held only on the device itself, Apple retains plenty of data – from App Store transactions to photo libraries stored on iCloud. The official policy is “Differential Privacy”, which uses statistical techniques to disconnect this data from anything that could be used to identify an individual user. This allows Apple to glean some insights on general user behavior to guide device centered products like Siri, but blocks the ability to use it for ad targeting or e-commerce applications. Because of this, the relevance, quality and size of these data sets are compromised.
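
To give a sense of how statistical noise can hide individuals while preserving aggregate insight, the sketch below uses randomized response, a textbook technique that satisfies differential privacy. It is not a description of Apple's actual mechanism, which this report does not detail; the 30% true rate, the 50% truth probability and the population size are assumptions for illustration.

```python
import random

def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_rate(reports, p_truth: float) -> float:
    """Invert the noise: observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(42)
# Synthetic users, about 30% of whom have the sensitive attribute.
true_answers = [rng.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t, 0.5, rng) for t in true_answers]

estimate = estimate_rate(reports, 0.5)
print(f"estimated share of users with the attribute: {estimate:.3f}")
```

No single report reveals much about the user who sent it, since half of all reports are coin flips, yet the population-level rate can still be recovered. That trade-off is consistent with the point above: general behavior remains visible, but individual-level targeting does not.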

Amazon has deep knowledge of a relatively small cohort of users – the roughly 90M households that belong to its Prime program. For these customers, Amazon has extensive demographic information, credit files, a detailed record of searching and buying across many product categories, a record of media consumption for the members that use its Video and music services, and a record of search and app interactions for the several million Echo and Fire tablet users. It uses this data to drive its extraordinary e-commerce franchise, but also has ambitions for advertising and other monetization levers. The quality of Amazon’s data is uniformly excellent.

Twitter has excellent data concerning the interests of its 300M+ regular users, based on who they follow and their interactions with the tweets in their timelines. The overall content of the tweets is also a unique data resource, revealing moment to moment the issues of highest interest to the user base and specific, emerging details about those issues. Today, Twitter uses that data to place ads, but we believe that the data could be used to address further markets and that revisions to the service and more effective marketing could accelerate user growth, perhaps under the leadership of an acquirer.

Netflix has substantial data on the viewing habits of its nearly 87M subscribers, capturing not just the shows that are browsed and chosen, but also the exact moments when streams are begun, paused or dropped. This data is already useful in recommending programming and in predicting the popularity of proposed new content. While the targeted market is narrow compared with broader consumer cloud operations, the quality of the data is excellent, and the dataset is growing rapidly as the subscriber base and its engagement continue to grow.

Snapchat, Pinterest, and other social network operators fall well behind the leaders on the size of their datasets – fewer users with less engagement, less detail and shorter lifespans. Quality may be a step below as well, particularly for Snapchat, which purposely deletes much posted information after specific time intervals. Specialized app companies, like the online travel agencies (OTAs), Yelp, and others, have demographic information on their registered customers integrated with a record of their on-site activity. Of course, many users are infrequent visitors to these apps, and many split their loyalty amongst several providers, diluting the value of their record.

Exh 13: SSR Data Value Scorecard – Other Consumer Franchises

Brick-and-Mortar Players Are Often Data Challenged

Many traditional consumer-facing enterprises – retail and restaurant chains, airlines, hoteliers, banks, credit card networks, insurers, health care providers, and others – generate information about their customers in the course of their businesses, and many are vocal in touting the value of their data assets. However, the practices established for data collection in these arenas were forged long before the emergence of the cloud or “big data”, yielding considerable issues for data quality.

Exh 14: SSR Data Value Scorecard – Brick and Mortar Players

For retailers and restaurants, data is built around individual stores rather than customers. For the vast majority of transactions, the buyer is anonymous, disappearing into the ether after purchase and unrecognized in future interactions. Only when shoppers can be enticed to enroll in and regularly use a loyalty program can a retailer begin to build an effective profile and target customers at an individual level. A few retailers and restaurateurs have made this jump – Starbucks is the poster child – but most continue to struggle. This affects both data quality and the depth of the dataset. Travel service providers have been more successful in building their loyalty programs, but their data is narrow – particularly for infrequent travelers – and often trapped in inflexible formats that hinder integration.

Banks like to talk up their data assets. However, while a big bank has very detailed and seemingly invaluable financial information about tens of millions of consumers, its use of that data is tightly circumscribed by regulation. Moreover, while credit card issuers and networks may know the “how much” and “where” of your spending, they do not know the “what”. Retailers are militant about keeping purchase level data from the payments providers, and new technologies around tokenization will only strengthen their hold on this data. In addition, data tied to individual card numbers and customer accounts is not easily integrated with other data that would allow direct targeting for advertising, particularly once tokenization becomes more commonplace. All of this hurts the quality of the data for possible AI applications.

The health care industry also faces tight regulation of its patient data for privacy concerns. Moreover, the fragmented nature of the industry, combined with rigid data record formats and inconsistent collection standards that make it very difficult to combine data sets or integrate them with other data, leaves providers struggling to make use of deep learning despite its extraordinary potential for improving care and reducing costs. The relevance is very high, but the size of each company’s share of the data and the generally poor data quality stand in the way.

Enterprises Have Data Too

In the course of their business, most enterprises generate data. Some of that data is in categories that are fairly common across industries – sales data, service data, production data, logistics data, financial data, HR data, etc. – typically managed by ERP software systems. Companies also generate other types of data – employees communicating with each other, or with customers, suppliers or partners; documents, spreadsheets or engineering diagrams being worked on or circulated; records of the enterprise’s systems including logs of access attempts both legitimate and fraudulent; just to name a few. While almost all enterprises own their own data, increasingly, the systems that generate it are operated by cloud-based hosts. For SaaS operators, this allows them to use the data, not indiscriminately, but on their customers’ behalf.

This is Microsoft’s data strategy (Exhibit 15). It has some data of its own, generated by Bing, Skype, Xbox One, and other cloud franchises. It is acquiring LinkedIn, which will give it a raft of additional data about working professionals, their connections and their interests. However, it is building what will likely be its greatest data asset as it transitions its massive Office customer base to Office 365, and begins to archive the activity within each organization. That data can be used to personalize Microsoft’s applications to every customer, with tools that allow management to assess productivity across the enterprise. In this, Microsoft gets an asterisk for the value of its data – currently it is of good quality but middling relevance and size, but it will rise sharply on all three dimensions should its strategy play out as planned.

Salesforce already has all of its customers’ customer data on its cloud, and can use it to develop similar sorts of AI tools that benefit both management and users. Relative to Microsoft, both its current data assets and their future potential are smaller and address a smaller corner of the enterprise IT market. Other SaaS operators – Workday, ServiceNow, NetSuite, and others – are at least another level smaller than Salesforce in the size and relevance of the customer data under their control. We note that packaged software leaders – like Oracle and SAP – do not have access to the data generated by their applications or held in their databases at all unless explicitly invited in by their customers.

Exh 15: SSR Data Value Scorecard – Microsoft and Salesforce

IBM is Going Vertical

Many industries have specialized data assets intrinsic to their particular needs. Energy companies rely on geologic surveys. Agriculture firms use weather and soil condition data. Pharmaceutical companies build datasets on molecules, genomes and clinical tests. Telecommunications carriers painstakingly monitor their network operations. Governments and other institutions also have significant banks of data – censuses, economic statistics, voting results, legal proceedings, crime records, funded research, etc. – with significant value for society or to commercial interests. The relevance, size and quality of these data assets are largely idiosyncratic and can be difficult to assess from outside of the organization.

Exh 16: SSR Data Value Scorecard – IBM

Few of the organizations that control these sorts of datasets have the AI talent or hyperscale processing platforms needed to build deep learning applications from them. Here, enterprising IT players, in particular IBM, have moved to partner with companies and governments to build systems based on their data (Exhibit 16). IBM has also acquired some vertical data directly – the purchase of Truven Health, combined with previously acquired data, gives it 300M well-formatted patient records, and its deal for The Weather Company yielded a detailed history of global weather conditions that it used to build an AI model that forecasts weather on a hyper-local basis. We believe IBM’s strategy of identifying and buying unique data assets with an eye toward developing deep learning systems for verticals is well differentiated, and places it well on our list of dataset valuations.

How to Value Data

There are very few data points as to the market value of data. There have been acquisitions of companies where the primary rationale was data – IBM’s $2.6B deal for Truven Health and its 200M well-scrubbed patient records and its $2B purchase of The Weather Company are the best examples. There are publicly traded companies whose value largely rests on their data assets – the market values of the three big credit bureaus, Experian, TransUnion and Equifax, would put the value of clean data on consumer credit worthiness at somewhere between $5B and $14B (Exhibit 17). Arguably, data assets are a considerable piece of the valuation for Alphabet and Facebook – GOOGL trades at more than 3 times the S&P 500 sales multiple, and FB’s 16.8x ttm P/S is more than 8 times the average – suggesting data values in the hundreds of billions of dollars.
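
The arithmetic behind the sales multiple comparison can be sketched as follows. This is a back-of-the-envelope illustration, not a valuation method used in this report: the helper names are hypothetical, and attributing the entire premium over the market multiple to data is a deliberately crude assumption that yields an upper bound at best.

```python
def implied_market_multiple(company_ps: float, times_market: float) -> float:
    """Market P/S implied if a company's P/S is `times_market` times the market's."""
    return company_ps / times_market


def premium_share(times_market: float) -> float:
    """Fraction of a company's value above what the market multiple alone would give.

    Assumption (ours, for illustration): equal sales, so value scales with P/S.
    """
    return 1.0 - 1.0 / times_market


# Figures taken from the text: FB at 16.8x ttm sales, "more than 8 times" the average.
FB_PS = 16.8
FB_TIMES_MARKET = 8.0

# Because FB is at MORE than 8x the average, the market multiple sits below this ceiling.
market_ps_ceiling = implied_market_multiple(FB_PS, FB_TIMES_MARKET)

# Share of value above the market multiple: at least 7/8 for FB, at least 2/3 for GOOGL (3x).
fb_premium_floor = premium_share(FB_TIMES_MARKET)
googl_premium_floor = premium_share(3.0)

if __name__ == "__main__":
    print(f"implied market P/S ceiling: {market_ps_ceiling:.2f}x")
    print(f"FB premium share floor: {fb_premium_floor:.1%}")
    print(f"GOOGL premium share floor: {googl_premium_floor:.1%}")
```

Multiplying a premium share by a market capitalization gives the value in excess of a market-average multiple; how much of that excess is owed to data rather than growth or margins is a judgment call that this sketch does not settle.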

Rather than placing a specific dollar value on each company’s data assets, we’ve ranked the top 10 cloud-based US companies based on a subjective assessment of the relevance, size and quality of their holdings (Exhibit 18). We also used the general framework to offer blanket perspectives over the value of data held by companies in certain sectors, such as retail, financial services and health care.
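
For readers who want to apply the same framework to their own list of companies, a minimal sketch of a relevance, size and quality scorecard follows. The 1 to 5 scale, the equal default weights and the placeholder company names and scores are all assumptions made for illustration; they are not the scores behind the exhibits in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataScore:
    """Subjective 1-5 scores on the three factors (scale is an assumption)."""
    name: str
    relevance: int
    size: int
    quality: int

    def total(self, weights=(1.0, 1.0, 1.0)) -> float:
        w_rel, w_size, w_qual = weights
        return w_rel * self.relevance + w_size * self.size + w_qual * self.quality


def rank(scores, weights=(1.0, 1.0, 1.0)):
    """Return the scorecards sorted from most to least valuable data holdings."""
    return sorted(scores, key=lambda s: s.total(weights), reverse=True)


# Hypothetical placeholder entries, not real company assessments.
EXAMPLE = [
    DataScore("Company A", relevance=5, size=4, quality=5),
    DataScore("Company B", relevance=3, size=5, quality=2),
    DataScore("Company C", relevance=4, size=3, quality=4),
]

if __name__ == "__main__":
    for card in rank(EXAMPLE):
        print(card.name, card.total())
```

Changing the weights shows how sensitive a ranking is to the emphasis placed on each factor: weighting size alone, for example, moves the hypothetical Company B to the top.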

Exh 17: Sales Multiples of AI leaders vs. Credit Bureaus and the Market

Overall, we believe that Alphabet’s data assets are easily the most valuable. Its extensive and detailed information on consumers positions it to build deep learning systems that deliver more effective ads, participate more fully in e-commerce, enhance the value of its hardware and software products, and support new consumer services (travel, media, etc.). In addition, its driving and maps data give it an enormous advantage in addressing the paradigm shift to autonomous transportation. Facebook is a clear number two, falling short mainly on the breadth of its ambitions. Amazon rounds out the top three, trailing on the size of its database – its 300M shoppers and 90M Prime members fall well short of the Google/Facebook standard.

A couple of companies of interest: First, Apple could rank much higher on the list but for its self-imposed privacy restrictions – there is much data that it chooses not to collect while its rivals impose no such restrictions. Second, despite its well-publicized struggles, Twitter’s data could be very valuable to the right owner, with excellent depth on consumer interests, and a unique and comprehensive map of recent historical trends.

Exh 18: SSR Data Value Scorecard – Top 10 Cloud-Based US Companies

©2016, SSR LLC, 225 High Ridge Road, Stamford, CT 06905. All rights reserved. The information contained in this report has been obtained from sources believed to be reliable, but its accuracy and completeness are not guaranteed. No representation or warranty, express or implied, is made as to the fairness, accuracy, completeness or correctness of the information and opinions contained herein. The views and other information provided are subject to change without notice. This report is issued without regard to the specific investment objectives, financial situation or particular needs of any specific recipient and is not to be construed as a solicitation or an offer to buy or sell any securities or related financial instruments. Past performance is not necessarily a guide to future results.
