The following content may no longer reflect Microsoft’s current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.
Executive Overview
The overarching challenge Microsoft faces in developing enterprise solutions for business intelligence (BI) is the same challenge that faces our customers and peers: big data. With the cost of storage devices reduced to less than the value of the data they can store, it is more cost effective than ever to store as much information as can be collected. Data accumulates at an accelerating pace: one estimate places the Internet alone at about 75 million servers with more than 500 million terabytes of data.
Data on its own is not useful. Gaining insight from all of that collected data by producing predictive analytic systems that maximize the data’s business value is the focus of the work in big data at Microsoft.
Why you should care:
Big data can help unlock predictive trends and develop proactive guidance.
Analytical insights from big data can improve the quality of service levels.
The business collects a large amount of data as part of its daily operations. If properly understood, that data can provide important insights about customer needs, business efficiency, predictions for future opportunities, and much more.
Correlating proprietary data collected by the business with publicly available information (houses sold by zip code, laws passed in Congress, economic forecasts) brings new insights and offers a strategic advantage for business planning.
Understanding the use of current resources ranging from network bandwidth to availability of natural resources can provide important predictions for a business’s ability to meet future demands or develop new technology to stay competitive.
Companies that aren’t leveraging big data might be putting themselves at a considerable competitive disadvantage in the near future.
What is Big Data?
Many people think of big data as only large datasets, but it is more than that. Big data can answer new types of questions and create new opportunities. A multitude of data sources exist, including personal, organizational, public, and private.
Some examples of where Big Data is generated include:
Enterprise resource planning, supply chain management, customer relationship management, and transactional web applications are classic examples of transaction-processing systems. Highly structured data in these systems is typically stored in Microsoft® SQL Server® databases.
Web 2.0 is about how people and things interact with each other or with your business. Web logs, user click streams, social interactions and feeds, and user-generated content are classic places to find interaction data.
Ambient data. The number of devices and technologies that generate ambient data has increased. Sensors for heat, motion, and pressure, along with radio-frequency identification and global positioning system chips within such things as mobile devices, ATMs, and even aircraft engines, are just some examples of “things” that output ambient signals.
The Open Data Initiative means more and more governmental data is being made publicly available.
What are the attributes of Big Data?
Volume. The size of data that counts as big is relative to the context of the current time. Just as Moore’s law describes computing power doubling roughly every 18 months, data volumes grow at a comparable pace.
Velocity. This is the rate at which data arrives at the enterprise and is processed or well understood.
Variety. This has to do with all the various sources of available data in all forms, formats, and shapes. The terms structured data and unstructured data are often used, but to clarify, all data used within the big data context has some structure. When we refer to unstructured data, we are actually referring to the subcomponents that don’t have structure, such as free-form text in a comments field or the image in an auto-dated picture. Big data is any type of data, structured or unstructured, such as text, sensor data, audio, video, click streams, or log files.
Complexity/variability. This refers to the variability of meaning as distinguished from the variety of formats.
Veracity. This is about trusting the data being consumed. How can data be acted upon if it can’t be trusted? Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
Visibility. To make informed decisions, you need to have access to and be able to see all of the data that is required to help you make those decisions. Visibility is needed at the application layer to identify emerging trends within dynamic data streams, but the underlying infrastructure can act as sensors.
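The Volume attribute above compares data growth to an 18-month doubling period. As a rough illustration only, the arithmetic behind such a claim can be sketched as follows. The starting volume of 100 TB and the function name `projected_volume` are hypothetical and do not come from any Microsoft measurement.

```python
def projected_volume(initial_tb: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Project data volume, assuming it doubles every `doubling_months` months."""
    return initial_tb * 2 ** (months / doubling_months)


# Hypothetical starting point: 100 TB of data today.
for months in (0, 18, 36, 54):
    print(f"after {months:2d} months: {projected_volume(100.0, months):.0f} TB")
```

Under this assumption, storage needs grow eightfold in four and a half years, which is why a data set that is considered big today may be routine a few years later.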
How does Big Data differ from traditional BI?
Where traditional BI relies on limited data sets, cleansed data, and simple models and primarily supports causation (what happened and why did it happen?), big data analytics uses many diverse and uncorrelated data sets, thrives on raw data, and uses ultra-complex predictive models. It leans toward correlation (multiple, unrelated data sources turn up insights that cannot entirely be explained).
Traditional BI looks at descriptive and diagnostic analytics focused on what has happened. Traditional BI implies analytics and reporting on structured (SQL) data held in tables in relational databases. In many instances, big data processing includes a higher level of unstructured data in combination with structured data. The real difference between traditional BI analysis and big data analysis is in the results you are looking for. With traditional BI analysis, a clearly defined set of input data exists that has a (reasonably) well-defined set of information behind it. The analysis is retrospective and provides information about what has gone on in the past. With big data analysis, the data is less well defined in meaning but provides information for the future: not just a trend line from past experience but predictions based on complex data.
Analysis Focuses on Value Drivers
Big data analytics (also known as big data advanced analytics) focuses on predictive and prescriptive analytics: future trends and how to take advantage of those trends as determined by customers’ criteria (for example, the desire to maximize profit but take on more risk, or the desire to take less risk with modest profit, or no risk with minimal profit). Each is a valid path a customer may choose, depending on the particular business strategy.
Using big data analytics, we can help identify value driver opportunities within the businesses, particularly around revenue, cost, and risk.
Enabling Big Data at Microsoft IT
We think of big data as actually having two parts: the big data technology itself, which allows us to hold and query large volumes of data, and the people—scientists and statisticians—who help us extract business value and drive business insight from the data.
Although the technology provides a scalable framework that allows us to structure and process large amounts of data, the scientists and statisticians create value for the enterprise by transforming data into analysis solutions for real-time decision making and implementing these solutions in a production environment for access by business users.
Extracting information out of various types of data takes different skill sets. Data science is a multidisciplinary field: It is important to form a team with a variety of strong quantitative talents.
Solution
Microsoft built and staffed Big Data Analytics & Platform Services.
Microsoft IT invested in becoming an innovation leader in Big Data–Big Math.
Microsoft IT now can offer services ranging from big data architecture, design, and development to operationalization of big data analytics capability supported by data scientists, customized for the business.
Challenges
Federating external data sources. Already, most of the data Microsoft uses comes from external sources—partners, customers, vendors, government, industry, social media—and the trend is only growing. The challenge is to federate external data with internal data so that it’s accessible and usable whenever the business needs it. That requires systems that validate external data rather than trying to control it.
Producing predictive analytic systems. Traditional BI systems produced a rear-view mirror look at data; those systems can only analyze the results from what has already happened. Big data offers more than that. Big data analytic systems can predict what is going to happen—first to gain greater competitive advantage, and second to respond to the new customer relationships that will increase in the devices and services world, particularly the world of continuous online services.
Big Data Analytics Architecture
The Data Decision Sciences Group (DDSG) enables customers to convert their raw data into credible, consistent information by enriching data through enterprise information management capabilities and advanced analytics.
SQL Server provides strong data transformation capabilities through SQL Server Integration Services, data cleansing through SQL Server Data Quality Services, and data governance through SQL Server Master Data Services. Currently in the big data space, Apache Hadoop is commonly seen as the solution to deploy. Hadoop is an open source framework that supports data-intensive distributed applications. The Hadoop platform includes the Hadoop kernel, MapReduce, the Hadoop distributed file system, and a variety of other projects, such as Apache Hive and HBase, giving customers the ability to store and harness unstructured and complex data types on commodity hardware. Hortonworks is a Gold Partner that built a Hadoop distribution that runs on top of Windows Server® at Microsoft.
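To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to counting words in log lines. It is plain Python and does not use Hadoop or any Hadoop API; the function names and the sample log lines are illustrative assumptions. In a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical log lines standing in for a large unstructured data set.
log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(log_lines)
print(counts)
```

The design point is that each map call sees only one record and each reduce call sees only one key, so the work can be partitioned across commodity hardware without shared state.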
For predictive analytics, DDSG offers data-mining tools in Microsoft SQL Server Analysis Services. Through Microsoft’s self-service tools as well as the data-mining add-ins, you can access and mash up data from virtually any source, including data from the Windows Azure™ Marketplace, and continue to refine those data sets to create compelling analytical applications.
Predictive analytics. Microsoft provides out-of-the-box data-mining algorithms with SQL Server Analysis Services:
Forecast sales and inventory, and discover which items tend to be sold together.
Identify the most profitable customers, and anticipate customer losses.
Uncover unintuitive relationships in data.
Look for themes and trends in unstructured text.
Identify and handle anomalies during data transfer or data loading.
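As an illustration of the first capability in the list above (discovering which items tend to be sold together), the following sketch counts how often each pair of items appears in the same purchase. This is a simple co-occurrence count, not the association algorithm that SQL Server Analysis Services implements; the baskets and the function name `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def pair_counts(baskets: Iterable[Iterable[str]]) -> Counter:
    """Count how many baskets contain each unordered pair of items."""
    counts: Counter = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical purchase data.
baskets = [
    ["console", "controller", "game"],
    ["console", "controller"],
    ["tablet", "keyboard"],
    ["console", "game"],
]
counts = pair_counts(baskets)
for pair, n in counts.most_common(2):
    print(pair, n)
```

A production data-mining model would also weigh how common each item is on its own (support, confidence, lift) rather than relying on raw pair counts.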
For advanced analytics, DDSG supports commonly used non-Microsoft tools and frameworks such as Apache Mahout and R, and uses the marketplace to tap into the latest analytical techniques from the community of data scientists.
Big Data Analytics Engagements at Microsoft
The High-Performance Big Data Platform and Analytic Services support advanced analytic computing over large, complex, diverse data sets (and often varied data types). Microsoft IT data scientists partner with the business to deliver actionable business insights, using their data to help guide decision making. Microsoft IT’s Analytic Services offering supports advanced predictive modeling, text mining, experimental design and scenario testing, variation detection, statistical surveys, and system simulation and forecasting.
A business owner begins the engagement with the DDSG by identifying the business problem to address. It is important at this stage to have some idea of the kind of data and about how much data is relevant for the analysis. Next, the business owner works interactively with a project manager and a data scientist to capture the business and data requirements so that DDSG can help formulate an analysis that will provide the decision-making information the business needs. By its nature, this is an iterative process and requires interaction to be sure the analysis will result in valuable business decision-making information.
The DDSG runs the analysis and works with the business partner to refine and get results. For some business problems, that will be the end of the engagement for that particular business problem. In other cases, the analysis will need to be operationalized so as to provide ongoing information as new data is collected.
Big Data Projects at Microsoft
The DDSG at Microsoft has a history of delivering results in big data. This section describes some of the projects for which they helped business owners make decisions based on data tempered with experience.
User segmentation
Built utilization-based customer segmentation by analyzing the click stream from the Windows Telemetry panel
Determined how customers use PCs
Segmented customers based on usage patterns
Segmented for product planning in FY13
Applied advanced analysis—cookie data
Performed analyses that SQL Server couldn’t handle
End user profile (EUP)
Provided improved insight, established a process to identify potential licensing shortfalls for Microsoft products, and delivered actionable BI to enable cross-company antipiracy work
EUP project for software piracy detection
Found potential piracy scenarios and built user profiles
Built a predictive model to identify piracy
Analyzed and modeled data from disparate sources with high volume and velocity
Unlicensed PCs
Analyzed the behavioral trending of new Windows® 8 devices in the original equipment manufacturer channel, downstream distributors and resellers who are not properly licensed, and subsequent impact on return on investment
Multi–billion-dollar business decisions
Marketing spend effectiveness
Segment analysis
Partner behaviors, cycle times, and trends
Software piracy
Dependency on data integrity, quality, security, and governance
MS.com
Targeted visitors who showed an interest in Surface™, Windows Phone, or Xbox® on the basis of their MS.com or Windows Store behavior
Identified potential customers for Windows Phone 8, Surface, and Xbox based on browsing behavior and created banner ads directed toward these likely customers
Combined data from a variety of sources to target customers
Used more than a terabyte of cookie data
Benefits
Able to get insights out of big data at Microsoft
Enabled enterprise-wide decision making
Microsoft IT is becoming an innovation leader in Big Data–Big Math
Conclusions
Data storage costs are down. A vast collection of available data from a variety of sources can now be federated and analyzed.
Microsoft products, tools, services, and technologies work with non-Microsoft products to deliver big data analytics.
Microsoft IT built a big data analytics platform and analytic services for the enterprise.
Prescriptive and proactive analytics enable enterprise-wide decision making and drive business value.
Resources
Microsoft Big Data
www.microsoft.com/bigdata
Microsoft BI Blog
http://blogs.msdn.com/b/microsoft_business_intelligence1
Windows Azure
www.windowsazure.com/en-us/home/scenarios/big-data
SQL Server
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
Preview of the Windows Azure HDInsight Service
http://technet.microsoft.com/en-us/library/hh315814.aspx
Microsoft Big Data Solution Sheet
http://download.microsoft.com/download/1/8/B/18BE3550-D04C-4B3F-9310-F8BC1B62D397/MicrosoftBigDataSolutionSheet.pdf