Tools for Big Data and Cloud Analytics

Businesses are at risk of drowning in data. In recent years, there has been an explosion in the amount of data available to businesses about their customers, their inventory, and the state of their production, logistics and transmission systems. All of this data, whether it comes from social media, transactional systems, or monitoring and control systems, is only useful to the extent that it can be aggregated, integrated and analyzed to produce business insights.

While a recent study by Enterprise Management Associates (EMA) indicates that every company, large and small, has at least one ‘big data analytics’ project underway, business leaders also report frustration in getting actionable insights from that data that they can use to improve their business results.

In this blog, we will discuss how a new generation of tools from Dell Software and others can be used to integrate data from various sources, organize it in a fashion that makes it useful and then analyze it to understand what it is telling you about your customers and processes, and how they are likely to behave in the future.

Start with a Data-Based Hypothesis

The first step in this process is to form a hypothesis about the problem you are trying to solve and what you want the data to tell you. That hypothesis then leads you to the sources of data that might provide the insight. For example, if you are a retailer and want to understand customer buying behavior, you might start by bringing together transactional information from your stores (what did you sell?), customer sentiment information from social media (what are people talking about?), local demographic information (where are your customers?) and weather forecasts.

All of this data exists, but it comes in many different forms, some internal to your business and many external, so the first step in any big data project is typically data integration: simply getting access to the information you need. Since much of the data may be external to your business, the cloud and cloud-based tools like Dell Boomi will be very useful for pulling the data together. Historically, people would collect all that data into a data warehouse. In a big data world, however, the sheer volume and time-sensitive nature of this information usually mean that you are going to pull the data from its source as you need it. Just-in-time data integration is often the only practical way to do this.
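As a rough illustration of the retail example, the sketch below merges three small in-memory feeds on a shared store and date key. The field names, the sample values and the `integrate` helper are all hypothetical and invented for illustration; in practice an integration service would fetch each feed from its source on demand rather than from hard-coded lists.

```python
from collections import defaultdict

# Hypothetical sample feeds; real ones would be pulled from their sources as needed.
sales = [
    {"store": "S1", "date": "2015-06-01", "units_sold": 120},
    {"store": "S2", "date": "2015-06-01", "units_sold": 80},
]
weather = [
    {"store": "S1", "date": "2015-06-01", "forecast": "rain"},
    {"store": "S2", "date": "2015-06-01", "forecast": "sun"},
]
sentiment = [
    {"store": "S1", "date": "2015-06-01", "mentions": 14},
]


def integrate(*feeds):
    """Merge records from several feeds that share a (store, date) key."""
    merged = defaultdict(dict)
    for feed in feeds:
        for record in feed:
            key = (record["store"], record["date"])
            merged[key].update(record)
    return dict(merged)


combined = integrate(sales, weather, sentiment)
for key in sorted(combined):
    print(key, combined[key])
```

Note that store S2 has no sentiment record, so its merged row simply lacks a `mentions` field. Deciding how to handle gaps like this is part of the integration work.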

Extract the Information You Need

Now that you have identified the data and have a strategy for integrating it in a timely way, it is time to understand the structure of the data. Some of the internally generated transactional data may be held in relational databases, which are highly structured and relatively easy to work with. Other forms, such as social media data or feeds from supervisory control and data acquisition (SCADA) systems, may be time series or only partially structured, and still others (such as video feeds) might be completely unstructured.

In order to analyze and correlate the information from these various sources, the data needs to be normalized. In other words, the relationships between the different structured forms of data need to be mapped (a process called Data Management), and structured information needs to be extracted from the unstructured data. This is where technologies like Hadoop become very useful, as they allow you to take large amounts of unstructured or semi-structured data and break it into chunks small enough to process, so that you can extract the information you want.
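The chunk-and-process idea can be sketched on a single machine. The example below is only an illustration of the map and reduce pattern, not Hadoop's actual API: it splits some made-up social media posts into chunks, extracts hashtag counts from each chunk independently, and then combines the partial results. The sample posts and all function names are assumptions.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Break the input into fixed-size chunks that can be processed independently."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: extract structured (hashtag -> count) data from one chunk of raw text."""
    counts = Counter()
    for line in chunk:
        for word in line.lower().split():
            if word.startswith("#"):
                counts[word] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: combine the partial results from two chunks."""
    return left + right


# Hypothetical unstructured input.
posts = [
    "Loving the new #umbrella from the store",
    "Rain again today #umbrella #boots",
    "Sunny weekend ahead #sandals",
    "Need new #boots before winter",
]

chunks = split_into_chunks(posts, chunk_size=2)
# On a cluster, each chunk would be mapped on a different node in parallel.
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Because each chunk is processed without reference to the others, the map step can be spread across as many machines as the data volume requires.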

Tools like Dell’s Kitenga work with Hadoop to produce a structured data set that can then be analyzed. Products like Toad Intelligence Central can be used to create a data warehouse for highly structured relational data, and products like Boomi Master Data Management (MDM) can do the same for those external sources that need to be integrated and converted to a standardized format for analysis. Getting to this point may be up to 80 percent of the work in a big data analytics project, and we have not yet done any analytics! As business leaders, you should remember the old data processing maxim “Garbage In, Garbage Out,” which applies fully here.

Make Predictions Based on Intelligence

At this point, we have a set of data that we can analyze! We know how to get the feeds we need from the various sources, and we can convert the data to a form suitable for analysis. So how do we do the analytics? There are two main approaches, both of which are based on statistical techniques. Business Intelligence (BI) tools tend to focus on the first steps of this process, integrating and normalizing the data with a view to presenting it in forms that are standardized (think reports) or more intuitive (think graphics). I think of BI as a technique for telling you what happened in the past.

Predictive Analytics uses much deeper statistical analysis to give you insights into what might happen in the future. Tools like Dell’s Statistica are capable of taking large amounts of data and processing it using very sophisticated multivariate regression algorithms to help businesses answer their questions about the cause and effect of certain business actions. Ultimately, this is the question businesses need to answer: what set of actions produces the desired outcomes (and conversely, what data can predict adverse outcomes that I need to avoid)?
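To make the idea of regression-based prediction concrete, here is a minimal sketch of ordinary least squares with two predictors, solved through the normal equations. It is far simpler than what a commercial analytics product provides and makes no claim about how Statistica works internally. The data set (discount percentage and temperature as predictors of units sold) is hypothetical and was constructed to follow an exact linear rule so that the fitted coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept term
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Apply the fitted intercept and slopes to one new observation."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Hypothetical history: (discount %, temperature) -> units sold.
# Constructed as units = 50 + 4 * discount - 1 * temperature.
history = [(0, 20), (10, 20), (0, 30), (10, 30), (5, 25), (20, 15)]
units_sold = [30, 70, 20, 60, 45, 115]

coefficients = fit_linear_regression(history, units_sold)
forecast = predict(coefficients, (15, 22))
print([round(c, 3) for c in coefficients], round(forecast, 3))
```

A fitted model of this kind describes how outcomes have moved with the inputs in past data. Whether changing an input will actually cause the outcome to change is a separate question that still calls for business judgment and testing.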

So What Is the Role of the Cloud in All of This?

Well, as we just discussed, the data sources needed to answer the questions that businesses are posing are highly distributed, with many living in the cloud (like social media) and others only reachable through the cloud (the Internet of Things will amplify this trend). So it makes sense that the cloud should be the starting point for integrating data. The cloud is also a good place to do some of the data transformation.

While some large companies may have their own Hadoop clusters, most will likely want to take advantage of the convenience and cost of a cloud-based environment, especially as the usage may be occasional. The episodic nature of many of these projects makes the cloud the right place to do the integration and transformation steps without committing large amounts of capital to the project. Most cloud-based services are priced on a usage basis, which has the advantage of conserving scarce capital and minimizing expense. The third step of this process, the analytics phase, can be done in the cloud or on-premises, and our studies suggest that many customers are doing both. They may set up their infrastructure in a public cloud environment but, as they move into production mode, bring it on-premises so that they can manage it more securely.

The ability of cloud environments to scale elastically, which allows a smooth expansion of computing capacity with demand, is a real advantage. In the previously mentioned EMA survey, Dell’s Statistica product was the most frequently used cloud analytics tool across a variety of cloud environments (Azure, Google and AWS being the most frequently mentioned), although in 30 percent of cases it was used in conjunction with an on-premises solution.

The new business applications of the future are likely to be those that bring together large amounts of data from disparate sources like social media and the Internet of Things, and use predictive analytics to gain insights into future business outcomes. The cloud will play a key role in the integration, transformation and analysis of this data, and Dell is deeply committed to supplying its customers and partners with the infrastructure and tools they need. It is clear that you, as business leaders, are moving forward to harness the power of the cloud and use it to bring big data down to earth, and we at Dell are proud to be your partners on that journey.

About the Author: John Swainson