An IoT solution also raises several data storage considerations. The following sections discuss the main ones.
Storing data on the device
The critical telemetry data is generated on the device, or prior to getting to the device in the case of a gateway. The data may be cached and preprocessed on the device. Reasons for doing this include optimizing the amount of data sent, keeping “noise” data out of the analysis, saving on storage costs at the central storage location, minimizing transmission time or cost, and accounting for unreliable connectivity. If data will be stored on the device, either temporarily or permanently, there are several local storage considerations, such as security, reliability, and capacity. The solution architect also needs to consider the implications of losing the data, whether the data will expire on the device if it cannot be sent to external storage, and how the system will detect and recover from missing data should a local outage occur.
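As an illustration of these considerations, the following is a minimal sketch of a device-side store-and-forward cache. The class name `DeviceBuffer`, its parameters, and the `send` callback are hypothetical and not part of any particular device SDK. The sketch assumes a bounded capacity, an expiry time for unsent readings, and a sequence number on every reading so that the back end could detect gaps.

```python
import time
from collections import deque


class DeviceBuffer:
    """Bounded local cache for telemetry awaiting upload (hypothetical helper)."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self.items = deque()   # entries are (sequence_number, timestamp, payload)
        self.next_seq = 0
        self.dropped = 0       # readings lost to overflow or expiry

    def add(self, payload):
        # When local storage is full, discard the oldest reading first.
        if len(self.items) >= self.capacity:
            self.items.popleft()
            self.dropped += 1
        self.items.append((self.next_seq, self.clock(), payload))
        self.next_seq += 1

    def expire(self):
        # Remove readings that have waited longer than the allowed lifetime.
        now = self.clock()
        while self.items and now - self.items[0][1] > self.ttl:
            self.items.popleft()
            self.dropped += 1

    def flush(self, send):
        """Send oldest-first and stop at the first failure, keeping the rest."""
        self.expire()
        sent = 0
        while self.items:
            seq, _, payload = self.items[0]
            if not send(seq, payload):
                break
            self.items.popleft()
            sent += 1
        return sent
```

Because sequence numbers keep increasing even when readings are dropped, a receiver that sees numbers 1 and 2 but never 0 can tell that data went missing on the device.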
Transforming data
Generally, the data goes through multiple transformation steps as it is generated, sent, stored, and processed. As stated in the previous section, some transformation may happen on the device itself, such as format conversion or aggregation, and this relies on local processing capabilities. Apart from this local preprocessing, any other transformation happens at the collection point.
For years, data processing has been thought of in terms of Extract, Transform, and Load (ETL). With the advent of Big Data, much of the discussion has changed to Extract, Load, and then Transform (ELT). The key concept in this transition is that your system is ingesting a huge amount of data, and the transformation process costs significant compute power. Additionally, while this transformation is happening, the data is at risk: if it has not yet been written to storage and the server crashes, the data is lost. With ELT, the system ingests the data and immediately stores it. This minimizes the exposure of the data during ingestion, and provides new opportunities for data transformation and analysis. First, the data can be transformed asynchronously from ingestion, which helps reduce compute demand at ingestion time. Second, the data can be transformed multiple times, for multiple purposes. This process also supports the idea of collecting all data for extended periods of time. This is often referred to as a “data lake”75, and this strategy suggests keeping “all” data for later analysis. The rationale for this is that machine learning algorithms may find interesting patterns or trends that would not be expected, and that these would warrant studying other seemingly unneeded data.
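The ELT idea can be sketched in a few lines. This is an illustrative sketch only: the in-memory list `raw_store` stands in for durable storage, and the function names and the message fields (`device`, `temp_c`) are assumptions made for the example.

```python
import json

raw_store = []  # stand-in for durable storage such as a file or blob container


def ingest(message: str) -> None:
    """Load step: persist the message exactly as received, without parsing."""
    raw_store.append(message)


def transform(projection):
    """Transform step: run later, and as often as needed, over the raw data."""
    return [projection(json.loads(message)) for message in raw_store]


def to_fahrenheit(reading):
    # One possible view of the data, derived after ingestion.
    return {"device": reading["device"], "temp_f": reading["temp_c"] * 9 / 5 + 32}


def device_only(reading):
    # A second, independent view derived from the same raw records.
    return reading["device"]
```

Because `ingest` does no parsing, a malformed or unexpected message cannot fail at ingestion time. The raw records also remain available for transformations that nobody has thought of yet.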
Location
Most IoT solutions will send data to a public or private cloud. If connected devices are geographically distributed, there may be a case for storing the data across several locations around the globe, in order to store the data closest to where it was generated. There may also be government mandates that require an individual’s data to remain in that person's home country, or the data may only be interesting within the region within which it was collected. However, in a large percentage of projects, the value is in the large body of data, so data must be brought together into a single location for the most insightful analysis. In this case, the considerations will center on the time constraints of the analysis (how often are the algorithms run?), the physical limitations of the data centers, bandwidth, and the cost of moving data.
Longevity, format, and cost
After the data reaches its long-term storage point, there are decisions to be made about how to govern it. A data retention policy must be defined. The arguments for long data retention periods are that cloud storage is inexpensive and getting cheaper all the time, and that data scientists want data saved in case a new insight is discovered that warrants looking at data that was previously uninteresting. Even with those benefits, the costs for large volume data storage can add up, and the data could become unmanageable if you do not have a basic plan for how to store, access, and retrieve it. The terms Data Temperature and Hot and Cold storage76 also come up in this context. The concept centers on how frequently the data is accessed, and how quickly the users or systems expect to be able to use it. Hot data is frequently accessed, and users expect good response times. Cold data is less frequently accessed, and slower response times are acceptable. Classifying data in this manner allows the architects to choose faster and potentially more expensive storage for hot data and lower cost options for cold data.
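A retention and temperature policy of this kind can be expressed as a small rule. The sketch below is illustrative: the 30-day hot window, the two-year retention period, and the function name `storage_action` are assumptions chosen for the example, not recommendations.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)    # assumed: accessed within 30 days counts as hot
RETENTION = timedelta(days=730)    # assumed: data older than two years is deleted


def storage_action(created, last_accessed, now):
    """Return 'delete', 'hot', or 'cold' for one dataset."""
    if now - created > RETENTION:
        return "delete"
    if now - last_accessed <= HOT_WINDOW:
        return "hot"
    return "cold"
```

In practice, the thresholds would come from the data retention policy and from the response times that users expect.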
The format for long-term data storage also needs to be carefully considered. Should it be optimized for Hive queries, or should it be as compact as possible? Or should there be a “fresh” data store with more recent data that is easy to access, process, and query, alongside an archive that is compressed and stored in a way that minimizes cost but adds overhead if and when it needs to be accessed? All of these considerations factor into the decision on how best to store the data.
m. Processing information
After the data is ingested, it must be processed. Processing types range from very simple to long-running and complex. The following sections discuss common IoT data-processing types.
Alarm processing
A common use case is to watch for specific data items on ingestion and then take action based on that data. These could be alarms from devices, or any kind of simple event processing. The characteristic of this type of processing is that specific values are monitored on specific attributes of the incoming data, and those values trigger predetermined responses. While this type of event processing is logically straightforward, the implementation still requires consideration because of the expected high volume of data being ingested, and the likelihood that the events that must be responded to are important.
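A minimal sketch of this kind of rule evaluation follows. The attribute names, thresholds, and action names in `RULES` are hypothetical and exist only to show the shape of the technique.

```python
import operator

# Each rule: (attribute, comparison, threshold, predetermined response).
# All names and values here are illustrative assumptions.
RULES = [
    ("temperature_c", operator.gt, 80.0, "shut_down_device"),
    ("battery_pct", operator.lt, 10.0, "notify_owner"),
]


def evaluate(message):
    """Return the responses triggered by one incoming message (a dict)."""
    return [
        action
        for attribute, compare, threshold, action in RULES
        if attribute in message and compare(message[attribute], threshold)
    ]
```

Keeping the rules in a table rather than in code paths makes each message cheap to evaluate, which matters at high ingestion volumes.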
In alarm processing, the solution must also account for the potential of alarm floods. A systemic failure can cause one: for instance, if home alarm systems send an alarm to the event processing system when the power goes out, a widespread outage produces a flood of alarms. Alternatively, if there is no battery backup, messages may be cached on the devices, and when the power returns, all the devices send their entire set of messages at once. To handle these situations, the devices may be designed to have a random offset for message delays, or the message receiving service can implement a circuit breaker pattern77 to circumvent failure when an abnormal event pattern happens.
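Both mitigations can be sketched briefly. The code below is a simplified illustration, not a full circuit breaker implementation: the names `send_delay` and `FloodBreaker` and all parameters are hypothetical, and the breaker trips on arrival rate rather than on downstream failures, which is one possible reading of the pattern for this scenario.

```python
import random
from collections import deque


def send_delay(max_offset_seconds, rng=random):
    """Device side: pick a random wait before sending after power returns."""
    return rng.uniform(0, max_offset_seconds)


class FloodBreaker:
    """Receiver side: reject messages for a cooldown period after a flood."""

    def __init__(self, limit, window, cooldown):
        self.limit = limit          # maximum arrivals tolerated per window
        self.window = window        # window length in seconds
        self.cooldown = cooldown    # how long the breaker stays open
        self.arrivals = deque()     # timestamps of recent arrivals
        self.open_until = None

    def accept(self, now):
        if self.open_until is not None:
            if now < self.open_until:
                return False        # breaker is open: shed load
            self.open_until = None  # cooldown over: close the breaker
            self.arrivals.clear()
        # Forget arrivals that have fallen out of the window.
        while self.arrivals and now - self.arrivals[0] >= self.window:
            self.arrivals.popleft()
        self.arrivals.append(now)
        if len(self.arrivals) > self.limit:
            self.open_until = now + self.cooldown
            return False
        return True
```

Rejected messages would need to be retried by the devices or diverted to a queue, depending on how important each alarm is.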
Complex-event processing is used to detect conditions or states on data in motion that may not be directly deduced from simple data evaluation. This might include the detection of a certain set of events that arrive in a particular order or frequency, such as an event that is innocuous if it appears once, but that indicates a problem if it occurs a certain number of times in a certain timeframe, or if the same event is transmitted from a set of devices or sensors. Imagine that your car sends telemetry to the manufacturer, and one of the items that it reports is failed starts. By itself, this would mean very little to the manufacturer. However, if the weather got very cold last night, and none of the SuperCar Model 8s in that area started in the morning, that could tell the manufacturer that there is a systemic problem with the car's battery or something related to the starting system.
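The failed-start scenario can be sketched as a sliding-window check over events from many devices. This is an illustrative sketch under stated assumptions: the class name `FailedStartDetector`, the region and device identifiers, and the thresholds are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class FailedStartDetector:
    """Flag a region when enough distinct devices report a failed start."""

    def __init__(self, min_devices, window_seconds):
        self.min_devices = min_devices
        self.window = window_seconds
        self.events = defaultdict(deque)  # region -> (timestamp, device_id)

    def observe(self, timestamp, region, device_id):
        """Record one failed start; return True if the pattern is detected."""
        recent = self.events[region]
        recent.append((timestamp, device_id))
        # Drop events that are older than the window.
        while recent and timestamp - recent[0][0] > self.window:
            recent.popleft()
        distinct_devices = {device for _, device in recent}
        return len(distinct_devices) >= self.min_devices
```

A single event, or repeated events from one car, never triggers the detector. Only the combination of several devices in one region within the timeframe does, which is what distinguishes this from simple alarm processing.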
The industry sees complex event processing as one of the keys to monetizing the vast opportunity of IoT.78 When envisioning the solution, ensure that initial requirements are discussed early in the project. This is an area where businesses will learn and improve over time, but one which should be prototyped early in the process to prove out the concepts, and to begin to develop the right mindset for capitalizing on the opportunities. This is a rich area of development within Microsoft, our competitors, and the open source community. Microsoft has developed StreamInsight,79 which can be deployed in the cloud. A popular open source project is Apache Storm80 for real-time stream processing, and Amazon is offering Kinesis for their cloud solutions, which includes stream processing.
Big Data analysis
One of the main drivers for IoT is the ability to economically collect and store large amounts of data. After the data is collected, it must be processed, aggregated, and analyzed to create datasets that can be visualized and used for business analysis, to inform business decisions and strategy, to feed back into product engineering to improve products, or to provide views of the data that can be shared with partners for monetization or to add value to the business relationship.
The most common approach for this is to use the Map/Reduce81 pattern to batch process collected data. Apache Hadoop is the predominant implementation of that pattern, and Microsoft provides HDInsight, which is a cloud platform service implementation of Hadoop. The approach may be as simple as aggregating and summarizing data for simpler reuse, or it may be complex, multi-step processing that generates insights across the recently collected and historical data. Hadoop includes many tools within its ecosystem that help with searching, querying, and cataloging the data. In solutions today, Hadoop will frequently be used to preprocess data, such that Hadoop jobs will run and create summarized datasets that can be used for querying, reporting, and as input to machine learning activities, or as reference datasets in Complex Event Processing solutions.
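The pattern itself can be shown without a cluster. The following single-process sketch only illustrates the map, shuffle, and reduce phases; it does not use Hadoop or HDInsight, and the record fields (`device`, `value`) and function names are assumptions made for the example.

```python
from collections import defaultdict


def map_reading(record):
    """Map phase: emit a (key, value) pair for each input record."""
    yield record["device"], record["value"]


def shuffle(pairs):
    """Shuffle phase: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_average(key, values):
    """Reduce phase: summarize the values collected for one key."""
    return key, sum(values) / len(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_reading(record))
    return dict(reduce_average(key, values) for key, values in shuffle(pairs).items())
```

In a real deployment, the framework distributes the map and reduce calls across machines. The summarized output is the kind of dataset that can then feed querying, reporting, or machine learning.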
Machine learning
Machine learning refers to the concept of studying data and deriving insights from the data. The results will be a model that can be used to predict future outcomes from similar data sets. The first step is to train the model. This is normally an iterative step performed by a data scientist where a training set of data is used to infer a function, or model, from that data. That model will be used to make decisions on incoming data. The model is typically retrained periodically, so that the model can improve over time, learning from additional new data and patterns.
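The train, predict, and retrain cycle can be sketched with the simplest possible model. The example below fits a straight line by ordinary least squares, purely as an illustration. The function names are hypothetical, and a real project would normally use one of the tools named in this section rather than hand-written code.

```python
def train(xs, ys):
    """Infer a model (slope, intercept) from a training set by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the trained model to a new, incoming data point."""
    slope, intercept = model
    return slope * x + intercept
```

Retraining amounts to calling `train` again on a data set that includes the newly collected points, and then replacing the model that is used for prediction.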
Machine learning falls into two broad categories: supervised learning and unsupervised learning. Supervised learning studies the data looking for a known set of desired outcomes. In other words, in the vehicle scenario, I may want to minimize the number of times that a car needs its oil changed. So I would run studies against the data looking for patterns that give me information about the consequences of delaying oil changes, conditions, and so on. In unsupervised learning, the concept is to naturally find patterns and relationships of any kind in the data. After something interesting is observed, then these data points will be further investigated until they are found to be either useful or not useful.
Common tools for machine learning include MATLAB82, Mahout83 and R84. In June 2014, Microsoft introduced its own machine learning tooling, Azure ML.85 Azure ML is a machine learning service that democratizes the practice of machine learning. It provides a visual experience for constructing data experiments, and easy-to-use implementations of many commonly used machine learning algorithms, relieving the data scientist of implementing them in a programming language. Azure ML integrates easily with Azure Storage, HDInsight, and Windows Azure SQL Database, and it can expose the models as web services so that they are simple to integrate into the runtime data flow or applications.
Data enhancement
Another core piece of the IoT architecture is data enhancement. The data collected from the devices, the volume of it, and the hidden patterns within it provide tremendous value. Often, however, combining the device data with other data is either critical for it to make sense to the business, or yields even more significant value when the additional data sets are analyzed together with the device data. Enterprise data may be used for simple things, such as relating device data to customer data. Other areas of opportunity include data markets that publish datasets that are either sold or available for free. Microsoft offers the Azure DataMarket86, which offers datasets from governments, research institutions, historical, environmental, business organizations, and more. One of the datasets most frequently combined with device data is weather. Devices often exist all over the globe in different conditions, so predictive maintenance will frequently factor in weather data, which is normally sourced from weather data providers as opposed to being collected by the device itself.
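Combining device data with an external data set is essentially a join on shared keys. The sketch below is illustrative: the field names (`region`, `day`), the lookup structure, and the function name `enrich` are assumptions, and the weather values are made-up sample data rather than output from any provider.

```python
def enrich(readings, weather_by_region_day):
    """Attach weather data to each device reading, keyed by region and day."""
    enriched = []
    for reading in readings:
        key = (reading["region"], reading["day"])
        # A reading with no matching weather record gets None instead.
        enriched.append({**reading, "weather": weather_by_region_day.get(key)})
    return enriched
```

The same approach applies to enterprise data, for example looking up a customer record by device identifier.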