

Knowledge Discovery from Sensor Data

March 1, 2006 By: Pang-Ning Tan, Michigan State University Sensors


Anomaly detection, also known as outlier or deviation detection, seeks to identify unusual activity in the data. A common approach is to construct a profile of the data's normal behavior (Figure 5) and use it to compute anomaly scores for other observations. Widely used anomaly detection techniques include statistical approaches, such as Grubbs' test and boxplots, as well as distance-based, density-based, and cluster-based techniques. In the distance-based approach, the normal profile corresponds to the average distance between each observation and its kth closest neighbor. If the distance from a given observation to its kth closest neighbor is significantly higher than the overall average, the observation may be regarded as an anomaly.
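The distance-based approach can be illustrated with a minimal Python sketch. The function names, the sample readings, and the cutoff rule (mean plus a multiple of the standard deviation of the kth-neighbor distances) are assumptions made for illustration; the article says only that the distance should be "significantly higher" than the average. The brute-force neighbor search is also a simplification that would not scale to large data sets.

```python
import math
from statistics import mean, stdev

def kth_neighbor_distances(points, k):
    """Return, for each point, the Euclidean distance to its kth closest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    distances = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return distances

def flag_anomalies(points, k=2, n_sigmas=2.0):
    """Flag points whose kth-neighbor distance is far above the overall average.

    The cutoff (mean + n_sigmas * standard deviation) is an assumed rule,
    not one prescribed by the article.
    """
    d = kth_neighbor_distances(points, k)
    threshold = mean(d) + n_sigmas * stdev(d)
    return [dist > threshold for dist in d]

# Hypothetical (temperature, humidity) readings; the last one is unusual.
readings = [(20.1, 0.50), (20.3, 0.52), (19.9, 0.49), (20.0, 0.51),
            (20.2, 0.50), (19.8, 0.48), (35.0, 0.90)]
flags = flag_anomalies(readings, k=2)
```

Because the average itself is inflated by the anomalies it is meant to expose, a median-based cutoff may be a more robust choice in practice.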

Figure 5. Detection of anomalies from time series segments; the time series in the upper-left diagram is normal, while the rest contain anomalies of some type

The key challenge of an anomaly detection algorithm is to maintain a high detection rate while keeping the false alarm rate low. This requires the construction of an accurate and representative normal profile, a task that can be very difficult for large-scale sensor network applications.
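The two quantities in this trade-off can be computed from labeled data as follows. This is a small sketch using the usual definitions (detection rate as the fraction of true anomalies that are flagged, false alarm rate as the fraction of normal observations that are flagged); the function name and the sample labels are hypothetical.

```python
def detection_and_false_alarm_rates(truth, predicted):
    """Return (detection rate, false alarm rate) for boolean label lists.

    truth[i] is True if observation i really is an anomaly;
    predicted[i] is True if the detector flagged it.
    """
    pairs = list(zip(truth, predicted))
    true_pos = sum(1 for t, p in pairs if t and p)
    false_neg = sum(1 for t, p in pairs if t and not p)
    false_pos = sum(1 for t, p in pairs if not t and p)
    true_neg = sum(1 for t, p in pairs if not t and not p)
    anomalies = true_pos + false_neg
    normals = false_pos + true_neg
    detection = true_pos / anomalies if anomalies else 0.0
    false_alarm = false_pos / normals if normals else 0.0
    return detection, false_alarm

# Hypothetical labels: two real anomalies, one caught, plus one false alarm.
truth     = [False, False, True, False, True,  False, False, False]
predicted = [False, True,  True, False, False, False, False, False]
detection_rate, false_alarm_rate = detection_and_false_alarm_rates(truth, predicted)
```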

Consider This

There are several issues we must consider when applying data mining techniques to sensor data. First, we need to determine the appropriate computational model. There are two general models of computation: centralized and distributed (peer-to-peer) [8]. In the centralized model, each sensor transmits the data it has collected to a central server, which fuses the sensor readings and performs extensive analysis on the aggregated data. An obvious drawback of this approach is its high consumption of energy and bandwidth. Furthermore, it does not scale to very large numbers of sensors. The distributed model, on the other hand, requires each sensor to perform some local computation before communicating its partial results to other nodes in order to obtain a global solution. This approach is more promising, but it requires every sensor to have an onboard processor with a reasonable amount of memory and computing power.
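To make the idea of partial results concrete, here is a minimal sketch in which each sensor reduces its readings to three numbers (count, sum, and sum of squares), and those summaries are merged to obtain the global mean and variance without shipping any raw data. The class and function names and the sample readings are assumptions for illustration. A real peer-to-peer deployment would also need a protocol for exchanging summaries between neighboring nodes, which is not shown here.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class PartialStats:
    """Compact summary a sensor transmits instead of its raw readings."""
    count: int
    total: float
    total_sq: float

    def merge(self, other: "PartialStats") -> "PartialStats":
        # Merging is associative and commutative, so summaries can be
        # combined in any order as they travel through the network.
        return PartialStats(self.count + other.count,
                            self.total + other.total,
                            self.total_sq + other.total_sq)

def summarize_locally(readings):
    """Local computation performed on the sensor itself."""
    return PartialStats(len(readings), sum(readings), sum(x * x for x in readings))

def global_mean_and_variance(partials):
    """Combine partial results into the global mean and population variance."""
    combined = reduce(PartialStats.merge, partials, PartialStats(0, 0.0, 0.0))
    if combined.count == 0:
        raise ValueError("no readings to summarize")
    mean = combined.total / combined.count
    # Sum-of-squares formula; it can lose precision for values with a large
    # mean, so it is clamped at zero here.
    variance = combined.total_sq / combined.count - mean * mean
    return mean, max(variance, 0.0)

# Hypothetical readings held by three separate sensors.
sensor_a = [20.1, 20.3, 19.9]
sensor_b = [21.0, 20.8]
sensor_c = [19.5, 19.7, 19.6, 19.4]
partials = [summarize_locally(r) for r in (sensor_a, sensor_b, sensor_c)]
mean, variance = global_mean_and_variance(partials)
```

Each sensor sends three numbers regardless of how many readings it holds, which is where the savings in energy and bandwidth would come from.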

The characteristics of sensor data present additional challenges to the data mining algorithm. These data tend to be noisy or to contain measurements with large degrees of uncertainty. Probability-based algorithms offer a more robust approach for handling such problems. Another issue to consider is missing values due to malfunctioning sensors. This problem can be addressed in many ways, either during preprocessing or during the mining step itself, e.g., by discarding the affected observations or estimating their true values based on the distribution of the remaining data. The massive streams of sensor data generated in some applications make it impossible to use algorithms that must store the entire data set in main memory. Online algorithms provide an attractive alternative to conventional batch algorithms for handling such large data sets. Finally, the data mining algorithm must consider the effect of concept drift, where characteristics of the monitored process may change over time and render the old models outdated. This problem can be addressed using a mechanism that helps the model to "forget" its previous information.
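One simple way to combine these ideas is an exponentially weighted running mean, sketched below. It processes one reading at a time in constant memory (an online algorithm), discounts older readings through a forgetting factor so the estimate can follow concept drift, and skips missing readings. The class name, the forgetting factor of 0.8, the use of None for a missing value, and the sample stream are all assumptions for illustration; the article does not prescribe a particular forgetting mechanism.

```python
class ForgettingMean:
    """Online mean estimate that gradually forgets old readings."""

    def __init__(self, forgetting_factor=0.8):
        if not 0.0 < forgetting_factor < 1.0:
            raise ValueError("forgetting_factor must be between 0 and 1")
        self.forgetting_factor = forgetting_factor
        self.mean = None  # no estimate until the first valid reading

    def update(self, value):
        """Fold one reading into the estimate and return the current mean."""
        if value is None:
            # Missing reading (e.g., a malfunctioning sensor): discard it
            # and keep the current estimate unchanged.
            return self.mean
        if self.mean is None:
            self.mean = float(value)
        else:
            lam = self.forgetting_factor
            # Old information is weighted by lam, new information by 1 - lam.
            self.mean = lam * self.mean + (1.0 - lam) * value
        return self.mean

# Hypothetical stream: the process level shifts from 20 to 30,
# with one missing reading at the point of the shift.
stream = [20.0] * 10 + [None] + [30.0] * 20
tracker = ForgettingMean(forgetting_factor=0.8)
for reading in stream:
    tracker.update(reading)
```

A smaller forgetting factor makes the estimate adapt faster after a shift but also makes it more sensitive to noise, so the value would need tuning for a given application.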

In short, data mining provides a suite of automated tools to help scientists and engineers uncover useful information hidden in large quantities of sensor data. It also provides an opportunity for data mining researchers to develop more advanced methods for handling some of the issues specific to sensor data. For a list of the many commercial and publicly available data mining software packages, see www.kdnuggets.com/software/index.html.

References

1. Proc 1st Intl Workshop on Data Mining in Sensor Networks, www.siam.org/meetings/sdm05/sdm-sensor-networks.zip, 2005.

2. P-N Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005.
