What to Do with Incomplete Data? Imputation for Missing Values in Time Series
In the context of the Internet of Things (IoT), sensors continuously provide time series data, but these data often contain gaps that threaten the quality and significance of an analysis. This is where data imputation comes into play: a method that estimates missing values so that insights drawn from incomplete data remain reliable. This article covers the advantages and capabilities of data imputation, and the circumstances under which a method is applicable to the different types of missing data.
What is Imputation? A Definition
Imputation is a statistical procedure for filling in missing values in a meaningful way. The range of possible methods starts with simple statistics such as the mean, median, or mode, and extends to more sophisticated techniques such as multiple imputation by chained equations (MICE) or expectation-maximization (EM) imputation. At the far end of the spectrum are complex neural networks that are trained on an individual dataset and then used to impute its missing values.
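The simple statistics mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name impute_constant and the sample readings are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import mean, median, mode

def impute_constant(values, strategy="mean"):
    """Replace every None with one statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    stat = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = stat(observed)
    return [fill if v is None else v for v in values]

# Hypothetical temperature readings with two gaps and one outlier (35.0).
readings = [20.0, None, 22.0, 21.0, None, 35.0]
filled_mean = impute_constant(readings, "mean")      # gaps become 24.5
filled_median = impute_constant(readings, "median")  # gaps become 21.5
```

Note how the single outlier pulls the mean fill upward while the median fill stays close to the typical reading. Both strategies ignore the order of the readings, which is why they are a poor fit for time series.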
The imputation of time series data presents unique challenges. As there is always a sequence or timestamp, it is crucial to preserve the temporal structure. Estimating missing values at any point in time without considering the sequence or temporal context would distort the analysis. Consequently, specialized imputation methods are necessary to account for temporal dependencies and achieve a realistic completion of the data.
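One simple way to respect the temporal structure is linear interpolation weighted by elapsed time rather than by position in the list. The sketch below is one possible approach, not a recommendation for every dataset; it assumes strictly increasing timestamps and None for missing values, and the function name interpolate_by_time is hypothetical.

```python
from datetime import datetime

def interpolate_by_time(timestamps, values):
    """Fill None entries linearly, weighted by elapsed time between neighbors.

    Gaps at the very start or end are left as None because they have
    an observed neighbor on one side only.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = (timestamps[right] - timestamps[left]).total_seconds()
        for i in range(left + 1, right):
            elapsed = (timestamps[i] - timestamps[left]).total_seconds()
            frac = elapsed / span
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

# Irregularly spaced readings: 0, 10, 40, and 60 minutes after midnight.
timestamps = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 0, 10),
    datetime(2024, 1, 1, 0, 40),
    datetime(2024, 1, 1, 1, 0),
]
values = [10.0, None, None, 16.0]
filled = interpolate_by_time(timestamps, values)
```

Because the timestamps are irregular, the two gaps are filled with roughly 11.0 and 14.0. Interpolating over the list position alone would give 12.0 and 14.0, which misplaces the first value in time.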
What are the underlying causes of missing data?
Missing data from IoT sensors can result from a number of factors, including connection loss, network delay, or low batteries. While a failure due to an empty battery may be frustrating, it is usually easy to explain. In such cases, the data can simply be added later or left out of the analysis. However, the underlying causes are frequently more complex and sporadic. For instance, incorrectly calibrated sensors with inaccurate measuring ranges may only become visible as gaps in the data when viewed in a wider context. In such cases, a thorough diagnosis is necessary to understand the true nature of the gaps and their underlying causes.
The following three types of missing data are distinguished as basic categories:
Missing Completely at Random (MCAR)
In the simplest case, the values are missing according to a purely random pattern. This is illustrated on the left in the example. It can be observed that there is no connection between the missing values and the color. Furthermore, there is no correlation between the occurrence of missing values and other features. The values are missing completely at random.
Missing at Random (MAR)
If values are missing at random (MAR), the probability that a value is missing depends on the observed values of other features, not on the missing value itself. In the example, this pattern is shown in the middle: values in feature 1 are consistently absent whenever feature 2 has a red value.
Missing Not at Random (MNAR)
With MNAR, values are missing according to a pattern that cannot be seen in the observed data: the missingness depends on the missing values themselves or on unobserved features. In the illustration, this is shown on the right. Here, values are always absent if they are red. As a result, the observed dataset contains no red values at all, even though red values exist in the complete dataset.
Particularly for this category of missing values, it is inadvisable to disregard missing values and delete them row by row. Instead, it is recommended to examine the context of the incomplete data.
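The three categories can be made concrete with a small simulation. The sketch below uses hypothetical toy data: feature_1 is a numeric reading and feature_2 a fully observed color label. For MNAR it uses a numeric analogue of the color example, hiding a value whenever the value itself is high; the threshold of 60.0 and the seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: a numeric reading and a color label per row.
n = 1000
feature_1 = [rng.gauss(50.0, 10.0) for _ in range(n)]
feature_2 = [rng.choice(["red", "blue"]) for _ in range(n)]

def mean_observed(values):
    """Mean over the entries that are not None."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

# MCAR: hide 200 positions chosen uniformly at random.
mcar_idx = set(rng.sample(range(n), 200))
mcar = [None if i in mcar_idx else v for i, v in enumerate(feature_1)]

# MAR: feature_1 is hidden whenever the observed feature_2 is "red".
mar = [None if c == "red" else v for v, c in zip(feature_1, feature_2)]

# MNAR: feature_1 is hidden whenever its own value is high.
mnar = [None if v > 60.0 else v for v in feature_1]

true_mean = sum(feature_1) / n
print(f"true mean: {true_mean:.2f}")
print(f"MCAR mean: {mean_observed(mcar):.2f}")
print(f"MAR mean:  {mean_observed(mar):.2f}")
print(f"MNAR mean: {mean_observed(mnar):.2f}")
```

In this setup the MNAR mean is necessarily lower than the true mean, because exactly the high readings are the ones that disappear. Deleting the incomplete rows would silently carry that bias into any later analysis.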
What are the capabilities of imputation?
A comprehensive diagnosis is the first step toward filling in missing data in a meaningful way. Before applying an imputation method, it is essential to understand the causes and patterns behind the data gaps, for example whether they occur after specific events. A systematic investigation into the nature of these gaps makes it possible to select an appropriate imputation approach and to identify potential weaknesses in the system.
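A first diagnostic step can be as simple as listing where the gaps are and how long they last. The following sketch assumes a sensor with a known, fixed sampling interval and sorted timestamps; the function name find_gaps and the tolerance factor of 1.5 are hypothetical choices.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end, missing_count) for each pause between consecutive
    readings that is longer than tolerance * expected_interval."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > expected_interval * tolerance:
            # Number of readings that should have arrived in between.
            missing = round(delta / expected_interval) - 1
            gaps.append((prev, curr, missing))
    return gaps

# Hypothetical sensor that should report once per minute.
start = datetime(2024, 1, 1, 0, 0)
minute_offsets = [0, 1, 2, 5, 6, 16]
timestamps = [start + timedelta(minutes=m) for m in minute_offsets]

gaps = find_gaps(timestamps, expected_interval=timedelta(minutes=1))
for gap_start, gap_end, missing in gaps:
    print(f"{gap_start:%H:%M} to {gap_end:%H:%M}: {missing} readings missing")
```

Comparing the resulting gap list with known events such as battery changes or network outages can help to decide whether a gap looks random or systematic.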
The application of good imputation practices can enhance the quality of data, but there are inherent limitations to this approach. If the imputation methodology is not well-suited to the task, it can introduce bias into the data. Models trained on such “augmented” data may be susceptible to bias, leading to distorted results. Therefore, it is essential to select the imputation method carefully and to assess its alignment with the underlying data pattern.
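One common way to check whether a method suits the data is to hide values that are actually known, impute them, and measure the error. The sketch below compares a mean fill with linear interpolation on a hypothetical, perfectly smooth series; the series, the hidden positions, and all function names are assumptions made for illustration only.

```python
import math

def fill_mean(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def fill_linear(values):
    """Linear interpolation over the index; edge gaps copy the nearest value."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

def masked_mae(truth, imputer, hidden_idx):
    """Hide known values, impute them, and return the mean absolute error."""
    masked = [None if i in hidden_idx else v for i, v in enumerate(truth)]
    filled = imputer(masked)
    return sum(abs(filled[i] - truth[i]) for i in hidden_idx) / len(hidden_idx)

# Hypothetical complete series: one smooth cycle sampled 96 times.
truth = [20.0 + 5.0 * math.sin(2 * math.pi * i / 96) for i in range(96)]
hidden = set(range(3, 93, 6))  # hide every sixth reading (15 isolated points)

mae_mean = masked_mae(truth, fill_mean, hidden)
mae_linear = masked_mae(truth, fill_linear, hidden)
print(f"MAE mean fill:   {mae_mean:.3f}")
print(f"MAE linear fill: {mae_linear:.3f}")
```

On this smooth series with isolated single-point gaps, linear interpolation has a much lower error than the mean fill. Real gaps are often longer and contiguous, so the hidden positions should imitate the gap pattern found in the actual data before conclusions are drawn.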
Conclusion
Missing data imputation is a process that extends beyond just filling gaps. It is a crucial method for enhancing the quality of data. By employing targeted diagnostics and selecting appropriate techniques, it is possible to make IoT data more reliable and meaningful. This illustrates that imputation is not only a tool for data preparation but also the foundation for reliable findings and improved decision-making in data-driven processes.