NA Checking

Philip Waggoner

August 07, 2023

Introductory Remarks

A common part of any missing data / imputation project is checking for missingness. This is often done prior to modeling, during fitting in some cases, and after imputation to ensure missingness is sufficiently addressed. In many cases, missingness can look like it’s dealt with form one view, but still be present from another. Regardless of why, there exists the need for constantly checking for missingness, and where it exists in the data throughout the process.

The latest release of hdImpute includes two helper functions to check for missingness (“NA”) at the column and row levels. The functions are:

Overview of the Functions

These helpers aren’t complex in architecture. But rather, they are intended to provide the practical information needed to assess missingness. That is, rather than add up all missing cases, or find the mean missingness in a column (both of which are still helpful), these helpers are intended to help the user find and assess by returning the amount of missingness and also where it exists in the data.

The first column-wise function, check_feature_na(), takes two inputs: data and threshold. The data is simply the dataframe under consideration (the “original” data). The threshold is the level of missingness in a given column/feature as a proportion bounded between 0 and 1. Default set to sensitive level at 1e-04. That is, how much missingness do you want to locate in a given column? Return the column name and the rate of missingness for the column, for all columns that are at the supplied threshold or greater.

The second row-wise function, check_row_na(), is similar in scope but focuses at the case / observation level. This function also takes two inputs: data and which. As before, data is the raw, original data with missingness. which is optional, and logical. Its default is set to FALSE. But if TRUE, check_row_na() then returns a list with the row numbers (indices) corresponding to each row with missingness. So, if which = FALSE (the default), then the return is simply the number of rows in data with any missingness. But if which = TRUE, then the return is a tibble containing the number of rows in data with any missingness, and a list of which rows/row numbers contain missingness.

A Few Examples

First, load the library along with the tidyverse library for some additional helpers in setting up the sample data space.

library(hdImpute)
library(tidyverse)

Next, set up the data and introduce missingness completely at random (MCAR) via the prodNA() function from the missForest package. Take a look at the synthetic data with missingness introduced.

d <- data.frame(X1 = c(1:6), 
                X2 = c(rep("A", 3), 
                       rep("B", 3)), 
                X3 = c(3:8),
                X4 = c(5:10),
                X5 = c(rep("A", 3), 
                       rep("B", 3)), 
                X6 = c(6,3,9,4,4,6))

set.seed(1234)

data <- missForest::prodNA(d, noNA = 0.30) %>% 
  as_tibble()

data
#> # A tibble: 6 × 6
#>      X1 X2       X3    X4 X5       X6
#>   <int> <chr> <int> <int> <chr> <dbl>
#> 1     1 <NA>      3     5 A         6
#> 2    NA A         4     6 A         3
#> 3     3 <NA>      5     7 A         9
#> 4    NA B        NA    NA <NA>      4
#> 5    NA B         7     9 B        NA
#> 6    NA B         8    10 B         6

Note: This is a tiny sample set, but hopefully the usage is clear enough.

check_feature_na()

First, let’s take a look at a few variations of the column-wise checks. The default behavior is with threshold = 1e-04, functionally asking to return a column with even a tiny amount of missingness (0.0001):

check_feature_na(data = data)
#> [1] "X1" "X2" "X3" "X4" "X5" "X6"

Now, consider an adjustment to the missingness threshold, setting threshold = 0.5 (or 50%), which is a much less sensitive check:

check_feature_na(data = data,
         threshold = 0.5)
#> [1] "X1"

And finally, let’s verify our check is working properly by looking at the original data with no missingness. This should return 0 columns, as we set up the original data (d) to be complete:

check_feature_na(d)
#> character(0)

All looks as it should.

check_row_na()

Now, let’s take a look at row-wise missingness checks.

First, how many rows have missingness of any amount? (the default behavior)

check_row_na(data)
#> [1] 6

Next, how many rows have missingness of any amount and which rows are they?

check_row_na(data, which = TRUE)
#> # A tibble: 1 × 2
#>   n_rows_missing which_rows
#>            <int> <list>    
#> 1              6 <int [6]>

There’s the list, but what are the indices? We need to unnest() the list to see:

check_row_na(data, which = TRUE)[2] %>%
  unnest(cols = c(which_rows))
#> # A tibble: 6 × 1
#>   which_rows
#>        <int>
#> 1          1
#> 2          2
#> 3          3
#> 4          4
#> 5          5
#> 6          6

And finally, as before, what about the original data with no missingness? (should be 0)

check_row_na(d)
#> [1] 0

Hopefully the value of these simple helpers is clear, allowing for iterative NA checking throughout the imputation process.

Concluding Remarks

This software is being actively developed, with many more features to come. Wide engagement with it and collaboration is welcomed! Here’s a sampling of how to contribute:

Thanks for using the tool. I hope its useful.