Chapter 4 Missing values

Let us analyze the missing patterns in our data frame

According to the output of plot_missing function, we can confirm that there are a total of 29 missing patterns in our dataset. But if we look at plots in detail, the most common pattern is the one that has no missing values, followed by the pattern with missing values only in the ‘grade’ column. These two cases account for almost all of the most frequent missing patterns. We can also infer from the upper side plot that ‘grade’ has about 50% of its missing values and all other columns have relatively much fewer proportions of missing values.

Since there are so many rows with missing values in ‘grade’, dropping these rows would not make sense. Instead, we will try to impute the ‘grade’ column values by finding exciting patterns in the dataset in further exploration.

Let us now try to see how many rows have a given number of missing values.

We can see that most of the rows in our dataset either have no missing values or have exactly one missing value. The reason for this is evident from the missing-patterns plot we showed above; since the column ‘grade’ has so many missing values. This will be fixed when we impute the values for ‘grade’. We will consider dropping rows with more than five missing values; this is reasonable as these rows are relatively less in quantity.

To better understand the pattern of missing values, we have to check for any placeholders for missing values. A nice strategy for this is to look for outliers.

Let us try to see when most of the inspections were conducted.

Here, we can see that the inspection date with year = 1900 is an outlier and possibly a placeholder for missing values. We confirm below by analyzing the missing patterns of the subset of our dataset with the inspection year 1900.

We can see that most of the data in rows with the inspection year equal to 1900 is missing. Therefore, we will consider dropping these rows.

Let us try to understand if there’s any relation between missing-patterns and borough.

## [1] Queens        Brooklyn      Manhattan     Bronx         Staten Island
## Levels: Bronx Brooklyn Manhattan Queens Staten Island

We can see that none of these seem to display any relation between missing-patterns and boroughs, which is worth investigating further.

4.1 Transformation

We drop the rows that have 1900 as their inspection year

##      camis                               dba      boro           street   zip   cuisine
## 1 41170643         LITTLE MOROCCO RESTAURANT    Queens  STEINWAY STREET 11103  Moroccan
## 2 41610426              GLOW THAI RESTAURANT  Brooklyn         3 AVENUE 11209      Thai
## 3 41711813                     BEDFORD FALLS Manhattan EAST   67 STREET 10065  American
## 4 41429907                      MARIO'S DELI  Brooklyn    NASSAU AVENUE 11222  American
## 5 41624260                 ZENG'S RESTAURANT  Brooklyn    FOURTH AVENUE 11232   Chinese
## 6 41530091 JOY & SNOOK RESTAURANT AND BAKERY  Brooklyn  NOSTRAND AVENUE 11216 Caribbean
##    insp_date     action vio_code
## 1 2019-11-04 violations      05H
## 2 2019-08-29 violations      08A
## 3 2016-07-22 violations      10I
## 4 2019-08-30 violations      04F
## 5 2018-12-10 violations      06D
## 6 2017-10-03 violations      22F
##                                                                                                                                         vio_desc
## 1                                                                 No facilities available to wash, rinse and sanitize utensils and/or equipment.
## 2             Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                              Single service item reused, improperly stored, dispensed; not used when required.
## 4                     Food, food preparation area, food storage area, area used by employees or patrons, contaminated by sewage or liquid waste.
## 5 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6                                                                                                                        MISBRANDED AND LABELING
##           flag score grade        insp_prog          insp_type latitude longitude
## 1     Critical    37  <NA> Cycle Inspection Initial Inspection 40.76848 -73.91123
## 2 Not Critical     9     A Cycle Inspection      Re-inspection 40.63386 -74.02680
## 3 Not Critical     9     A Cycle Inspection Initial Inspection 40.76636 -73.96250
## 4     Critical    13     A Cycle Inspection      Re-inspection 40.72597 -73.94060
## 5     Critical    28     C Cycle Inspection      Re-inspection 40.65439 -74.00434
## 6 Not Critical    29  <NA> Cycle Inspection Initial Inspection 40.67266 -73.95025

Further, consider

##    insp_year      n
## 1       2009      1
## 2       2013      8
## 3       2014     15
## 4       2015    116
## 5       2016  10400
## 6       2017  57401
## 7       2018  98407
## 8       2019 136652
## 9       2020  29512
## 10      2021  24635

We can see that there are very few observations from the years before 2016. In order to draw meaningful insights, we shall drop all rows with inspection year less than 2016 as we have very less data from this time period. Hence, we’ll analyse only the subset with inspection year from 2016 to 2021.

Let us now save this dataset for further analysis

##      camis                               dba      boro           street   zip   cuisine
## 1 41170643         LITTLE MOROCCO RESTAURANT    Queens  STEINWAY STREET 11103  Moroccan
## 2 41610426              GLOW THAI RESTAURANT  Brooklyn         3 AVENUE 11209      Thai
## 3 41711813                     BEDFORD FALLS Manhattan EAST   67 STREET 10065  American
## 4 41429907                      MARIO'S DELI  Brooklyn    NASSAU AVENUE 11222  American
## 5 41624260                 ZENG'S RESTAURANT  Brooklyn    FOURTH AVENUE 11232   Chinese
## 6 41530091 JOY & SNOOK RESTAURANT AND BAKERY  Brooklyn  NOSTRAND AVENUE 11216 Caribbean
##    insp_date     action vio_code
## 1 2019-11-04 violations      05H
## 2 2019-08-29 violations      08A
## 3 2016-07-22 violations      10I
## 4 2019-08-30 violations      04F
## 5 2018-12-10 violations      06D
## 6 2017-10-03 violations      22F
##                                                                                                                                         vio_desc
## 1                                                                 No facilities available to wash, rinse and sanitize utensils and/or equipment.
## 2             Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                              Single service item reused, improperly stored, dispensed; not used when required.
## 4                     Food, food preparation area, food storage area, area used by employees or patrons, contaminated by sewage or liquid waste.
## 5 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6                                                                                                                        MISBRANDED AND LABELING
##           flag score grade        insp_prog          insp_type latitude longitude
## 1     Critical    37  <NA> Cycle Inspection Initial Inspection 40.76848 -73.91123
## 2 Not Critical     9     A Cycle Inspection      Re-inspection 40.63386 -74.02680
## 3 Not Critical     9     A Cycle Inspection Initial Inspection 40.76636 -73.96250
## 4     Critical    13     A Cycle Inspection      Re-inspection 40.72597 -73.94060
## 5     Critical    28     C Cycle Inspection      Re-inspection 40.65439 -74.00434
## 6 Not Critical    29  <NA> Cycle Inspection Initial Inspection 40.67266 -73.95025