Chapter 3 Data transformation

Let us begin by loading the dataset

##      CAMIS                      DBA      BORO BUILDING             STREET ZIPCODE      PHONE
## 1 41393637       DELICIAS MEXICANAS Manhattan     2109           3 AVENUE   10029 2128283659
## 2 50013433                BALVANERA Manhattan      152     STANTON STREET   10002 2125333348
## 3 50008129   COCO FRESH TEA & JUICE    Queens    86-55           BROADWAY   11373 3477306730
## 4 50100295      EL BURRITO NACHO II  Brooklyn      747    FLUSHING AVENUE   11206 9292502087
## 5 41696471 MAISON HARLEM RESTAURANT Manhattan      341 ST NICHOLAS AVENUE   10027 2122229224
## 6 50098871            NORIBAR SUSHI  Brooklyn      326    KINGSTON AVENUE   11213 3479138888
##   CUISINE.DESCRIPTION INSPECTION.DATE                                          ACTION
## 1             Mexican      03/17/2017 Violations were cited in the following area(s).
## 2             Spanish      05/21/2019 Violations were cited in the following area(s).
## 3          Coffee/Tea      10/15/2018 Violations were cited in the following area(s).
## 4             Mexican      02/21/2020 Violations were cited in the following area(s).
## 5              French      01/28/2020 Violations were cited in the following area(s).
## 6       Jewish/Kosher      08/30/2021 Violations were cited in the following area(s).
##   VIOLATION.CODE
## 1            10D
## 2            10D
## 3            10H
## 4            02G
## 5            06D
## 6            10F
##                                                                                                                                                                                                                                                             VIOLATION.DESCRIPTION
## 1                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 2                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 3                                                                                                                                                                                                            Proper sanitization not provided for utensil ware washing operation.
## 4                                                                                                                                              Cold food item held above 41쨘 F (smoked fish and reduced oxygen packaged foods above 38 쨘F) except during necessary preparation.
## 5                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
##   CRITICAL.FLAG SCORE GRADE GRADE.DATE RECORD.DATE                          INSPECTION.TYPE
## 1  Not Critical    11     A 03/17/2017  11/18/2021    Cycle Inspection / Initial Inspection
## 2  Not Critical    29  <NA>       <NA>  11/18/2021    Cycle Inspection / Initial Inspection
## 3  Not Critical    22  <NA>       <NA>  11/18/2021    Cycle Inspection / Initial Inspection
## 4      Critical    11     A 02/21/2020  11/18/2021 Pre-permit (Operational) / Re-inspection
## 5      Critical    32  <NA>       <NA>  11/18/2021    Cycle Inspection / Initial Inspection
## 6  Not Critical    18  <NA>       <NA>  11/18/2021    Cycle Inspection / Initial Inspection
##   Latitude Longitude Community.Board Council.District Census.Tract     BIN        BBL  NTA
## 1 40.79753 -73.94031             111                8        18800 1052516 1016650002 MN34
## 2 40.72053 -73.98544             103                2         3001 1077440 1003550070 MN27
## 3 40.73740 -73.87729             404               25        47300 4045397 4018427502 QN29
## 4 40.70062 -73.94281             301               33        50700 3061516 3022760035 BK75
## 5 40.81208 -73.95144             109                9        21303 1059391 1019540050 MN06
## 6 40.66778 -73.94236             309               35        33500 3033372 3012780141 BK63

We will first drop the columns for which the creator has not provided any information as well as those columns which we believe are not useful for drawing insights.

##      CAMIS                      DBA      BORO             STREET ZIPCODE CUISINE.DESCRIPTION
## 1 41393637       DELICIAS MEXICANAS Manhattan           3 AVENUE   10029             Mexican
## 2 50013433                BALVANERA Manhattan     STANTON STREET   10002             Spanish
## 3 50008129   COCO FRESH TEA & JUICE    Queens           BROADWAY   11373          Coffee/Tea
## 4 50100295      EL BURRITO NACHO II  Brooklyn    FLUSHING AVENUE   11206             Mexican
## 5 41696471 MAISON HARLEM RESTAURANT Manhattan ST NICHOLAS AVENUE   10027              French
## 6 50098871            NORIBAR SUSHI  Brooklyn    KINGSTON AVENUE   11213       Jewish/Kosher
##   INSPECTION.DATE                                          ACTION VIOLATION.CODE
## 1      03/17/2017 Violations were cited in the following area(s).            10D
## 2      05/21/2019 Violations were cited in the following area(s).            10D
## 3      10/15/2018 Violations were cited in the following area(s).            10H
## 4      02/21/2020 Violations were cited in the following area(s).            02G
## 5      01/28/2020 Violations were cited in the following area(s).            06D
## 6      08/30/2021 Violations were cited in the following area(s).            10F
##                                                                                                                                                                                                                                                             VIOLATION.DESCRIPTION
## 1                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 2                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 3                                                                                                                                                                                                            Proper sanitization not provided for utensil ware washing operation.
## 4                                                                                                                                              Cold food item held above 41쨘 F (smoked fish and reduced oxygen packaged foods above 38 쨘F) except during necessary preparation.
## 5                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
##   CRITICAL.FLAG SCORE GRADE                          INSPECTION.TYPE Latitude Longitude
## 1  Not Critical    11     A    Cycle Inspection / Initial Inspection 40.79753 -73.94031
## 2  Not Critical    29  <NA>    Cycle Inspection / Initial Inspection 40.72053 -73.98544
## 3  Not Critical    22  <NA>    Cycle Inspection / Initial Inspection 40.73740 -73.87729
## 4      Critical    11     A Pre-permit (Operational) / Re-inspection 40.70062 -73.94281
## 5      Critical    32  <NA>    Cycle Inspection / Initial Inspection 40.81208 -73.95144
## 6  Not Critical    18  <NA>    Cycle Inspection / Initial Inspection 40.66778 -73.94236

Let us rename each column to its shorter version for easier access

We will separate out inspection type into two columns denoting inspection program and inspection type

##      camis                      dba      boro             street   zip       cuisine  insp.date
## 1 41393637       DELICIAS MEXICANAS Manhattan           3 AVENUE 10029       Mexican 03/17/2017
## 2 50013433                BALVANERA Manhattan     STANTON STREET 10002       Spanish 05/21/2019
## 3 50008129   COCO FRESH TEA & JUICE    Queens           BROADWAY 11373    Coffee/Tea 10/15/2018
## 4 50100295      EL BURRITO NACHO II  Brooklyn    FLUSHING AVENUE 11206       Mexican 02/21/2020
## 5 41696471 MAISON HARLEM RESTAURANT Manhattan ST NICHOLAS AVENUE 10027        French 01/28/2020
## 6 50098871            NORIBAR SUSHI  Brooklyn    KINGSTON AVENUE 11213 Jewish/Kosher 08/30/2021
##                                            action vio.code
## 1 Violations were cited in the following area(s).      10D
## 2 Violations were cited in the following area(s).      10D
## 3 Violations were cited in the following area(s).      10H
## 4 Violations were cited in the following area(s).      02G
## 5 Violations were cited in the following area(s).      06D
## 6 Violations were cited in the following area(s).      10F
##                                                                                                                                                                                                                                                                          vio.desc
## 1                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 2                                                                      Mechanical or natural ventilation system not provided,  improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes.
## 3                                                                                                                                                                                                            Proper sanitization not provided for utensil ware washing operation.
## 4                                                                                                                                              Cold food item held above 41쨘 F (smoked fish and reduced oxygen packaged foods above 38 쨘F) except during necessary preparation.
## 5                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
##           flag score grade                insp.prog          insp.type latitude longitude
## 1 Not Critical    11     A         Cycle Inspection Initial Inspection 40.79753 -73.94031
## 2 Not Critical    29  <NA>         Cycle Inspection Initial Inspection 40.72053 -73.98544
## 3 Not Critical    22  <NA>         Cycle Inspection Initial Inspection 40.73740 -73.87729
## 4     Critical    11     A Pre-permit (Operational)      Re-inspection 40.70062 -73.94281
## 5     Critical    32  <NA>         Cycle Inspection Initial Inspection 40.81208 -73.95144
## 6 Not Critical    18  <NA>         Cycle Inspection Initial Inspection 40.66778 -73.94236

Consider the unique values in borough

## 
##             0           210         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##           118             2         35564         97059        149598         88597         12261

‘0’ borough is vague, we’ll drop these rows:

## [1] "There are 118 rows with 0 as the borough"

Consider the unique values in grade column:

## 
## 01/03/2019 01/03/2020          A          B          C          G          N          P 
##          1          1     152011      23514       9222          1       4889       2487 
##          Z 
##       1324

As per the data dictionary, grade value ‘G’ is not defined. Since only one row contains this, we can get rid of this row.

##       camis         dba        boro      street         zip     cuisine   insp.date      action 
##   "integer" "character" "character" "character" "character" "character" "character" "character" 
##    vio.code    vio.desc        flag       score       grade   insp.prog   insp.type    latitude 
## "character" "character" "character" "character" "character" "character" "character"   "numeric" 
##   longitude 
##   "numeric"
##       camis         dba        boro      street         zip     cuisine   insp.date      action 
##    "factor"    "factor"    "factor" "character"    "factor"    "factor"      "Date"    "factor" 
##    vio.code    vio.desc        flag       score       grade   insp.prog   insp.type    latitude 
##    "factor"    "factor"    "factor" "character"    "factor"    "factor"    "factor"   "numeric" 
##   longitude 
##   "numeric"

We will now drop the duplicate rows

## [1] "Number of rows originally: 383080"
## [1] "Number of rows after removing duplicates: 360458"

We will replace lengthy categories in ‘action’ by keywords

## 
## Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed. 
##                                                                                                                              14427 
##                                                                                                  Establishment re-closed by DOHMH. 
##                                                                                                                                 65 
##                                                                                                  Establishment re-opened by DOHMH. 
##                                                                                                                               3867 
##                                                                        No violations were recorded at the time of this inspection. 
##                                                                                                                               4463 
##                                                                                    Violations were cited in the following area(s). 
##                                                                                                                             333309
## 
##        closed      reclosed      reopened no violations    violations 
##         14427            65          3867          4463        333309

Let us now save this cleaned dataset