Exploratory Analysis of Market Health Violations

by: Eddie Liu, Bryan Kim, Kei Hagiwara, Eddy Lara

1) Background Information

Introduction

"This dataset contains Environmental Health Inspection Results for Restaurants and Markets in the City of Los Angeles. The Los Angeles County Environmental Health is responsible for inspections and enforcement activities for all unincorporated areas and 85 of the 88 cities in the County. This dataset combines some of the fields from the County's inspection and violation data, and is filtered to include only facilities in the City of Los Angeles." https://data.lacounty.gov/Health/LOS-ANGELES-COUNTY-RESTAURANT-AND-MARKET-INSPECTIO/6ni6-h5kp

Guiding Thoughts

Our analysis will focus mainly on questions regarding health scores of establishments. Some guiding questions that inspired us to choose this dataset include: Is there a geographic pattern to restaurants receiving certain scores? What are the most common health scores? Is there a relationship between the Yelp rating of restaurants and the health scores? Another primary incentive that propels us to research and analyze this dataset is the notion of associating health scores with the probability of contracting a food-borne illness at restaurants. It is without much doubt that most of the general population would prefer not to contract such illnesses when going out to eat, which is why analyzing trends in health scores is important. A research article called the “Impact of Restaurant Hygiene Grade Cards on Foodborne-Disease Hospitalizations in Los Angeles County” found that there was a “significant decrease in foodborne-disease hospitalizations in Los Angeles County following the introduction of grade cards.” The authors found that the restaurant health cards were associated with a 13.1% decrease in the amount of people hospitalized with food-borne illnesses. From this, we see the importance in the role health cards have in association with the decreasing amount of food-borne illnesses. It is of much interest to us to analyze the trends in these scores in order to promote the wellbeing of the general population that frequents these establishments.

Overview of Health Inspections

The Los Angeles County Environmental Health acts as an enforcement agency for the Department of Health in the County of Los Angeles. Its enforcers are commonly known as “Health Inspectors” (though their official title is “Environment Health Specialist”) and these are the individuals that carry out the inspections of each establishment and assign them a letter grade. Each food inspection begins with 100 points. As a health inspector conducts an inspection, the appropriate sub-categories in the Critical Risk Factors and the Good Retail Practices Section on the Retail Food Official Inspection Report (FOIR) will be marked if appropriate. The card issued by the health inspector will be based upon the score received on the Retail Food Official Inspection Report. 100 - 90 points is an A, 89 - 80 points is a B, 79 - 70 points is a C, and 69 - 0 receive no score. A full description of the report can be found here http://publichealth.lacounty.gov/eh/docs/WhatsNew/FOIR-RefGuide.pdf. Our report will focus on those restaurants in 2018, with special emphasis on the establishments that received a B or C score as these locations may be more prone to carrying illnesses. Restaurants with a B score rating have “Generally good food handling practices and overall food facility maintenance” whereas restaurants with a C score rating have “Generally acceptable food handling practices and overall general food facility maintenance.”

According to LA Public Health: “A facility can be inspected from 1 to 3 times a year. The frequency is based on the public health risk associated with the food products served, the methods of food preparation, and the operational history of the food facility.” Thus we assume that in our dataset that the maximum number of times a restaurant is graded is 3 times per year. Each row in the dataset represents a unique instance of a violation. For example: a restaurant that has received 7 violations in an inspection will have 7 rows in the dataset, each with a unique identifier (record id).

Variable Discussion

1) "activity_date”. This column of our dataset represents the time and date that the restaurant was inspected. The date range for this data is from 2015 to 2018. For the purposes of our analysis, we will only look at restaurants with an activity date in 2018, and we will disregard the time as most entries have the standardized "12:00:00 AM" timestamp.

2) “facility_name”. This represents the name of the establishment that was inspected. We will be working with around 2,000 unique establishments in this dataset.

3) "violation_code". These violation codes are listed from the Los Angeles Department of Public Health website, with each code corresponding to a unique violation description.

4) "violation_description". These are the standards that all restaurants and markets are inspected to in Los Angeles. If a restaurant receives the violation as described by the violation description, it means the restaurant did not meet that standard. Ex: "# 14. Food contact surfaces: clean and sanitized" means a restaurant's food contact surfaces were not cleaned and sanitized to a satisfactory degree.

5) "points". This column represents the severity of the violation of the business. Each row represents a separate violation and those points correspond to a specific violation, regardless of whether or not the restaurant had multiple violations. The greater the point value, the more severe the violation.

6) "grade". This is the overall health grade that the restaurant ended up with at the time of the violation. Referring to the explanation above, a ranking of A represents a score of 100-90, a rating of B represents a score of 89-80, and a ranking of C represents a score of 79-70. Anything below 70 would not receive a score.

7) "facility_address". This is the address of the establishment registered with the City of Los Angeles that indicates the location of the restaurant.

8) "facility_zip". This is the zip code of the physical location of the establishment. Originally, the data included the city and state but every single entry in this particular dataset is Los Angeles, CA so these two columns were dropped from our dataset.

9) "pe_description". This is the type of establishment categorized by risk factor and size. For example, this includes whether the establishment is a restaurant, market, caterer etc. The size and risk factor are determined by the Los Angeles Department of Public Health.

10) "record_id". This represents each row in the dataset.

11) "score". This is the overall health score that the location earned at the time of the violation. The scores are calculated as follows: 100 - the sum of violation points found during the inspection. Major violations are 4 points and minor violations are 2 point deductions. A 0 point deduction doesn't constitute a deduction in the negative sense; it represents a field which the inspector marks down as a note. Major violations constitute an immediate health hazard that requires immediate correction and possible closure of the restaurant. Minor violations don’t constitute an immediate health hazard, but they do warrant necessary corrections in the future. If the inspector deems that the restaurant follows good retail practices despite the violations, the minor violation will only be a 1 point deduction. These are low risk violations that typically represent corrections to good health practices.

12) "rating". This represents the Yelp rating of the establishment. This is on a scale from 1.0 - 5.0 with 1.0 representing the worst end of the scoring spectrum and 5.0 representing the best end. This column was pulled from Yelp’s website.

13) "price_category". This is the price category of the establishment which was also pulled from Yelp’s website. A single dollar sign represents a generally cheaper location and a quadruple dollar sign represents a generally more expensive location, with a double and triple dollar sign representing the middle of the spectrum.

14) "type". This is the type of the establishment. This includes the ethnicities (ex: Chinese, Mexican, Hawaiian, etc.) or other descriptions (ex: Diner, Fast Food, Coffee, etc.), depending on the first Yelp description pulled.

15) "year". This is the year which the violation took place. We will be focusing solely on 2018 entries.

Limitations

As with any analysis, there are certain limitations that may inhibit inspection of certain areas. For our project, we are limited by how much the violation columns tell us about the specific violation each establishment received. We are only given the general category of the violation and not the specific violation (which can be found on http://publichealth.lacounty.gov/eh/docs/WhatsNew/FOIR-RefGuide.pdf). Another limitation is how the Yelp API does not allow us to scrape all of the establishments (we were only able to get about 2203 out of around 9000 establishments in Los Angeles). In addition to this, the "type" column extracted from Yelp only extracts the first description from a Yelp entry. Ex: El Carmen is labeled as 'Mexican' and 'Cocktail Bars' on Yelp but would only be classified as 'Mexican' under "type". Toca Madera is labeled as 'Cocktail Bars', 'Mexican', and 'Breakfast & Brunch' on Yelp but would only be classified as 'Cocktail Bars' under "type". This is due to how the establishment owner wishes to categorize their business. One owner may think that putting 'Mexican' first increases business while another owner may think the same by putting 'Cocktail Bars' first.

El Carmen: https://www.yelp.com/biz/el-carmen-los-angeles?osq=mexican+bar

Toca Madera: https://www.yelp.com/biz/toca-madera-los-angeles?osq=mexican+bar

Getting Data from Yelp's API

Note that the following data fetching code using the Yelp API has not been run for our submission output since it is VERY time consuming. We ran the code once in order to output a dataframe which we saved locally as a csv file. We then loaded the csv file for our analyses later on.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
violations = pd.read_csv("Restaurant_and_Market_Health_Violations.csv")

dropcols = ['serial_number','violation_status','facility_id', 'facility_city', 'facility_state',
       'employee_id', 'owner_id', 'owner_name', 'program_element_pe', 'program_name',
       'program_status', 'record_id', 'service_code', 'row_id']

df = violations.drop(columns=dropcols)
In [ ]:
df["year"] = pd.to_datetime(df["activity_date"]).dt.year
df_2018 = df[df["year"] == 2018]
In [ ]:
import requests
import json


tt = df_2018[["facility_name","facility_address"]]
yy = tt.drop_duplicates(subset='facility_name').reset_index(drop=True)
address = yy['facility_address']
names = yy['facility_name']
In [ ]:
api_key='PYQ62khmD2Sjrfmd5ZCzI7tifXh5QhhdvjyZJB3IVdDaJWBnwyDeMFEa8xgfJv1ZsjU-pNJdLqNTKiK3yIbkFreySmVpnytNNWTa2VrcdNTAgDV-ZuapEGolJr3cXXYx'
headers = {'Authorization': 'Bearer %s' % api_key}

Here we find the IDs and addresses for as many of our businesses there are on Yelp.

In [ ]:
url='https://api.yelp.com/v3/businesses/matches'

add = pd.Series([]) #restaurant address
rid = pd.Series([]) #restaurant id
for i in range(4000): #split the for loop into two because of Yelp API's daily fetching limit of 5000
    if i % 50 == 0: #used to track progress of loop. Each loop took approximately TWO hours.
        print(i)
    params={'name':names[i],"address1": address[i],"address2": "","address3": "",
     "city": "Los Angeles",
     "state": "CA",
     "country": "US"}
    req=requests.get(url, params=params, headers=headers)
    a = json.loads(req.text)
    try:
        b = a["businesses"]
        rid[i] = b[0]['id'] #Get restaurant ID
        add[i] = b[0]['location']['address1'] #Get restaurant address
    except: #If an error occurs or the restaurant does not exist on Yelp, we label it as "Cannot Find"
        rid[i] = "Cannot Find"
        add[i] = "Cannot Find"  
In [ ]:
#2nd key
api_key2='xMEMOIYXVK3qxLtT_x0WbyQcah3ZP5dqmRM2cPYrFKZ0KJA5UvwDmHbJtZwJP2GS4HqlvQ63iKgUkfpvofRak9A0_lOpSUERY66rZxUfKb3dG6_5ZkcrHYJgB7_cXXYx'
headers2 = {'Authorization': 'Bearer %s' % api_key2} 
In [ ]:
#2nd loop
url='https://api.yelp.com/v3/businesses/matches'

for i in range(4000,8517):
    if i % 50 == 0:
        print(i)
    params={'name':names[i],"address1": address[i],"address2": "","address3": "",
     "city": "Los Angeles",
     "state": "CA",
     "country": "US"}
    req=requests.get(url, params=params, headers=headers2)
    a = json.loads(req.text)
    try:
        b = a["businesses"]
        rid[i] = b[0]['id']
        add[i] = b[0]['location']['address1']
    except:
        rid[i] = "Cannot Find"
        add[i] = "Cannot Find"      
In [ ]:
url = 'https://api.yelp.com/v3/businesses/'+rid[8516] #Ethnicity always shows up as the first category
req=requests.get(url, headers=headers)
a = json.loads(req.text)
a
In [ ]:
#3rd key
api_key='itYKMeAN8zm-Y5_bHjaPJPRgLyr5M74mrr1qO1cPwXEAn4gGYi0twN_dgCiimGc0Ni3xGUZWjA-QMkRDWsENiAnPZqYkxVFZgTNxbVA2tT0J1e-cBZB3_ATsqrXdXXYx'
headers = {'Authorization': 'Bearer %s' % api_key} 

Now, using the IDs and address, we collect restaurant name, rating, price, and type from yelp's business search API

In [ ]:
url = 'https://api.yelp.com/v3/businesses/'

restaurant = pd.Series([])
rating = pd.Series([])
price = pd.Series([])
typeofrest = pd.Series([])
for i in range(4000): #again split the loop into two
    if i % 50 == 0: #used to track progress of loop. Each loop took approximately THREE hours.
        print(i)
    if rid[i] == "Cannot Find":
        restaurant[i] = ("Cannot Find")
        rating[i] = ("Cannot Find")
        price[i] = ("Cannot Find")
        typeofrest[i] = ("Cannot Find")
    else:
        url = 'https://api.yelp.com/v3/businesses/'+rid[i]
        req=requests.get(url, headers=headers)
        a = json.loads(req.text)
        try:
            b = a['location']['address1'] 
            if add[i] == b: #Here we make sure that the business we fetch from Yelp API matches the one we collected earlier
                try:
                    restaurant[i] = a["name"]
                    rating[i] = a["rating"]
                    price[i] = a["price"]
                    typeofrest[i] = a["categories"][0]["title"]
                except: 
                    restaurant[i] = ("Cannot Find")
                    rating[i] = ("Cannot Find")
                    price[i] = ("Cannot Find")
                    typeofrest[i] = ("Cannot Find")
            else:
                restaurant[i] = ("Cannot Find")
                rating[i] = ("Cannot Find")
                price[i] = ("Cannot Find")
                typeofrest[i] = ("Cannot Find")
        except:
            restaurant[i] = ("Cannot Find")
            rating[i] = ("Cannot Find")
            price[i] = ("Cannot Find")
            typeofrest[i] = ("Cannot Find")
In [ ]:
#4th key
api_key='VistynnaqvQR5dCHJtJugLBsgMKYcUnkv0cxOXphGqZ3sgHS2QCrZokS9pWncR_zQgqefkVxjvzlz_BPKlQO7ch0Z8QypcQcIAZgxYW42QMNvdTb-ay7KnvmpGzYXXYx'
headers = {'Authorization': 'Bearer %s' % api_key} 
In [ ]:
#2nd loop
url = 'https://api.yelp.com/v3/businesses/'

for i in range(4000,8517):
    if i % 50 == 0:
        print(i)
    if rid[i] == "Cannot Find":
        restaurant[i] = ("Cannot Find")
        rating[i] = ("Cannot Find")
        price[i] = ("Cannot Find")
        typeofrest[i] = ("Cannot Find")
    else:
        url = 'https://api.yelp.com/v3/businesses/'+rid[i]
        req=requests.get(url, headers=headers)
        a = json.loads(req.text)
        try:
            b = a['location']['address1']
            if add[i] == b:
                try:
                    restaurant[i] = a["name"]
                    rating[i] = a["rating"]
                    price[i] = a["price"]
                    typeofrest[i] = a["categories"][0]["title"]
                except: 
                    restaurant[i] = ("Cannot Find")
                    rating[i] = ("Cannot Find")
                    price[i] = ("Cannot Find")
                    typeofrest[i] = ("Cannot Find")
            else:
                restaurant[i] = ("Cannot Find")
                rating[i] = ("Cannot Find")
                price[i] = ("Cannot Find")
                typeofrest[i] = ("Cannot Find")
        except:
            restaurant[i] = ("Cannot Find")
            rating[i] = ("Cannot Find")
            price[i] = ("Cannot Find")
            typeofrest[i] = ("Cannot Find")

Output our dataframe to a csv file

In [ ]:
pulled = pd.DataFrame(list(zip(restaurant, rating, price, typeofrest)))
pulled.index = rating.index

pulled.to_csv('yelp_all2018.csv')

First we load the original data set and verify it loaded correctly.

In [240]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportions_chisquare

data = pd.read_csv('Restaurant_and_Market_Health_Violations.csv')
pd.set_option('display.max_columns', None)
In [241]:
data.head()
Out[241]:
serial_number activity_date facility_name violation_code violation_description violation_status points grade facility_address facility_city facility_id facility_state facility_zip employee_id owner_id owner_name pe_description program_element_pe program_name program_status record_id score service_code service_description row_id
0 DA08R0TCU 03/30/2018 12:00:00 AM KRUANG TEDD F030 # 30. Food properly stored; food storage conta... OUT OF COMPLIANCE 1 A 5151 HOLLYWOOD BLVD LOS ANGELES FA0064949 CA 90027 EE0000857 OW0001034 5151 HOLLYWOOD LLC RESTAURANT (31-60) SEATS HIGH RISK 1635 KRUANG TEDD ACTIVE PR0031205 92 1 ROUTINE INSPECTION DA08R0TCUF030
1 DA08R0TCU 03/30/2018 12:00:00 AM KRUANG TEDD F027 # 27. Food separated and protected OUT OF COMPLIANCE 1 A 5151 HOLLYWOOD BLVD LOS ANGELES FA0064949 CA 90027 EE0000857 OW0001034 5151 HOLLYWOOD LLC RESTAURANT (31-60) SEATS HIGH RISK 1635 KRUANG TEDD ACTIVE PR0031205 92 1 ROUTINE INSPECTION DA08R0TCUF027
2 DA08R0TCU 03/30/2018 12:00:00 AM KRUANG TEDD F035 # 35. Equipment/Utensils - approved; installed... OUT OF COMPLIANCE 1 A 5151 HOLLYWOOD BLVD LOS ANGELES FA0064949 CA 90027 EE0000857 OW0001034 5151 HOLLYWOOD LLC RESTAURANT (31-60) SEATS HIGH RISK 1635 KRUANG TEDD ACTIVE PR0031205 92 1 ROUTINE INSPECTION DA08R0TCUF035
3 DA08R0TCU 03/30/2018 12:00:00 AM KRUANG TEDD F033 # 33. Nonfood-contact surfaces clean and in go... OUT OF COMPLIANCE 1 A 5151 HOLLYWOOD BLVD LOS ANGELES FA0064949 CA 90027 EE0000857 OW0001034 5151 HOLLYWOOD LLC RESTAURANT (31-60) SEATS HIGH RISK 1635 KRUANG TEDD ACTIVE PR0031205 92 1 ROUTINE INSPECTION DA08R0TCUF033
4 DA08R0TCU 03/30/2018 12:00:00 AM KRUANG TEDD F029 # 29. Toxic substances properly identified, st... OUT OF COMPLIANCE 1 A 5151 HOLLYWOOD BLVD LOS ANGELES FA0064949 CA 90027 EE0000857 OW0001034 5151 HOLLYWOOD LLC RESTAURANT (31-60) SEATS HIGH RISK 1635 KRUANG TEDD ACTIVE PR0031205 92 1 ROUTINE INSPECTION DA08R0TCUF029

Next, we load the Yelp dataset and verify that it loaded cleanly.

In [242]:
yelp = pd.read_csv('yelp_all2018.csv')
yelp['0'] = yelp['0'].str.upper()
yelp = yelp.drop(columns = ['Unnamed: 0'])
yelp = yelp.rename(columns={"0": "facility_name", "1": "rating", '2': 'price_category', '3': 'type'})
yelp.head()
Out[242]:
facility_name rating price_category type
0 KRUANG TEDD 3.5 $$ Thai
1 SPROUTS FARMERS MARKET 3.0 $$ Grocery
2 LOS ANGELES NEW YORK PIZZA 3.0 $ Pizza
3 CANNOT FIND Cannot Find Cannot Find Cannot Find
4 CREPE X-PRESS CAFE 4.5 $ Creperies

We merge the two datasets on the 'facility_name' column using a left join.

In [243]:
#Combine the two datasets and sort alphabetically by facility name
data1 = pd.merge(data, yelp, on='facility_name', how='left')
data1 = data1.sort_values(by = ['facility_name'])

data1.head() 
Out[243]:
serial_number activity_date facility_name violation_code violation_description violation_status points grade facility_address facility_city facility_id facility_state facility_zip employee_id owner_id owner_name pe_description program_element_pe program_name program_status record_id score service_code service_description row_id rating price_category type
118154 DAJFACZWZ 02/09/2017 12:00:00 AM #1 CAFE F030 # 30. Food properly stored; food storage conta... OUT OF COMPLIANCE 1 A 2080 CENTURY PARK E STE 108 LOS ANGELES FA0056432 CA 90067 EE0000375 OW0000002 #1 CAFE INC RESTAURANT (0-30) SEATS HIGH RISK 1632 #1 CAFE ACTIVE PR0045100 90 1 ROUTINE INSPECTION DAJFACZWZF030 NaN NaN NaN
8924 DACZXQ74W 03/01/2018 12:00:00 AM #1 CAFE F007 # 07. Proper hot and cold holding temperatures OUT OF COMPLIANCE 2 A 2080 CENTURY PARK E STE 108 LOS ANGELES FA0056432 CA 90067 EE0000375 OW0000002 #1 CAFE INC RESTAURANT (0-30) SEATS HIGH RISK 1632 #1 CAFE ACTIVE PR0045100 90 1 ROUTINE INSPECTION DACZXQ74WF007 NaN NaN NaN
8925 DACZXQ74W 03/01/2018 12:00:00 AM #1 CAFE F033 # 33. Nonfood-contact surfaces clean and in go... OUT OF COMPLIANCE 1 A 2080 CENTURY PARK E STE 108 LOS ANGELES FA0056432 CA 90067 EE0000375 OW0000002 #1 CAFE INC RESTAURANT (0-30) SEATS HIGH RISK 1632 #1 CAFE ACTIVE PR0045100 90 1 ROUTINE INSPECTION DACZXQ74WF033 NaN NaN NaN
118147 DAJFACZWZ 02/09/2017 12:00:00 AM #1 CAFE F035 # 35. Equipment/Utensils - approved; installed... OUT OF COMPLIANCE 1 A 2080 CENTURY PARK E STE 108 LOS ANGELES FA0056432 CA 90067 EE0000375 OW0000002 #1 CAFE INC RESTAURANT (0-30) SEATS HIGH RISK 1632 #1 CAFE ACTIVE PR0045100 90 1 ROUTINE INSPECTION DAJFACZWZF035 NaN NaN NaN
8927 DACZXQ74W 03/01/2018 12:00:00 AM #1 CAFE F036 # 36. Equipment, utensils and linens: storage ... OUT OF COMPLIANCE 1 A 2080 CENTURY PARK E STE 108 LOS ANGELES FA0056432 CA 90067 EE0000375 OW0000002 #1 CAFE INC RESTAURANT (0-30) SEATS HIGH RISK 1632 #1 CAFE ACTIVE PR0045100 90 1 ROUTINE INSPECTION DACZXQ74WF036 NaN NaN NaN

Notice that rating, price_category, and type have NaN's, meaning they were not listed/were not accessible on Yelp.

Cleaning the Dataset

  • Dropping columns
  • Adding new columns
  • Reindexing
  • Tidying up fields
  • Dropping duplicates
  • Drop NaN's
  • Fixing zipcodes with 4 extra numbers (e.g. 90024-xxxx)
  • Variable creation
In [244]:
#Reindex
data1 = data1.reset_index()

#Create date column
data1['year'] = data1['activity_date'].str[6:10]

#Remove unnecessary columns
data1 = data1.drop(columns = ['index','serial_number','violation_status','facility_id', 'facility_city', 'facility_state',
       'employee_id', 'owner_id', 'owner_name', 'program_element_pe', 'program_name',
       'program_status', 'service_code', 'row_id','service_description'])
In [245]:
#Tidy year column
data1['year'] = pd.to_numeric(data1['year'])
data1['year'].dtype
Out[245]:
dtype('int64')
In [246]:
#Drop duplicates
data1 = data1.drop_duplicates(subset=['facility_name','activity_date','violation_code','facility_address'])
In [247]:
#Standardize zipcodes
data1['facility_zip'] = data1['facility_zip'].str[0:5]
data1['facility_zip'].unique()
Out[247]:
array(['90067', '90044', '90003', '90004', '90022', '90034', '90061',
       '90020', '90008', '90033', '90045', '90017', '90012', '90011',
       '90036', '90068', '90002', '90015', '90063', '90037', '90018',
       '90027', '90006', '90007', '90035', '90010', '90056', '90071',
       '90028', '90005', '90066', '90026', '90064', '90029', '90062',
       '90043', '90013', '90041', '90039', '90042', '90014', '90057',
       '90031', '90047', '90065', '90019', '90016', '90021', '90025',
       '90046', '90038', '90059', '90024', '90032', '90058', '90023',
       '90001', '90048', '90049', '90220', '90069', '90040', '90077',
       '90095', '90079', '90089', '91324', '90222', '90291', '92880',
       '90094', '90232', '90230', '91307', '90921', '90054', '91601',
       '90030', '90103', '90255', '90070', '91604', '90052', '91602',
       '90293', '91505', '91325', '90211', '90272', '91326', '90051',
       '90201', '90292', '91340', '90723', '91401', '91367', '91006'],
      dtype=object)

We will only be focusing on entires in 2018 that our yelp API could pull

In [248]:
data = data1.loc[data1['year'] == 2018]
data = data.reset_index()
data = data.drop(columns = 'index')

#Drop NaN's
data = data.dropna()
data = data.reset_index()
data = data.drop(columns = 'index')
data.shape #dimensions
Out[248]:
(19788, 15)

We decide to drop any NaN values in order to facilitate our analyses as 19,788 entries is plenty for the purposes of this project.

In [249]:
#Create a new column 'major' which indicates if a row is a major violation or not (1:major and 0:minor)
data['major'] = np.where(data['points']>=4, 1, 0)

#Create a new column 'num_rating' which is simply the rating column but in type float
data['num_rating'] = [float(i) for i in list(data["rating"])]

Here is a preview of our finalized dataset.

In [250]:
data.head()
Out[250]:
activity_date facility_name violation_code violation_description points grade facility_address facility_zip pe_description record_id score rating price_category type year major num_rating
0 07/30/2018 12:00:00 AM 101 COFFEE SHOP F014 # 14. Food contact surfaces: clean and sanitized 2 A 6145 FRANKLIN AVE 90068 RESTAURANT (61-150) SEATS HIGH RISK PR0013735 94 3.5 $$ Diners 2018 0 3.5
1 07/30/2018 12:00:00 AM 101 COFFEE SHOP F039 # 39. Wiping cloths: properly used and stored 1 A 6145 FRANKLIN AVE 90068 RESTAURANT (61-150) SEATS HIGH RISK PR0013735 94 3.5 $$ Diners 2018 0 3.5
2 07/30/2018 12:00:00 AM 101 COFFEE SHOP F043 # 43. Premises; personal/cleaning items; vermi... 1 A 6145 FRANKLIN AVE 90068 RESTAURANT (61-150) SEATS HIGH RISK PR0013735 94 3.5 $$ Diners 2018 0 3.5
3 04/25/2018 12:00:00 AM 101 COFFEE SHOP F014 # 14. Food contact surfaces: clean and sanitized 2 A 6145 FRANKLIN AVE 90068 RESTAURANT (61-150) SEATS HIGH RISK PR0013735 95 3.5 $$ Diners 2018 0 3.5
4 04/25/2018 12:00:00 AM 101 COFFEE SHOP F030 # 30. Food properly stored; food storage conta... 1 A 6145 FRANKLIN AVE 90068 RESTAURANT (61-150) SEATS HIGH RISK PR0013735 95 3.5 $$ Diners 2018 0 3.5
In [251]:
data.columns
Out[251]:
Index(['activity_date', 'facility_name', 'violation_code',
       'violation_description', 'points', 'grade', 'facility_address',
       'facility_zip', 'pe_description', 'record_id', 'score', 'rating',
       'price_category', 'type', 'year', 'major', 'num_rating'],
      dtype='object')
In [252]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19788 entries, 0 to 19787
Data columns (total 17 columns):
activity_date            19788 non-null object
facility_name            19788 non-null object
violation_code           19788 non-null object
violation_description    19788 non-null object
points                   19788 non-null int64
grade                    19788 non-null object
facility_address         19788 non-null object
facility_zip             19788 non-null object
pe_description           19788 non-null object
record_id                19788 non-null object
score                    19788 non-null int64
rating                   19788 non-null object
price_category           19788 non-null object
type                     19788 non-null object
year                     19788 non-null int64
major                    19788 non-null int32
num_rating               19788 non-null float64
dtypes: float64(1), int32(1), int64(3), object(12)
memory usage: 2.5+ MB

It is important to note that the "Score" column contains Health Scores extracted from the Los Angeles Department of Public Health whereas the "Rating" column contains ratings from Yelp users extracted from the Yelp website.

2) Exploratory Data Analysis

Variable Distributions

Let's examine the distribution of our variables as a high level overview before our in-depth analysis. These distributions represent the amount of violations by variable since each row of our dataset is a violation.

Violation Code

In [253]:
fig= plt.figure(figsize=(8,9))
data.violation_code.value_counts().plot(kind = "barh")
plt.ylabel('Violation Code')
plt.xlabel('Violations')
plt.title('Count of Violations by Violation Code')
plt.tight_layout()
plt.gca().invert_yaxis()
plt.show()
In [254]:
data.violation_code.value_counts().head()
Out[254]:
F044    2112
F033    1905
F035    1760
F036    1274
F043    1083
Name: violation_code, dtype: int64

The most common violation given out in 2018 is violation F044 which signifies "Floors, walls and ceiling: built, maintained and clean" on the reference guide (http://publichealth.lacounty.gov/eh/docs/WhatsNew/FOIR-RefGuide.pdf). This category is marked as out of compliance if anything about the structure of the building is unsatisfactory including but not limited to: chipped walls, missing outlets, unapproved floor boards on the ground, dirt, etc. F033 is the second most common violation and this is marked if 'nonfood contact surfaces' are unclean. This may include: dirty fan guards, food debris on shelves, and chemical buildup in the mechanical warewasher unit. F035 is the third most common violation and this category signifies anything out of the ordinary with respect to equipment and utensils. A violation can be handed out under this category for any of the following reasons: using uncertified equipment, using 'household only' wares, or using lead glazed containers. For a full list of violation reasons, please refer to the above reference link.

Points

In [255]:
data.points.value_counts().plot(kind = "bar")
plt.xticks(rotation = 0)
plt.xlabel('Violation Points')
plt.ylabel('Violations')
plt.title('Count of Violations by Violation Points')
plt.show()
In [256]:
data.points.value_counts()
Out[256]:
1     15037
2      3542
4       824
0       271
3        60
11       54
Name: points, dtype: int64

Most violations handed out were for minor violations (indicated by a 1 point violation). A 0 point violation indicates that the health inspector took down a note of some non-negative attribute for the establishment. An 11 point violation indicates a shutdown of that establishment due to a major health concern.

Major

In [257]:
data.groupby(['major'])['record_id'].count().plot(kind = 'bar')
plt.xticks(np.arange(2), labels = ('No', 'Yes'), rotation = 0)
plt.ylabel('Violations')
plt.xlabel('Major Violation')
plt.title('Count of Violations by Major Violations')
plt.show()
In [258]:
data.groupby(['major'])['record_id'].count()
Out[258]:
major
0    18910
1      878
Name: record_id, dtype: int64

We notice here that most violations were minor. This is a more categorized representation of the count of violations by violation points.

Description of Establishment

In [259]:
fig= plt.figure(figsize=(12,8))
data.pe_description.value_counts().plot(kind = "barh")
plt.xticks(rotation = 0)
plt.xlabel('Violations')
plt.ylabel('Establishment Description')
plt.title('Count of Violations by Establishment Description')
plt.gca().invert_yaxis()
plt.show()
In [260]:
data.pe_description.value_counts().head(4)
Out[260]:
RESTAURANT (0-30) SEATS HIGH RISK        5652
RESTAURANT (31-60) SEATS HIGH RISK       4622
RESTAURANT (61-150) SEATS HIGH RISK      3094
RESTAURANT (0-30) SEATS MODERATE RISK    3068
Name: pe_description, dtype: int64

Most violations were handed out to high risk restaurants with seating capacities of 150 or below, with more violations handed out to restaurants with lower seating capacities. This is most likely due to the high risk nature of smaller capacity restaurants with regards to their smaller infrastructure and reduced operating area.

Price Category

In [261]:
price = data.price_category.value_counts()
price_df = pd.DataFrame(list(price)).plot(kind = "bar", legend = None)
price_df.set_xticklabels(["\$$","\$","\$$\$","\$$\$\$"])
plt.xticks(rotation = 0)
plt.xlabel('Price Category')
plt.ylabel('Violations')
plt.title('Count of Violations by Price Category')
plt.show()
In [262]:
data.price_category.value_counts()
Out[262]:
$$      10101
$        8766
$$$       729
$$$$      192
Name: price_category, dtype: int64

Most violations were handed out to establishments with a 2 dollar and 1 dollar sign on Yelp. Pricier establishments may maintain higher standards including health standards which could explain the lower amount of violations for them.

Establishment Type (count of violations)

In [263]:
fig= plt.figure(figsize=(23,8))
data.type.value_counts().plot(kind = "bar")
plt.tight_layout()
plt.ylabel('Violations')
plt.xlabel('Establishment Type')
plt.title('Count of Violations by Establishment Type')
plt.show()
In [264]:
data.type.value_counts().head(10).plot(kind = "barh")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.xlabel('Violations')
plt.ylabel('Establishment Type')
plt.title('Count of Violations by Establishment Type')
plt.show()

Above we examine the number of violations by establishment type. Most violations handed out in 2018 were to Mexican, Chinese, Coffee & Tea, Korean, and Bakery type establishments with Mexican establishments having a noticeable top spot. Let us examine this again by normalizing this through looking at the ratio of violation count to establishment count by each establishment type.

In [265]:
fig= plt.figure(figsize=(29,8))
plt.tight_layout()
#number of violations
(data.groupby(['type'])['record_id'].count().sort_values(ascending = False)\
/data.groupby(['type'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).plot(kind = 'bar')
#number of establishments
plt.ylabel('Violations/Establishments')
plt.xlabel('Establishment Type')
plt.title('Violations/Establishments by Establishment Type')
plt.show()
In [266]:
(data.groupby(['type'])['record_id'].count().sort_values(ascending = False)\
/data.groupby(['type'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).head(10).plot(kind = 'barh')
plt.gca().invert_yaxis()
plt.xlabel('Violations/Establishments')
plt.ylabel('Establishment Type')
plt.title('Violations/Establishments by Establishment Type')
plt.show()

When we examine the proportion of violations to the amount of establishments by establishment type, we see a new set of rankings. Establishments categorized as "Stadiums & Arenas" rank the highest with a value of 39. This means that for every Stadium & Arena establishment, there are 39 violations associated with it. We also see "Fast Food" ranked in the top 10 as this category was not present when we only examined the amount of violations by establishment type. Mexican establishments have a value of 8.39 here which is not extreme when compared to the median. This means that although Mexican establishments have a very high number of individual violations, there is a relatively higher amount of Mexican establishments in general.

Establishment type (sum of violation points)

In [267]:
fig= plt.figure(figsize=(29,8))
plt.tight_layout()
data.groupby(['type'])['points'].sum().sort_values(ascending = False).plot(kind = 'bar')
plt.ylabel("Sum of Violation Points")
plt.xlabel("Establishment Type")
plt.title("Sum of Violation Points by Establishment Type")

plt.show()
In [268]:
data.groupby(['type'])['points'].sum().sort_values(ascending = True).tail(10).plot(kind = 'barh')
plt.xlabel("Sum of Violation Points")
plt.ylabel("Establishment Type")
plt.title("Sum of Violation Points by Establishment Type")

plt.show()

These are the top 10 establishment types with the highest number of violation points. Mexican, Chinese, and Coffee & Tea establishments have the highest count of violation points in 2018. The pattern here matches that of the top establishment types with the highest number of individual violations so we see a correlation between violations and violation points. Let's examine the data again if we normalize it by taking the ratio of violation points to number of establishments by each establishment type.

In [269]:
fig= plt.figure(figsize=(29,8))
plt.tight_layout()
#sum of violation points
(data.groupby(['type'])['points'].sum().sort_values(ascending = False)\
/data.groupby(['type'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).plot(kind = 'bar')
#number of establishments
plt.ylabel('Violation Points/Establishments')
plt.xlabel('Establishment Type')
plt.title('Violation Points/Establishments by Establishment Type')
plt.show()
In [270]:
#sum of violation points
(data.groupby(['type'])['points'].sum().sort_values(ascending = False)\
/data.groupby(['type'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).head(10).plot(kind = 'barh')
#number of establishments
plt.gca().invert_yaxis()
plt.xlabel('Violation Points/Establishments')
plt.ylabel('Establishment Type')
plt.title('Violation Points/Establishments by Establishment Type')
plt.show()

The establishments with the top proportions of sum of violation points to number of establishments generally match the establishments with the top proportions of individual violations to number of establishments. There are some changes however. Singaporean establishments ranked 5th with regards to individual violations:number of establishments but ranked 3rd in sum of violation points:number of establishments. This means that although they had relatively less individual violations, they had more severe violations which is measured by their higher sum of violation points.

It is important to note that there are some establishment types, such as Men's Clothing, Day Spas, and Home Decor (to name a few), listed that do not correlate with the selling of food/ drinks. The reason why they appear in our data is due to these establishments selling food/ drinks in one form or another, whether it be through primary or secondary means. As an example, the establishment categorized as "Home Decor" is Cost Plus World Market, which mainly sells furniture but also sells snacks and alcohol. As long as these establishments sell perishables they are subject to health inspections as well.

Score

In [271]:
data.score.hist(bins=30)
plt.xlabel('Health Score')
plt.title('Distribution of Health Score by Violations')
plt.ylabel('Violations')
plt.show()

Here we notice something interesting with the distribution of health scores by violations. There are peaks at health scores of 90 and 80. We hypothesize here that health inspectors are more likely to give these borderline scores in order to prevent the establishment from receiving the next grade down (ex: a 90 would be a borderline A and an 80 would be a borderline B). This could be a reflection of bias on the health inspectors' part.

Grade

In [303]:
grades = data.groupby(['grade'])['record_id'].count().sort_values(ascending=False)
grades.plot(kind = 'bar')
plt.title('Count of Violations by Grade')
plt.ylabel('Violations')
plt.xticks(rotation=0)
plt.xlabel('Grade')
plt.show()

We may want to see how the city of Los Angeles fairs on a holistic scale with regards to grade distributions. Here, grade A contains the most number of inspections (17713) followed by grade B (1861) and grade C (214). Thus for the year 2018, around 90% of inspections resulted in a grade A ranking which gives us some relief in choosing where to eat.

Rating

In [273]:
data.rating.value_counts().plot(kind = "bar")
plt.xticks(rotation = 0)
plt.ylabel('Violations')
plt.xlabel('Yelp Rating')
plt.title('Count of Violations by Yelp Rating')
plt.show()
In [274]:
data.rating.value_counts()
Out[274]:
4.0    6891
3.5    5133
4.5    3591
3.0    2271
2.5     845
2.0     463
5.0     359
1.5     206
1.0      29
Name: rating, dtype: int64

The majority of violations were handed out to establishments that had Yelp ratings between 3.0 and 4.5 stars. This is mainly due to how the majority of establishments fall between this range in terms of Yelp ratings.

Zipcode

In [275]:
fig= plt.figure(figsize=(25,7))
data.facility_zip.value_counts().plot(kind = "bar")
plt.ylabel('Violations')
plt.xlabel('Zip Code')
plt.title('Count of Violations by Zip Code')
plt.show()
In [276]:
data.groupby(['facility_zip'])['violation_code'].count().sort_values(ascending=False).head(10).plot(kind = 'bar')
plt.xticks(rotation = 30)
plt.ylabel('Count of Violations')
plt.xlabel('Zip Code')
plt.title('Count of Violations by Zip Code')

plt.show()

Right off the bat, we notice the top 10 areas (defined by zip codes) that contained the highest number of violations in 2018. 90012, 90048, 90020, 90005, and 90028 were the top 5. 90012 is the area with the highest amount of violations and we see that this is the zip code for Chinatown and Japantown. This inspired us to examine the facilities by the amount of violations received and to investigate the relationships between our other variables. Before doing so, we decide to normalize our analysis by examining the ratio of violations to number of establishments by zipcode.

In [277]:
fig= plt.figure(figsize=(25,7))
(data.groupby(['facility_zip'])['record_id'].count().sort_values(ascending = False)\
/data.groupby(['facility_zip'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).plot(kind = 'bar')
plt.ylabel('Violations/Establishments')
plt.xlabel('Zip Code')
plt.title('Violations/Establishments by Zip Code')
plt.show()
In [278]:
(data.groupby(['facility_zip'])['record_id'].count().sort_values(ascending = False)\
/data.groupby(['facility_zip'])['facility_name'].nunique().sort_values(ascending = False)).sort_values(ascending = False).head(10).plot(kind = 'bar')
plt.ylabel('Violations/Establishments')
plt.xlabel('Zip Code')
plt.title('Violations/Establishments by Zip Code')
plt.xticks(rotation = 30)
plt.show()

The zipcode with the highest ratio of number of violations to number of establishments by zip code is 90103. The zipcode for Westwood and UCLA (90024) is listed here, which prompts us to investigate this are further below. Chinatown/Japantown is not listed in the top 10 here; the zipcode for this area is ranked 21 out of 78 unique zipcodes.

Chinatown and Japantown

In [279]:
chinatown = data.loc[data.facility_zip == '90012']
chinatown.facility_name.nunique()
Out[279]:
97

97 unique establishments in Chinatown were handed health violations in 2018. Let us see which establishments received the most and least violations.

In [280]:
chinatown.groupby(['facility_name', 'facility_address'])['record_id'].count().sort_values(ascending=False).head(5)
Out[280]:
facility_name             facility_address                    
TORIGOYA                  123 ASTRONAUT E S ONIZUKA ST STE 203    25
KOURAKU RESTAURANT        314 E 2ND ST                            24
LUCKY DELI                706 N BROADWAY WAY                      18
OCEAN PACIFIC RESTAURANT  859 N BROADWAY                          18
FORTUNE GOURMET KITCHEN   727 N BROADWAY STE 215                  17
Name: record_id, dtype: int64
In [281]:
chinatown.groupby(['facility_name', 'facility_address'])['record_id'].count().sort_values(ascending=False).tail(5)
Out[281]:
facility_name          facility_address
WON WON MINI MARKET    700 N BROADWAY      2
SAKE DOJO              333 E 1ST ST        2
BAK'S KITCHEN          702 W 1ST ST        2
CAFECITO PURA VIDA     306 W 2ND ST        1
LITTLE TOKYO PHARMACY  402 E 2ND ST        1
Name: record_id, dtype: int64
In [282]:
table = pd.pivot_table(chinatown, values='record_id', index=['facility_name'],
                    columns=['grade'], aggfunc='count').fillna(0).sort_values(by = 'B', ascending = False).head(5)
table
Out[282]:
grade A B
facility_name
LUCKY DELI 0.0 18.0
FORTUNE GOURMET KITCHEN 0.0 17.0
WON KOK RESTAURANT 0.0 16.0
LILIYA CHINA BISTRO 0.0 13.0
KOURAKU RESTAURANT 11.0 13.0

Lucky Deli appears to be the establishment that received the most violations that resulted in a B rating. Let's examine the types of violations that were given.

In [283]:
facilities = "LUCKY DELI"
lucky = chinatown.loc[chinatown.facility_name == facilities]
lucky.violation_description.unique()
Out[283]:
array(['# 04. Proper eating, drinking, or tobacco use',
       '# 05. Hands clean and properly washed; gloves used properly',
       '# 07. Proper hot and cold holding temperatures',
       '# 14. Food contact surfaces: clean and sanitized',
       '# 26. Approved thawing methods used, frozen food maintained frozen',
       '# 43. Premises; personal/cleaning items; vermin-proofing',
       '# 52. Multiple Major Critical Violations / Increased Risk to Public Health',
       '# 39. Wiping cloths: properly used and stored',
       '# 25. Personal cleanliness and hair restraints',
       '# 30. Food properly stored; food storage containers identified',
       '# 35. Equipment/Utensils - approved; installed; clean; good repair, capacity',
       '# 37. Adequate ventilation and lighting; designated areas, use',
       '# 42. Toilet facilities: properly constructed, supplied, cleaned'],
      dtype=object)

A full reference guide for these violations can be found on http://publichealth.lacounty.gov/eh/docs/WhatsNew/FOIR-RefGuide.pdf. Unfortunately, our dataset does not list the specific reason for the violation under each code so we cannot examine this more closely. From this we move on to the rest of Los Angeles and examine the breakdown of inspections by grade type.

Westwood

We remember that 90024 is the zip code encompassing the Westwood area. Therefore we decide to examine this further as many students and faculty from UCLA are likely to go to these locations.

In [284]:
westwood = data.loc[data.facility_zip == '90024']
westwood.facility_name.nunique()
Out[284]:
68

68 unique establishments in Westwood were handed health violations in 2018. Let us see which establishments received the most and least violations.

In [285]:
westwood.groupby(['facility_name', 'facility_address'])['record_id'].count().sort_values(ascending=False).head(5)
Out[285]:
facility_name       facility_address        
MINISTRY OF COFFEE  1010 GLENDON AVE            30
BARNEY'S BEANERY    1037 BROXTON AVE            29
PARIS BAKERY        1448 WESTWOOD BLVD          24
PINCHES TACOS       1030 GLENDON AVE            23
NAPA VALLEY GRILLE  1100 GLENDON AVE STE 100    22
Name: record_id, dtype: int64
In [286]:
westwood.groupby(['facility_name', 'facility_address'])['record_id'].count().sort_values(ascending=False).tail(5)
Out[286]:
facility_name     facility_address        
EARTHBAR          10960 WILSHIRE BLVD         3
THE BOILING CRAB  10875 W KINROSS AVE         3
BELLA PITA        960 GAYLEY AVE              3
KIKKA SUSHI       308 WESTWOOD PLZ # A233B    3
CARL'S JR         308 WESTWOOD PLZ 505        1
Name: record_id, dtype: int64
In [287]:
table = pd.pivot_table(westwood, values='record_id', index=['facility_name'],
                    columns=['grade'], aggfunc='count').fillna(0).sort_values(by = 'B', ascending = False).head(5)
table
Out[287]:
grade A B
facility_name
NAPA VALLEY GRILLE 5.0 17.0
MINISTRY OF COFFEE 15.0 15.0
BARNEY'S BEANERY 16.0 13.0
TANINO 6.0 13.0
THAI FRESH CAFE 2.0 12.0

We notice that there were no establishments in Westwood that were given a C grade in 2018. Therefore we examine the establishments that were given the most violations that resulted in a B rating. Interestingly, the Carl's Jr. and Kikka Sushi in Ackerman Union received the least amount of health violations in 2018 with respect to the 90024 area. These UCLA establishments manage to maintain good health practices while servicing hundreds of college students each day. We see that Barney's Beanery, a very popular bar for UCLA students, is listed as one of the establishments with the most violations in this area and decide to examine the types of violations they were given.

In [288]:
facilities = "BARNEY'S BEANERY"
barneys = westwood.loc[westwood.facility_name == facilities]
barneys.violation_description.unique()
Out[288]:
array(['# 29. Toxic substances properly identified, stored, used',
       '# 07. Proper hot and cold holding temperatures',
       '# 14. Food contact surfaces: clean and sanitized',
       '# 44. Floors, walls and ceilings: properly built, maintained in good repair and clean',
       '# 40. Plumbing: Plumbing in good repair, proper backflow devices',
       '# 37. Adequate ventilation and lighting; designated areas, use',
       '# 36. Equipment, utensils and linens: storage and use',
       '# 35. Equipment/Utensils - approved; installed; clean; good repair, capacity',
       '# 33. Nonfood-contact surfaces clean and in good repair',
       '# 30. Food properly stored; food storage containers identified',
       '# 27. Food separated and protected',
       '# 09. Proper cooling methods',
       '# 39. Wiping cloths: properly used and stored',
       '# 42. Toilet facilities: properly constructed, supplied, cleaned',
       '# 43. Premises; personal/cleaning items; vermin-proofing',
       '# 06. Adequate handwashing facilities supplied & accessible'],
      dtype=object)

A full reference guide for these violations can be found on http://publichealth.lacounty.gov/eh/docs/WhatsNew/FOIR-RefGuide.pdf. Unfortunately, our dataset does not list the specific reason for the violation under each code so we cannot examine this more closely. From this we move on to the rest of Los Angeles and examine the breakdown of inspections by establishment type.

Proportion of violations summarized by establishment type and grade

In [289]:
table = pd.pivot_table(data, values='record_id', index=['type'],
                    columns=['grade'], aggfunc='count').fillna(0)

total = data.groupby(['type'])['record_id'].count()
table['total'] = total
table = table.divide(total, axis=0)
table['total'] = total
table = table.round(2)
table.sort_values(by = 'total', ascending = False)
Out[289]:
grade A B C total
type
Mexican 0.90 0.09 0.01 1830
Chinese 0.76 0.20 0.04 1009
Coffee & Tea 0.91 0.09 0.00 993
Korean 0.79 0.19 0.02 937
Bakeries 0.80 0.20 0.00 869
Grocery 0.95 0.05 0.00 854
Pizza 0.89 0.09 0.02 786
Japanese 0.92 0.08 0.00 725
Thai 0.88 0.09 0.03 600
American (New) 0.93 0.07 0.00 596
Breakfast & Brunch 0.86 0.12 0.03 475
Italian 0.97 0.03 0.00 470
Burgers 0.84 0.12 0.04 425
Sushi Bars 0.91 0.09 0.00 398
Fast Food 0.96 0.04 0.00 375
Donuts 0.82 0.18 0.00 346
Sandwiches 1.00 0.00 0.00 331
American (Traditional) 0.91 0.09 0.00 315
Cafes 0.95 0.05 0.00 304
Vietnamese 0.78 0.18 0.04 280
Mediterranean 1.00 0.00 0.00 275
Indian 0.95 0.00 0.05 250
Juice Bars & Smoothies 0.94 0.06 0.00 231
Seafood 0.83 0.12 0.05 230
Ramen 0.81 0.19 0.00 229
Chicken Wings 0.95 0.05 0.00 210
Beer, Wine & Spirits 0.97 0.03 0.00 205
Bars 0.87 0.13 0.00 190
Desserts 1.00 0.00 0.00 190
Salvadoran 0.92 0.08 0.00 166
... ... ... ... ...
Skating Rinks 1.00 0.00 0.00 10
Shanghainese 1.00 0.00 0.00 9
Taiwanese 1.00 0.00 0.00 9
Moroccan 1.00 0.00 0.00 8
Venues & Event Spaces 1.00 0.00 0.00 8
Gelato 1.00 0.00 0.00 8
Conveyor Belt Sushi 1.00 0.00 0.00 8
Live/Raw Food 1.00 0.00 0.00 8
Armenian 1.00 0.00 0.00 7
Comfort Food 1.00 0.00 0.00 6
Pool Halls 1.00 0.00 0.00 6
Modern European 1.00 0.00 0.00 6
Patisserie/Cake Shop 1.00 0.00 0.00 6
Home Decor 1.00 0.00 0.00 5
Office Equipment 1.00 0.00 0.00 5
Cuban 1.00 0.00 0.00 5
Food Stands 1.00 0.00 0.00 5
Bookstores 1.00 0.00 0.00 5
Cheesesteaks 1.00 0.00 0.00 5
Custom Cakes 1.00 0.00 0.00 4
Food Trucks 1.00 0.00 0.00 4
Performing Arts 1.00 0.00 0.00 4
Scandinavian 1.00 0.00 0.00 4
Day Spas 1.00 0.00 0.00 4
Piano Bars 1.00 0.00 0.00 3
Halal 1.00 0.00 0.00 3
Men's Clothing 1.00 0.00 0.00 3
Coffee Roasteries 1.00 0.00 0.00 2
Vitamins & Supplements 1.00 0.00 0.00 2
British 1.00 0.00 0.00 1

162 rows × 4 columns

From a preliminary look, we see that Mexican, Chinese, Coffee & Tea, Korean, and Bakery establishments had the most number of individual violations in 2018. Let us examine this further below by examining the proportion of grades by establishment type.

Proportion of grades by establishment type

Top 20 'B' violations

In [290]:
table.sort_values(by = 'B', ascending = False).head(20)
Out[290]:
grade A B C total
type
Colombian 0.00 1.00 0.00 11
Cantonese 0.36 0.64 0.00 58
Strip Clubs 0.50 0.50 0.00 14
Ethiopian 0.64 0.36 0.00 50
Bowling 0.64 0.36 0.00 33
Cajun/Creole 0.64 0.36 0.00 59
Hot Dogs 0.73 0.27 0.00 48
Dive Bars 0.76 0.24 0.00 29
Indonesian 0.76 0.24 0.00 29
Himalayan/Nepalese 0.77 0.23 0.00 30
Pretzels 0.78 0.22 0.00 18
Greek 0.78 0.22 0.00 90
Chinese 0.76 0.20 0.04 1009
Bakeries 0.80 0.20 0.00 869
Filipino 0.80 0.20 0.00 69
Korean 0.79 0.19 0.02 937
Ramen 0.81 0.19 0.00 229
Donuts 0.82 0.18 0.00 346
Vietnamese 0.78 0.18 0.04 280
Stadiums & Arenas 0.82 0.18 0.00 39

Here we see the top 20 establishment types that had the highest proportion of violations resulting in a B grade. 100% of violations for Columbian establishments resulted in a B rating, but only there were only 11 violations for those establishments in 2018. We note that Chinese establishments had a relatively high number of violations handed out to them (1009 violations) and 20% of those resulted in a B rating. Bakeries and Korean establishments also had relatively higher amounts of violations handed out to them (869 and 937 violations respectively), each with 20% and 19% of those violations resulting in B ratings.

Top 20 'C' violations

In [291]:
table.sort_values(by = 'C', ascending = False).head(20)
Out[291]:
grade A B C total
type
Buffets 0.50 0.17 0.33 48
Soul Food 0.84 0.00 0.16 58
Meat Shops 0.80 0.08 0.12 104
Seafood 0.83 0.12 0.05 230
Indian 0.95 0.00 0.05 250
Chinese 0.76 0.20 0.04 1009
Burgers 0.84 0.12 0.04 425
Vietnamese 0.78 0.18 0.04 280
Thai 0.88 0.09 0.03 600
Breakfast & Brunch 0.86 0.12 0.03 475
Pizza 0.89 0.09 0.02 786
Korean 0.79 0.19 0.02 937
Mexican 0.90 0.09 0.01 1830
Middle Eastern 1.00 0.00 0.00 94
Modern European 1.00 0.00 0.00 6
Men's Clothing 1.00 0.00 0.00 3
Mediterranean 1.00 0.00 0.00 275
Acai Bowls 1.00 0.00 0.00 20
Moroccan 1.00 0.00 0.00 8
Music Venues 1.00 0.00 0.00 38

Here we see the top 20 establishment types that had the highest proportion of violations resulting in a C grade. From the top, we see that Buffet type establishments had 33% of their violations result in a C grade. However, only 48 violations in total were handed out to them in total. It is interesting to note that 'Buffet' type establishments have a relatively higher proportion of violations in the 'B' and 'C' ranking range as seen in both tables. After getting a preliminary sense of the violation points and the amount of violations by establishment type, we wish to gain a better understanding of the health score by establishment type. This analysis is conducted below by examining the average health score by establishment type.

What is the average health score by establishment type?

In [292]:
data.groupby(['type'])['score'].mean().sort_values(ascending=False).head(10)
Out[292]:
type
British              99.000000
Coffee Roasteries    98.000000
Piano Bars           97.000000
Halal                97.000000
Performing Arts      96.750000
Drugstores           96.642857
Armenian             95.857143
Discount Store       95.750000
Cheese Shops         95.681818
Department Stores    95.608696
Name: score, dtype: float64
In [293]:
data.groupby(['type'])['score'].mean().sort_values(ascending=True).head(10)
Out[293]:
type
Buffets         82.979167
Colombian       83.000000
Cantonese       86.827586
Strip Clubs     87.500000
Bowling         87.636364
Cajun/Creole    89.288136
Chinese         89.419227
Vietnamese      89.867857
Ethiopian       89.900000
Indonesian      89.931034
Name: score, dtype: float64

From the above analysis, we see the top and bottom establishments by average health score. We notice again that Buffet type establishments have the lowest average health score rating and Chinese establishments appear in the bottom 10 list again as well. We have looked at individual violations, violation scores, and health score averages by establishment type. These are all important factors to keep into mind when choosing a place to go out to, but arguably one of the most important factors individuals take into account when going out is the Yelp rating of these establishments. Below we start our analysis on this highly coveted variable.

Distribution of violations and grades by rating

In [294]:
table = pd.pivot_table(data, values='record_id', index=['rating'],
                    columns=['grade'], aggfunc='count').fillna(0)
total = data.groupby(['rating'])['record_id'].count()
table['total'] = total

table = table.divide(total, axis=0)
table = table.drop('total', axis=1)
table = table.round(2)

total = data.groupby(['rating'])['record_id'].count()
table['total'] = total

table
Out[294]:
grade A B C total
rating
1.0 1.00 0.00 0.00 29
1.5 0.97 0.03 0.00 206
2.0 0.88 0.11 0.01 463
2.5 0.84 0.16 0.00 845
3.0 0.85 0.12 0.03 2271
3.5 0.90 0.09 0.01 5133
4.0 0.91 0.08 0.01 6891
4.5 0.91 0.09 0.00 3591
5.0 0.81 0.19 0.00 359

We see that 1 star restaurants had the highest proportion of A rating inspections, but it also had the least amount of inspections total. Establishments between the 2 to 3 star rating zone contained a moderate amount of B rating inspections compared to other rating categories. A surprising proportion of 5 star restaurants (19%) had B rating inspections but it is also accompanied by a lower sample size. The proportion of C rating inspections remains low for all rating categories.

Distribution of violations and grades by Zipcode

In [295]:
table_zip = pd.pivot_table(data, values='record_id', index=['facility_zip'],
                    columns=['grade'], aggfunc='count').fillna(0)
total_zip = data.groupby(['facility_zip'])['record_id'].count()
table_zip['total_zip'] = total_zip

table_zip = table_zip.divide(total_zip, axis=0)
table_zip = table_zip.drop('total_zip', axis=1)
table_zip = table_zip.round(2)

table_zip['total_violations'] = total_zip
table_zip.sort_values("total_violations", ascending = False).head(10)
Out[295]:
grade A B C total_violations
facility_zip
90012 0.86 0.14 0.00 832
90048 1.00 0.00 0.00 788
90020 0.83 0.15 0.02 781
90005 0.76 0.21 0.02 759
90028 0.91 0.06 0.02 751
90027 0.98 0.02 0.00 751
90024 0.85 0.15 0.00 738
90045 0.80 0.19 0.02 725
90036 0.94 0.06 0.00 674
90026 0.95 0.04 0.01 636

We take another look at the zipcodes with the highest number of violations except this time we break it down by grade type. 90012, the zipcode for Chinatown and Japantown, had the most number of individual violations with 86% of those resulting in an A rating and 14% of those resulting in a B rating. Interestingly, 90048, the zipcode neighboring Beverly Hills and West Hollywood had the second highest number of violations but 100% of those violations resulted in an A rating for those establishments.

Rodents?

One of the least desirable health violations is that involving rodents or insects. Violation code F023 encompasses this category. A full description is as follows: This subcategory is marked MAJOR if any of the following violations are observed:

  • A rodent infestation as evidenced by any of the following: fresh droppings, gnaw marks, nesting materials, grease marks, live or dead rodents in the food preparation, food storage, warewashing areas, indoor customer dining area, or in the restroom. (CLOSURE)
  • A cockroach infestation as evidenced by one or more live cockroaches in the food preparation, food storage, warewashing, indoor customer dining area, or restrooms. (CLOSURE)
  • Fly infestation as evidenced by numerous flies in food preparation areas contacting food, food-contact surfaces or both. (CLOSURE)
  • A food employee caring for or handling animal(s) that is present, such as patrol dogs, service animals, or pets, in areas that are used for food preparation, storage, or display.

As listed in the background information, a violation of 2 points is counted as a minor violation. A violation of more than 4 points is marked as major. Here we wish to see which establishments had a major violation in this category.

In [296]:
data.loc[(data['violation_code'] == 'F023') & (data['points'] > 4)].shape[0]
Out[296]:
53

53 violations in 2018 were for a major violation for F023. Let's examine this more closely.

In [297]:
rodents = data.loc[(data['violation_code'] == 'F023') & (data['points'] > 4)]
rodents.groupby(['type'])['record_id'].count().sort_values(ascending=False)
Out[297]:
type
Mexican                      6
Thai                         4
Korean                       3
Burgers                      2
Chinese                      2
Meat Shops                   2
Pizza                        2
Grocery                      2
Breakfast & Brunch           2
Seafood                      2
Bakeries                     2
Cajun/Creole                 1
Greek                        1
Beer, Wine & Spirits         1
Coffee & Tea                 1
Dive Bars                    1
Donuts                       1
Asian Fusion                 1
Ethiopian                    1
Filipino                     1
Vietnamese                   1
Indian                       1
Himalayan/Nepalese           1
Ice Cream & Frozen Yogurt    1
Indonesian                   1
Japanese                     1
Juice Bars & Smoothies       1
Lounges                      1
Persian/Iranian              1
Poke                         1
Pretzels                     1
Salvadoran                   1
Soul Food                    1
Sushi Bars                   1
American (New)               1
Name: record_id, dtype: int64

We see that 6 major rodent violations were handed out to Mexican establishments in 2018. The next establishments to receive these violations were Thai, Korean, and Pizza places. Next we compile a list of these establishments

In [298]:
rodents.groupby(['type', 'facility_name'])['record_id'].count().sort_values(ascending=False)
Out[298]:
type                       facility_name                      
Meat Shops                 MEATLAND CARNICERIA                    2
Seafood                    MARISCOS EL JATO                       2
Vietnamese                 BLOSSOM RESTAURANT                     1
Coffee & Tea               NATHAN'S KOSHERLAND                    1
Ice Cream & Frozen Yogurt  NEVERIA EL RINCON DE ROSS              1
Himalayan/Nepalese         HIMALAYAN HUT                          1
Grocery                    SEAFOOD RANCH MARKET                   1
                           GALVEZ MARKET                          1
Greek                      ALIKI'S GREEK TAVERNA                  1
Filipino                   KUSINA FILIPINA                        1
Ethiopian                  AWASH RESTAURANT                       1
Donuts                     CALIFORNIA DONUTS                      1
Dive Bars                  KING EDDY SALOON                       1
Chinese                    TASTY WOK CUISINE                      1
Indonesian                 RAMAYANI                               1
Chinese                    ORIENTAL HOUSE RESTAURANT              1
Cajun/Creole               DELICIOUS SOUTHERN CUISINE             1
Burgers                    THE OINKSTER                           1
                           JUICY WINGZ                            1
Breakfast & Brunch         SALT'S CURE                            1
                           JEWEL                                  1
Beer, Wine & Spirits       BROCKTON LIQUOR                        1
Bakeries                   SEOUL BAKERY                           1
                           CONTINENTAL BAKERY                     1
Asian Fusion               FIST OF FUSION                         1
Indian                     AMAR DESH INDIAN RESTAURANT            1
Juice Bars & Smoothies     EARTH ORGANIC JUICE BAR                1
Japanese                   OKI JAPAN                              1
Thai                       WAT DONG MOON LEK NOODLE               1
                           PINK PEPPER                            1
                           LOVE SHRIMP THAI THAI                  1
                           IT'S THAI HOMETAURANT                  1
Sushi Bars                 SUSHI MASU                             1
Soul Food                  ROSCOE'S HOUSE OF CHICKEN & WAFFLES    1
Salvadoran                 LA PUPUSA LOCA                         1
Pretzels                   AUNTIE ANNE'S                          1
Poke                       KOPE POKI                              1
Pizza                      KING COLE PIZZA                        1
                           CICCERO'S PIZZA                        1
Persian/Iranian            FARSI CAFE                             1
Mexican                    TACOS LOS CARNALES                     1
                           RESTAURANTE Y PUPUSERIA LA LIBERTAD    1
                           LA FLOR DE SAHUAYO                     1
                           GOLFO DE FONSECA RESTAURANT            1
                           CHILE VERDE                            1
                           BIRRIERIA NOCHISTLAN                   1
Lounges                    KARMA LOUNGE                           1
Korean                     KOGINARA                               1
                           GOOPOJIB                               1
                           CHIMAC STAR                            1
American (New)             NAPA VALLEY GRILLE                     1
Name: record_id, dtype: int64

3) Significant Difference between Mexican and Chinese Violations?

We wish to examine if there is a significant difference in the proportion of violations for Mexican and Chinese Restaurants in terms of A's, B's, and C's. We will use the two proportion z-test to evaluate our findings.

In [299]:
table = pd.pivot_table(data, values='record_id', index=['type'],
                    columns=['grade'], aggfunc='count').fillna(0)

total = data.groupby(['type'])['record_id'].count()
table['total_violations'] = total

establishment_count = data.groupby(['type'])['facility_name'].nunique()
table['establishment_count'] = establishment_count

table.sort_values(by = 'total_violations', ascending = False).head(2)
Out[299]:
grade A B C total_violations establishment_count
type
Mexican 1655.0 158.0 17.0 1830 218
Chinese 767.0 197.0 45.0 1009 80

A - Violations

In [300]:
counts = np.array([1655, 767])
nobs = np.array([1830, 1009])

zscore, pvalue = proportions_ztest(counts, nobs, alternative = 'two-sided')
print('zscore = {:.3f}, pvalue = {:.3f}'.format(zscore, pvalue))
zscore = 10.390, pvalue = 0.000

Our results indicate that there is a significant difference in the proportion of Mexican violations resulting in an A (90.0%) and the proportion of Chinese violations resulting in an A (76.0%). The proportion of Mexican violations resulting in an A is significantly higher.

B - Violations

In [301]:
counts = np.array([158, 197])
nobs = np.array([1830, 1009])

zscore, pvalue = proportions_ztest(counts, nobs, alternative = 'two-sided')
print('zscore = {:.3f}, pvalue = {:.3f}'.format(zscore, pvalue))
zscore = -8.397, pvalue = 0.000

Here, our results indicate that there is a significant difference in the proportion of Mexican violations resulting in a B (8.6%) and the proportion of Chinese violations resulting in a B (19.5%). The proportion of Chinese violations resulting in a B is significantly higher.

C - Violations

In [302]:
counts = np.array([17, 45])
nobs = np.array([1830, 1009])

zscore, pvalue = proportions_ztest(counts, nobs, alternative = 'two-sided')
print('zscore = {:.3f}, pvalue = {:.3f}'.format(zscore, pvalue))
zscore = -6.161, pvalue = 0.000

Lastly, our results indicate that there is a significant difference in the proportion of Mexican violations resulting in a C (1.0%) and the proportion of Chinese violations resulting in a C (4.5%). The proportion of Chinese violations resulting in a C is significantly higher.

It is interesting to note that there are almost 3 times as many Mexican establishments as Chinese establishments, but Chinese establishments have significantly higher proportions of B and C violations and a significantly lower proportion of A violations. Mexican establishments may have an overall higher amount of violations in general, but the majority of those violations (90%) resulted in an A grade.