I referred to the following websites.
I used the following libraries.
# SQL
import sqlite3 as sql
# YOUR FAVORITE DATA SCIENCE LIBRARIES
import pandas as pd
import numpy as np
# DISPLAY PLOTS INLINE
%matplotlib inline
# MAKE 'EM PLOTS BIG
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 12)
# MAPS
from mpl_toolkits.basemap import Basemap
import geopandas as gpd
import shapely.geometry as geom
In the following exercise, I will be extracting information about the food trucks from the SQL database.
# LOAD IN DATA BASE
db = sql.connect("sf_data.sqlite")
# QUERY TOP 10
top10 = pd.read_sql("SELECT Applicant, count(Applicant) AS num_permits FROM mobile_food_permits AS mfp "
"INNER JOIN (SELECT DISTINCT locationid, permit FROM mobile_food_schedule) "
"AS mfs ON mfp.permit=mfs.permit WHERE mfp.Status=='APPROVED' GROUP BY mfp.Applicant "
"ORDER By num_permits DESC LIMIT 10", db)
top10
Park's Catering sells in the most areas in the City, holding 23 permits for 23 locations. D&T Catering is second with 17 locations, and F&C Catering is third with 13 locations. In the printed output above, we see the rest of the top 10.
# INITIALIZE GRAPH OBJECTS
objects = list(top10.Applicant.values)
objects_abb = [name[:10] for name in objects]
y_pos = np.arange(len(objects))
# PLOTTING DATA
plt.bar(y_pos, top10["num_permits"][:15], color="orange")
# #Set up the graphs
plt.xticks(y_pos, objects_abb, rotation = "horizontal")
plt.ylabel("Number of Unique Locations")
plt.title("Number of Locations of Food Trucks in the City")
plt.show()
As I progress through Exercise 1.2, I will be analyzing some topics that I find most relevant, having been born and raised in the City.
(1) What are the most and least expensive parts of the City?
(2) What are the most common types of crime? Where do crimes happen most?
(3) Where do homicides happen in the City?
(4) What is the relationship between elementary and high schools in the City?
(5) What is the distribution of high schools in the City?
I will first select the median prices per square foot of homes and the zip codes of the corresponding region.
Question 1 Contents
(a) Most Expensive Zipcode Regions of the City
(b) Least Expensive Zipcode Regions of the City
(c) Map Visualization of Median Home Price per Zipcode Region
prices_per_zip = pd.read_sql("SELECT RegionName, MedianSoldPricePerSqft_AllHomes FROM zillow WHERE MedianSoldPricePerSqft_AllHomes>0 GROUP BY RegionName", db)
prices_per_zip = prices_per_zip.sort_values(by="MedianSoldPricePerSqft_AllHomes", ascending=False)
top3_bottom3 = prices_per_zip.iloc[[0, 1, 2, len(prices_per_zip)-3, len(prices_per_zip)-2, len(prices_per_zip)-1]]
top3_bottom3
94104: Downtown, the Financial District
94111: Embaracdero
94133: North Beach
As a San Francisco native, I am no alien to the bustle and lure of the downtown area. The Financial District is the center of business in downtown, decorated with the tallest skyscrapers in the city. As expected, the type of housing in this area is loft and apartment style in tall, glamorous buildings suitable for the monied professionals that work in the area.
The Embaracdero, one of the first neighborhoods in sight as you enter the city through the Bay Bridge, is one of the city's most popular tourist areas, boasting a beautiful view of the newly built Bay Bridge, a concert grounds, and museums along the walk of the coast. The apartments in that area also high rise with a timeless feel which are unsurprisingly expensive.
North Beach, not far off from the two neighborhoods previously mentioned packs in the same San Franciscan integrity, but moreso in large concrete homes than in multistory apartment buildings. Easy access to downtown facilities and the homes being large would definitely make the area more expensive.
94134: Visitacion Valley
94124: Hunter's Point
94158: Mission Bay
Visitacion Valley "the Viz" is the neighborhood in which I was born and raised. The area is known for violence and unkept streets. Growing up, my mom frequently had to clean grafitti off our walls (I do think that San Franciscan graffiti is a charm, it was not pretty on our house), and we were not unaccustomed to car break ins and crime in the area. My father was the manager in the Walgreens in the Viz, and he was riddled with catching shoplifters on a daily basis. Despite the homey look and feel of the Viz, the amount of crime makes it unappealing to the bougie newcomers.
Hunter's Point and Mission Bay are both known for their low-income housing. However, due to the gentrification (or redevelopment, if you're unfortunately on the wrong political side of this issue), prices on homes even in these areas are pushing more and more San Francisco natives out of the city.
# LOAD IN ZIP CODES
zips = gpd.read_file("zip/cb_2015_us_zcta510_500k.shp")
zips.head()
# NARROW IT DOWN SOME
zips = zips[zips.ZCTA5CE10.str.startswith("9")]
# WE ONLY WANT SF ZIPCODES
sf_zips = [str(x) for x in prices_per_zip["RegionName"].values]
# GET ONLY THE ZIPCODES WE WANT
zips = zips.loc[zips["ZCTA5CE10"].isin(set(sf_zips))]
# ADD ZIP CODES TO OTHER DATAFRAME TO ALIGN THEM
zips["RegionName"] = zips["ZCTA5CE10"]
zips = zips.reset_index().sort_values(by="RegionName")
# ONLY THE COLUMNS WE WANT
zips = zips[["RegionName", "geometry"]].reset_index(drop=True)
# SORT IN ORDER TO MERGE DATA FRAMES INTO ONE
prices_per_zip = prices_per_zip.reset_index(drop=True).sort_values(by="RegionName")
# CREATE DATA FRAME WITH GEOMETRY, REGION, AND PRICE
df_col_merged = pd.concat([zips, prices_per_zip["MedianSoldPricePerSqft_AllHomes"]], axis=1)
#REMOVE SOUTH CITY (SOUTH SAN FRANCISCO IS NOT IN SAN FRANCISCO!)
df_col_merged = df_col_merged[df_col_merged["RegionName"] != "94080"].reset_index(drop=True)
df_col_merged.head()
df_col_merged.plot(column='MedianSoldPricePerSqft_AllHomes', cmap="RdPu")
plt.title("Median Home Price Per Zip Code")
The most expensive zip code areas above are the darkest shade of purple. The less expensive areas are cream and pale pink.
The median value of median home price per zipcode region in San Francisco is 601.74k in according to this dataset. Below is a histogram of the median home prices per region. The horizontal line indicates the median home price of 601.74k. We see that there are still many zip codes below the median price, however there are many regions with drastically higher median home prices. Note that this is obviously a relatively high median price to other areas in California, such as Sacramento.
np.median(prices_per_zip["MedianSoldPricePerSqft_AllHomes"])
median_prices = sns.barplot(x="RegionName", y="MedianSoldPricePerSqft_AllHomes", data=prices_per_zip,
label="Total", color="lightblue")
for item in median_prices.get_xticklabels():
item.set_rotation(45)
plt.axhline(y=np.median(prices_per_zip["MedianSoldPricePerSqft_AllHomes"]), linewidth=5, color="black", alpha=0.3)
plt.title("Median Price for Homes")
plt.show()
Question 2 Contents
(a) In which Police Districts do crimes happen most?
(b) What crimes are happening in this district?
(c) What are the most San Franciscan types of crime, i.e. overall most common crimes?
# INITIALIZE DATA FRAME
crime = pd.read_sql("SELECT Descript, PdDistrict FROM crime", db)
crime.head()
# DATAFRAME OF DISTRICTS ONLY
districts = pd.DataFrame(crime["PdDistrict"].value_counts())
# TOTAL CRIMES PER DISTRICT
sns.barplot(x=districts.index.values, y=districts["PdDistrict"], data=districts,
label="Total", color="orange")
plt.title("Total Crimes Per Police District")
plt.show()
<>
The Southern Police District has the most crimes.
As we can see from the map above, the Southern Police District refers to South of Market as well as Treasure Island. While SoMa is home to a lot of company offices, it is also home to some tougher communities. Once driving through SoMa, I saw a dead man in the middle of the street, all his neighbors out upon the sidewalk, having witnessed the scene. Unfortunately, I drove right to the right of this dead man because there were no Crime Scene tapes up yet. This happened to me as I was dropping off my sister to her office at Square, a tech company. I think this may give you a good idea of the SoMa climate.
Treasure Island is a less inhabited part of San Francisco where crime is probably easier to commit.
We'll take a look by day of the week for the first 20 crimes in the dataframe for Southern Police District, to keep our analyses less computationally expensive.
southern_crime = pd.read_sql("SELECT Descript, DayOfWeek FROM crime WHERE PdDistrict='SOUTHERN'", db)
southern_crime.head()
# GRAB FIRST 20 CRIMES
first20crimes = southern_crime["Descript"].values[0:20]
dow_counts = [southern_crime["DayOfWeek"][southern_crime["Descript"]==x].value_counts() for x in first20crimes]
# GET A DATAFRAME SUMMARY
southern_dow_summary = pd.DataFrame(dow_counts, top20socrimes).drop_duplicates()
southern_dow_summary
# CALCULATE MOST / LEAST
most_dow = southern_dow_summary.idxmax(axis=1)
least_dow = southern_dow_summary.idxmin(axis=1)
# MOST, LEAST, AND ON AVERAGE DATAFRAME
southern_dow_summary_2 = pd.concat([most_dow.to_frame(), least_dow.to_frame(), southern_dow_summary.mean(axis=1).to_frame()], axis=1)
southern_dow_summary_2.columns = ["Committed Most On", "Committed Least On", "Average Committed Per Day"]
southern_dow_summary_2.sort_values(by="Average Committed Per Day", ascending=False)
The above dataframe summarizes 20 crimes in the Southern Police District. In this district, we see that the first 20 crimes of the dataset happen more on weekends: Fridays, Saturdays, and Sundays. It is interesting to see that posession of marijuana, mentally disturbed cases, and driving mishaps happen less on Sundays, the day that many people worship God in the city.
# DATA FRAME OF CRIME TYPES
crime_types = pd.DataFrame(crime["Descript"].value_counts())
crime_types = sns.barplot(x=crime_types.index.values[0:10], y=crime_types["Descript"][0:10][0:10], data=crime_types,
label="Total", color="lightblue")
for item in crime_types.get_xticklabels():
item.set_rotation(90)
plt.title("Top 10 Most Frequently Committed Crimes")
plt.show()
The most common type of crime is Grand Theft from a Locked Automobile. Other common crimes have to do with lost property, bad driving, and being mentally ill.
One crime that is particularly of interest in any city is murder. Is it more likely that you may get killed walking down the street in any certain area in San Francisco rather than others? The answer is yes.
# READ IN HOMICIDE DATA
homicides = pd.read_sql("SELECT Descript, PdDistrict, Lon, Lat FROM crime WHERE Descript LIKE '%HOMICIDE%'", db)
homicides.head()
The total number of homicides observed in this dataset is 677.
print "Homicides in this dataset: ", homicides.shape[0]
# INITIALIZE BASEMAP
homicide_map = Basemap(projection="merc", llcrnrlon=-122.6, llcrnrlat=37.65, urcrnrlon=-122.3, urcrnrlat=37.85, resolution="h", area_thresh = 0.1)
# DRAW IN LINES AND SHAPEFILE
homicide_map.drawcoastlines()
homicide_map.drawmapboundary()
homicide_map.drawcounties()
homicide_map.readshapefile("SF Find Neighborhoods/geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53", "geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53")
# RECOLOR
homicide_map.fillcontinents(color="darkgrey")
# ADD SCATTER
for lon, lat, desc in zip(homicides.Lon, homicides.Lat, homicides.Descript):
x, y = homicide_map(lon, lat)
# DANGEROUS WEAPON
color = "c"
# GUN
if desc == "ATTEMPTED HOMICIDE WITH A GUN":
color = "r"
# KNIFE
if desc == "ATTEMPTED HOMICIDE WITH A KNIFE":
color = "y"
# BODILY FORCE
if desc == "ATTEMPTED HOMICIDE WITH BODILY FORCE":
color = "b"
homicide_map.plot(x, y, color + "o", markersize = 5, alpha = 0.7)
plt.title("San Francisco Homicides")
plt.show()
Red - Homicide by Gun | Cyan - Homicide by Dangerous Weapon | Yellow - Homicide by Knife | Blue - Homicide by Bodily Force
Homicides are more frequent on the Eastern sides of San Francisco. According to this map, homicides happen more frequently in the Eastern Areas of San Francisco. This includes downtown areas, Visitacion Valley, and Mission Bay. The areas with less homicides include the Richmond and Sunset (places that are closer to the beach and have plenty of restaurants and homes) and the Lake Merced area that San Francisco State University is at.
Recall that in Question 2, we calculated that the Southern Police District had more crimes than all other districts. In this particular district alone, 93 homicides took place. That is, 13.737% of homicides happened in the Southern Police District, and also notable: none of the homicides happened on Treasure Island.
southern_crime_homicides = pd.read_sql("SELECT Descript FROM crime WHERE PdDistrict='SOUTHERN' and Descript LIKE '%HOMICIDE%'", db)
print "Homicides in Southern PD: ", southern_crime_homicides.shape[0]
print 93*(677**(-1))*100, "% of homicides from this dataset were in the Southern PD"
Let's explore! I will be defining elementary school to be schools for which take students that are in the 8th grade or lower. High schools are schools that have an upper grade of higher than 8th grade. Elementary schools will be colored in cyan blue and the high schools will be in dark blue.
# INITIALIZE DATA FRAME FROM SQL DATABASE
schools = pd.read_sql("SELECT * FROM schools WHERE UpperGrade < 13", db)
# INITIALIZE BASEMAP
school_map = Basemap(projection="merc", llcrnrlon=-122.6, llcrnrlat=37.65, urcrnrlon=-122.3, urcrnrlat=37.85, resolution="h", area_thresh = 0.1)
# DRAW IN LINES AND SHAPEFILE
school_map.drawcoastlines()
school_map.drawmapboundary()
school_map.drawcounties()
school_map.readshapefile("SF Find Neighborhoods/geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53", "geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53")
# RECOLOR
school_map.fillcontinents(color="darkgrey")
# ADD SCATTER
for lon, lat, upper in zip(schools.Lon, schools.Lat, schools.UpperGrade):
x, y = school_map(lon, lat)
# HIGH SCHOOLS
color = "b"
# ELEMENTARY SCHOOLS
if upper <= 8:
color = "c"
school_map.plot(x, y, color + "o", markersize = 5, alpha = 0.8)
plt.title("Elementary and High Schools")
plt.show()
Red - High Schools | Cyan - Elementary Schools
There are obviously many more elementary schools than high schools. But what is less obvious is that there are 6 elementary schools for every 1 high school. This would suggest that enrollment in high schools is far greater than in elementary schools.
Elementary schools are more or less evenly distributed around the city, but high schools are clustered together in certain areas. That is, not all students have an equal lengthed commute after they finish elementary school. Increased classroom sizes in high schools could be an educational disadvantage to students in San Francisco, which is awfully cruel considering the home prices in the area.
# NUMBER OF ELEMENTARY SCHOOLS DIVIDED BY NUMBER OF HIGH SCHOOLS
schools[schools["UpperGrade"]<=8].shape[0] * ((schools[schools["UpperGrade"]>8].shape[0])**(-1))
Below, I will plot high schools based on category of high school. High schools are defined to have a "lower grade" of 9.
highschools = pd.read_sql("SELECT * FROM schools WHERE LowerGrade=9 ORDER BY Category", db)
# INITIALIZE BASEMAP
hs_map = Basemap(projection="merc", llcrnrlon=-122.6, llcrnrlat=37.65, urcrnrlon=-122.3, urcrnrlat=37.85, resolution="h", area_thresh = 0.1)
# DRAW IN LINES AND SHAPEFILE
hs_map.drawcoastlines()
hs_map.drawmapboundary()
hs_map.drawcounties()
hs_map.readshapefile("SF Find Neighborhoods/geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53", "geo_export_dd4c4138-2501-498e-86f4-06d7827b2d53")
# RECOLOR
hs_map.fillcontinents(color="darkgrey")
# ADD SCATTER
for lon, lat, category in zip(highschools.Lon, highschools.Lat, highschools.Category):
x, y = hs_map(lon, lat)
color = "k"
# YELLOW PUBLIC SCHOOLS
if category == "USD Grades 9-12":
color = "y"
# CHARTER SCHOOLS
if category == "USD Charter School":
color = "c"
# PRIVATE SCHOOLS
if category == "Independent / Private":
color = "m"
hs_map.plot(x, y, color + "o", markersize = 7, alpha = 0.7)
plt.title("High Schools in San Francisco by Type")
plt.show()
Magenta - Private | Cyan - USD Charter School | Yellow - USD Grades 9-12
The following plot shows high schools plottted upon the map of zip codes, shaded by price of housing. What is apparent in this map is that wealthier areas have more high schools available to the area. In general, wealthier areas do not necessarily have higher high schooler populations than do areas of less wealth in San Francisco.
highschools_per_zip = df_col_merged.plot(column='MedianSoldPricePerSqft_AllHomes', cmap="Greens")
for lon, lat, category in zip(highschools.Lon, highschools.Lat, highschools.Category):
x, y = (lon, lat)
color = "k"
# PRIVATE SCHOOLS
if category == "Independent / Private":
color = "c"
highschools_per_zip.plot(x, y, color + "o", markersize = 10, alpha=0.7)
plt.title("High Schools Per Zip Code")
bounds = df_col_merged.geometry.bounds
plt.xlim([bounds.minx.min()-0.01, bounds.maxx.max()+0.005])
plt.ylim([bounds.miny.min()-0.01, bounds.maxy.max()+0.01])
Black - Public | Cyan - Private
There are less high schools in less wealthy areas of San Francisco.
This could be because there are a quite uniform number of private and public schools throughout the City, and that private schools only cater to areas that have higher income families.
Some areas completely lack private schools.
We can easily see that private schools are clustered near darker green map regions, meaning that higher income families have access to these schools.
There are 20 public high schools and 16 private high schools in San Francisco. There are only 1.25 public schools for every private school.
I find this statistics unimpressive. Public high schools in San Francisco are overcrowded and underfunded in general. That is, many schools lack basic programs that neighboring high school districts have such as: band programs, sex ed programs, and AP classes. The fact that private schools are established much further away from students who live in less wealthy areas in San Francisco limits their mobility in many different ways, including socioeconomic. However, it may be the case that private schools have been established in historically wealthier areas in order to flourish by donation.
# NUMBER OF HIGH SCHOOLS
total = highschools.shape[0]
public = highschools[highschools["Entity"]=="SFUSD"].shape[0]
print "Total number of high schools: ", total
print "Total number of public schools: ", public
print "Total number of private schools: ", total-public
# PUBLIC DIVIDED BY PRIVATE
public * ((total-public)**(-1))
# CLOSE THE DATABASE
db.close()