As travelers, we have a great opportunity to take a closer look at flight data and visualize some interesting facts. The analysis focuses mainly on the United States.
# csv is a package for file reading/IO
import csv
# pandas and numpy are the data analysis libraries
import pandas as pd
import numpy as np
The first three files are from OpenFlights, where additional data is also available.
As of January 2012, the OpenFlights Airports Database contains 6977 airports spanning the globe. Each entry contains the following information:

Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of airport. May or may not contain the City name.
City: Main city served by airport. May be spelled differently from Name.
Country: Country or territory where airport is located.
IATA/FAA: 3-letter FAA code for airports located in Country "United States of America"; 3-letter IATA code for all other airports. Blank if not assigned.
ICAO: 4-letter ICAO code. Blank if not assigned.
Lat: Latitude in decimal degrees, usually to six significant digits. Negative is South, positive is North.
Lon: Longitude in decimal degrees, usually to six significant digits. Negative is West, positive is East.
Alt: Altitude in feet.
Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
Tz database time zone: Timezone in "tz" (Olson) format, e.g. "America/Los_Angeles".

The data is ISO 8859-1 (Latin-1) encoded, with no special characters.
Note: Rules for daylight savings time change from year to year and from country to country. The current data is an approximation for 2009, built on a country level. Most airports in DST-less regions in countries that generally observe DST (e.g. AL and HI in the USA, NT and QL in Australia, parts of Canada) are marked incorrectly.
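Before loading the file with pandas, we can use the csv module imported earlier to peek at the raw rows — a minimal sanity check, assuming airports.dat.txt sits in the working directory (the encoding follows the Latin-1 note above):
# Minimal sketch: print the first three raw rows of the airports file
with open("airports.dat.txt", encoding="latin-1") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i == 2:
            break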
airports_df = pd.read_csv("airports.dat.txt", header=None,
                          names=["airport_id", "airport_name", "airport_city",
                                 "airport_country", "airport_iata", "airport_icao",
                                 "lat", "lon", "alt", "timezone", "dst", "tz"])
airports_df.head()
As of January 2012, the OpenFlights Airlines Database contains 5888 airlines. Each entry contains the following information:
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline. For example, All Nippon Airways is commonly known as "ANA".
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: Country or territory where airline is incorporated.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago but have not had their IATA code reassigned (e.g. Ansett/AN) will incorrectly show as "Y".

The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for "NULL" to indicate that no value is available, and is understood automatically by MySQL if imported.
airlines_df = pd.read_csv("airlines.dat.txt", header=None,
                          names=["airline_id", "airline_name", "airline_alias",
                                 "airline_iata", "airline_icao", "airline_callsign",
                                 "airline_country", "active"])
airlines_df.head()
As of January 2012, the OpenFlights/Airline Route Mapper Route Database contains 59036 routes between 3209 airports on 531 airlines spanning the globe. Each entry contains the following information:
Airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
Airline ID: Unique OpenFlights identifier for airline (see Airline).
Source airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
Source airport ID: Unique OpenFlights identifier for source airport (see Airport).
Destination airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
Destination airport ID: Unique OpenFlights identifier for destination airport (see Airport).
Codeshare: "Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.
Stops: Number of stops on this flight ("0" for direct).
Equipment: 3-letter codes for plane type(s) generally used on this flight, separated by spaces.

The data is ISO 8859-1 (Latin-1) encoded. The special value \N is used for "NULL" to indicate that no value is available, and is understood automatically by MySQL if imported.
routes_df = pd.read_csv("routes.dat.txt", header=None,
                        names=["airline_iata_icao", "airline_id", "origin",
                               "origin_id", "dest", "dest_id", "code_share",
                               "stops", "equipment"])
routes_df.head()
The last file, On_time_On_Time_Performance_2016_10.csv, covers flight on-time performance for October 2016. It was downloaded from TranStats, and its details are documented in readme.html. Each entry contains the following information:
YEAR: year of the flight.
MONTH: month of the flight.
DAY_OF_MONTH: day of the month.
AIRLINE_ID: DOT airline identifier (which we will not use).
CARRIER: 2-letter (IATA) code of the airline.
FL_NUM: flight number.
ORIGIN: 3-letter (IATA) code of the source airport.
DEST: 3-letter (IATA) code of the destination airport.
ARR_DELAY_NEW: arrival delay in minutes, with early arrivals set to 0.
CANCELLED: cancellation indicator (1 = cancelled).
import requests
from io import StringIO

# Download the CSV; the last column is empty (the source file ends each
# row with a trailing comma), so we drop it before naming the columns
remote = requests.get('https://docs.google.com/uc?export=download&id=1PPtjGx8lr_cDUfVa3qwlk1W8yY6hY91n').content
flights_df = pd.read_csv(StringIO(remote.decode('utf-8')))
flights_df = flights_df.iloc[:, :-1]
flights_df.columns = ["year", "month", "day_of_month", "airline_id", "carrier",
                      "fl_num", "origin", "dest", "arr_delay_new", "cancelled"]
flights_df.head()
To get incidents information, we can scrape data from the aircraft_incidents.htm webpage.
from bs4 import BeautifulSoup

# Parse the saved HTML page
input_html = "aircraft_incidents.htm"
with open(input_html, "r") as ifile:
    soup = BeautifulSoup(ifile, 'html.parser')
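As a quick check that the parse worked, we can count the tags we are about to walk through (the exact numbers depend on the saved page):
print(len(soup.find_all("h3")), "h3 headings and", len(soup.find_all("li")), "list items")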
Now we need to do some further cleaning to both the tabular and text data.
We are going to clean the airlines_df, airports_df, and routes_df DataFrames.
# Given the value `\N`, return NaN; otherwise return the value unchanged.
def nullify(cell):
    if type(cell) == str and cell.find("\\N") > -1:
        return np.nan
    return cell
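A quick check of nullify on a few sample values (made-up inputs):
print(nullify("\\N"), nullify("ANA"), nullify(3.5))  # nan ANA 3.5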
# Replace NaNs with blanks if the column holds strings,
# so every column has a consistent type
def fillna_col(series):
    if series.dtype == object:
        return series.fillna('')
    return series
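A small demonstration of fillna_col on a toy frame (hypothetical data): the string column gets blanks while the numeric column keeps its NaN.
demo = pd.DataFrame({"name": ["a", np.nan], "value": [1.0, np.nan]})
print(demo.apply(fillna_col))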
Next, we apply nullify and fillna_col to our DataFrames: applymap applies nullify to every element of each DataFrame (airports, airlines, etc.) to get rid of all of the "\N"s, and apply then calls fillna_col on each column, replacing the NaNs with blank strings wherever the column is otherwise an object. Additionally, we want to get rid of rows in routes_df that have null airline, source, or destination IDs.
# Drop the \N markers in airlines_df and airports_df, then fill
# string-column NaNs with blanks (fillna_col operates per column, so axis=0)
airlines_df = airlines_df.applymap(nullify)
airlines_df = airlines_df.apply(fillna_col)
airports_df = airports_df.applymap(nullify)
airports_df = airports_df.apply(fillna_col)
# Have a look at the NAs in routes_df
print("The number of missing values for each column:")
for col in routes_df.columns:
    print(col, " ", routes_df[col].isnull().sum())
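The same summary is available as a one-liner, if preferred:
print(routes_df.isnull().sum())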
There are no NAs in the airline, origin, and dest columns; only code_share and equipment contain missing values.
routes_df["code_share"].unique()
routes_df["equipment"].unique()
routes_df["code_share"] = routes_df["code_share"].fillna("N")
routes_df["equipment"] = routes_df["equipment"].fillna("Unknown")
# Check the results
if airports_df.isnull().sum().sum() != 0:
    raise ValueError('airports_df still has NaNs')
if airlines_df.isnull().sum().sum() != 0:
    raise ValueError('airlines_df still has NaNs')
if routes_df.isnull().sum().sum() != 0:
    raise ValueError('routes_df still has NaNs')
To clean the raw text data, we want to strip away the h3 and li tags and keep only their text. We also want to make sure that we keep only the years and the incidents, without extra info. We notice that all year information ends with "[edit]", and that each incident description contains a month name as well as a "–".
# Clean the text data using the above assumptions
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
incidents_clean = []

# True if the string contains a month name
def find_month(string):
    for month in months:
        if string.find(month) > -1:
            return True
    return False

for info in soup.find_all(["h3", "li"]):
    if "[edit]" in info.text:
        # h3 year headings look like "2016[edit]"
        year = info.text.replace("[edit]", "")
        print("Cleaning the incidents data in " + year)
    elif find_month(info.text) and ("–" in info.text):
        incident = year + " " + info.text
        incidents_clean.append(incident)

incidents_clean[:5]
incidents_clean[-5:]
We want to create incidents_df to store the incidents data.
# requires the NLTK 'punkt' and 'stopwords' resources (via nltk.download)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words_cap = [word.capitalize() for word in stop_words]

# Remove stop words (lowercase or capitalized) from a sentence
def remove_stop(sentence):
    word_tokens = word_tokenize(sentence)
    filtered_sentence = " ".join([w for w in word_tokens
                                  if w not in stop_words and w not in stop_words_cap])
    return filtered_sentence

# Walk backwards from a word starting with "Air" to collect the
# consecutive capitalized words that form the airline name
def find_airline(raw_list):
    airline = []
    for word in raw_list:
        if word.startswith("Air"):
            for i in range(raw_list.index(word), -1, -1):
                if raw_list[i] == raw_list[i].capitalize():
                    airline.insert(0, raw_list[i])
                else:
                    break
    return airline
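A quick sanity check of both helpers on made-up inputs (not from the dataset):
print(remove_stop("The aircraft was damaged on the runway"))  # -> aircraft damaged runway
print(find_airline(["crew", "of", "All", "Nippon", "Airways"]))  # -> ['All', 'Nippon', 'Airways']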
incidents_list = []
for incident in incidents_clean:
    incident = remove_stop(incident)
    incident_record = []
    date_all = incident.split(" – ")[0].split(" ")
    year = date_all[0]
    month = date_all[1]
    day = date_all[2]
    # Keep only well-formed records that mention a flight number
    # (days have at most two digits)
    if ("Flight" in incident) and (len(year) == 4) and (month in months) and (len(day) <= 2):
        incident_record.append(" ".join(date_all))
        incident_record.append(year)
        incident_record.append(month)
        incident_record.append(day)
        # The tokens before the word "Flight" describe the airline
        flight_index = incident.split(" – ")[1].split(" ").index("Flight")
        airline_info = incident.split(" – ")[1].split(" ")[:flight_index]
        # Drop any leading clause before punctuation
        if "," in airline_info:
            airline_info = airline_info[airline_info.index(",") + 1:]
        if ";" in airline_info:
            airline_info = airline_info[airline_info.index(";") + 1:]
        if ":" in airline_info:
            airline_info = airline_info[airline_info.index(":") + 1:]
        if len(airline_info) > 5:
            airline_info = find_airline(airline_info)
        airline_info = " ".join(airline_info)
        # The token right after "Flight" is the flight number
        flight_num = "".join(incident.split(" – ")[1].split(" ")[flight_index + 1:flight_index + 2])
        incident_record.append(airline_info)
        incident_record.append(flight_num)
        incidents_list.append(incident_record)

incidents_df = pd.DataFrame(incidents_list, columns=["date", "year", "month", "day", "airline", "fl_num"])
incidents_df.head()
incidents_df["date"] = pd.to_datetime(incidents_df["date"])
Now we can export the data to a database to make it persistent.
import sqlite3
engine = sqlite3.connect('flight_db')
airlines_df.to_sql('airlines', engine, if_exists='replace', index=False)
airports_df.to_sql('airports', engine, if_exists='replace', index=False)
flights_df.to_sql('flights', engine, if_exists='replace', index=False)
routes_df.to_sql('routes', engine, if_exists='replace', index=False)
incidents_df.to_sql('incidents', engine, if_exists='replace', index=False)
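We can confirm that the tables landed in the database by querying sqlite_master, SQLite's built-in catalog of schema objects:
print(pd.read_sql_query("select name from sqlite_master where type='table'", engine))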
Now, we want to check if we can read tables from the database.
airlines_df2 = pd.read_sql_query('select * from airlines', engine)
airports_df2 = pd.read_sql_query('select * from airports', engine)
flights_df2 = pd.read_sql_query('select * from flights', engine)
routes_df2 = pd.read_sql_query('select * from routes', engine)
incidents_df2 = pd.read_sql_query('select * from incidents', engine)
airlines_df2.head()
airports_df2.head()
flights_df2.head()
routes_df2.head()
incidents_df2.head()
All five tables are stored in the database and can be read back. Next, we want to do some data wrangling and visualization.
First, we would like to explore the flight route dataset routes_df2. Let's learn more about the really busy airports in routes_df2: we look at the number of flights out of each source airport, restricted to airports with more than 100 such flights.
busy_origin = (routes_df2.groupby("origin").size()
               .reset_index(name="count")
               .query("count > 100")
               .sort_values("count", ascending=False))
print("There are", len(busy_origin), "airports with more than 100 flight routes for departure.")
print("The top 10 busy source airports are:", busy_origin[:10]["origin"].unique())
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white")
ax = sns.barplot(x="origin", y="count", data=busy_origin[:10],
                 color="#2c7fb8", alpha=0.8)
ax.set_title("Top 10 Busy Airports")
plt.xlabel("airports as origin")
plt.ylabel("count")
plt.show()
We also care about airport connectivity, so we want to have a look at pairs of origins and destinations.
route_od = routes_df2.groupby(["origin", "dest"]).size().reset_index(name = "count").sort_values("count", ascending = False)
route_od.head()
route_od["od_set"] = route_od[["origin","dest"]].apply(lambda x: "_".join(list(set(x))), axis=1)
od_summary = route_od.groupby("od_set").agg({"count":"sum"}).reset_index().sort_values("count",ascending = False)
od_summary.columns = ["od_set","total_count"]
od_summary.head()
od_summary[["airport1", "airport2"]] = od_summary["od_set"].str.split("_",expand=True,)
od_summary = od_summary.merge(airports_df2[["airport_iata", "lat", "lon","airport_country"]], how = "inner", left_on = "airport1",right_on = "airport_iata")
od_summary.columns.values[-3:] = ["lat1","lon1","country1"]
od_summary.drop("airport_iata",axis=1,inplace=True)
od_summary.head()
od_summary = od_summary.merge(airports_df2[["airport_iata", "lat", "lon","airport_country"]], how = "inner", left_on = "airport2",right_on = "airport_iata")
od_summary.columns.values[-3:] = ["lat2","lon2","country2"]
od_summary.drop("airport_iata",axis=1,inplace=True)
od_summary.head()
import folium

m = folium.Map(location=[40, 1], zoom_start=2, tiles="CartoDB dark_matter")

# Scale a count into [0, 0.8] so the busiest routes draw the brightest lines
def scale_maxmin(num, maxvalue, minvalue):
    return (num - minvalue) * 0.8 / (maxvalue - minvalue)

maxcount = od_summary["total_count"].max()
mincount = od_summary["total_count"].min()

# Draw one line per origin-destination pair, with opacity tied to route count
for index, row in od_summary.iterrows():
    pt = ([row["lat1"], row["lon1"]], [row["lat2"], row["lon2"]])
    opacity = scale_maxmin(row["total_count"], maxcount, mincount)
    folium.PolyLine(locations=pt, color="#edf8b1", opacity=opacity).add_to(m)
m
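To keep the rendered map outside the notebook, folium can also write it to a standalone HTML file (the filename here is arbitrary):
m.save("routes_map.html")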