Skip to content

Latest commit

 

History

History
85 lines (54 loc) · 4.9 KB

13e-airline_flights.asciidoc

File metadata and controls

85 lines (54 loc) · 4.9 KB

Domain Models

Here are the models for Airport, Airline and Flight:

{{ d['code/munging/airline_flights/airport.rb|idio']['airport_model'] }}

Here’s a snippet of the raw data:

{{ d['data/airline_flights/dataexpo_airports-raw-sample.csv|snippet'] }}

And here’s what it looks like, transformed from raw to target model:

{{ d['tmp/airline_flights/dataexpo_airports-parsed-sample.tsv|snippet|wulign'] }}
{{ d['tmp/airline_flights/openflights_airports-parsed-sample.tsv|snippet|wulign'] }}
{{ d['tmp/airline_flights/airport_identifiers-sample.tsv|snippet|wulign'] }}
Airplane Models
link:code/munging/airline_flights/airplane.rb[role=include]
Airline Models
link:code/munging/airline_flights/airline.rb[role=include]
Flight Models
link:code/munging/airline_flights/flight.rb[role=include]

Data Extraction

Most of the extraction is straightforward. For reasons explained below, we have two sources of airline and airplane data, plus a gazette recon

Recovering Time Zone

So far, the airline data is fairly straightforward to import. However, Loki the Trickster rarely stays clear when it comes to adapting datasets across domains. The flight data has local actual/scheduled times, and it has airports, and it has the date — but it has neither the absolute time nor the time zone.

So, you need a map from airports to time zones. Good news: Openflights.org has that data. In fact, it’s more comprehensive and adds some other interesting columns. Bad news: its data is somewhat messier and its identifiers don’t cleanly reconcile against the airline_flights table. So, you need a 'gazette': a unified table listing the IATA, ICAO and FAA id of each airport. Wikipedia has a table indexing airports by IATA (and some, but not all, pairings with ICAO and FAA); and a table indexing airports by ICAO (and some, but not all, pairings with IATA and FAA) — and they mostly agree, but with a couple hundred (about 2% of nearly 10,000 airports) in conflict.

If you’re keeping track: for want of a time zone, we need an airport-TZ map; for want of common identifiers, we need an ICAO-IATA-FAA identifier gazette; for want of clean resolution anywhere we end up reconciling two datasets against two different gazettes and hand-curating the identifiable errors in the mapping [1]. I won’t go into the boring details of reconciling the airports: they’re boring, and detailed in the code (see munging/airline_flights/reconcile*.rb).

But it’s important to share that there is no royal road to clean data. It’s easy to account for the work required to correct the obvious, surface messiness in the data. The common case is that correcting the 2% of outliers, reconciling conflicting assertions, dealing with ill-formatted records or broken encodings, and the rest of the "chimpanzee" work takes more time than anything else involved in semi-structured data extraction. Most importantly, that work is not a programming problem — it requires you to understand obscure details from the source domain not otherwise needed to solve your problem at hand

[2]

note:[Note that I both converted the altitude figure to meters and rounded it to one decimal place.]

Foundational Data

Exemplars & Mountweazels

For exemplar airports, I’ll chose AUS (Austin), SFO (San Francisco), and BWI (Baltimore-Washington) because I’m most familiar with them; YYZ (Toronto), HNL (Honolulu), ANC (Anchorage) and SJU (San Juan, Puerto Rico) for geographic spread, and PHX (Phoenix) because Arizona (like the preceding three) has an odd time zones. That set is reasonably diverse, so there’s no need for a mountweazel.

Helpful Sort

Standard Sample

I also selected a core set of 50 airports that cover most of the top US metropolitan areas (and includes the exemplars above)

sample, so we could construct the following reduced datasets:

  • airports-sampled.tsv

  • flights-sampled.tsv — flights between core airports


1. "Yak Shaving": a recursively unbound descent into the sunk-cost fallacy
2. I now know far more about the pecadilloes of international airport identifier schemes than I ever wished to know. Airports may have an IATA id, an ICAO id, and (in the US and its territories) an FAA id. In the continental US, the ICAO is always the FAA id preceded by a "K": Austin-Bergstrom airport has FAA id AUS, IATA id AUS and ICAO id KAUS. However: * Not all airports have ICAO ids, and not all airports have IATA ids. * sThe FAA id often, but not always, matches the IATA id; this is the primary source of errors in the airport metadata, as people blithely assign the FAA id to an IATA-id-less airport.</li> <li>There is yet another identifier, the METAR id, used to identify the weather station at an airport; it was once the same as the ICAO id but they are now maintained independently.</li></ul> Yet, somehow, all those planes typically land at the right place.