This project was completed as a small-scale team project for the YCBS 257 Data at Scale class in the Professional Development Certificate Program in Data Science and Machine Learning at McGill University. It introduced MapReduce functions for solving Big Data problems.

ahmedopolis/Flight_Distance_Calculation_with_MapReduce

 
 


Flight Distance Calculation with MapReduce

This project was completed as a team project for the YCBS 257 Data at Scale class in the Professional Development Certificate Program in Data Science and Machine Learning at McGill University. It introduced MapReduce functions for solving Big Data problems.

From the given flight data, our team extracted the relevant fields and used MapReduce to calculate the distance between Beijing and each flight's position.

How to use this repository

This repository has the following main directories and files:

Directories

  • data: flight data as JSON strings
  • output files: outputs from the mapper and reducer programs, and from the process of creating the CSV output file
  • images: diagrams created for summary of processes and components

Files

  • mapper.ipynb: mapper program in Python
  • reducer.ipynb: reducer program in Python
  • PPT Slides.pdf: slides from the group presentation

Workflow Overview

The project was completed in the following five steps:

  • Step 1. Input Dataset: Verify and review the input dataset (data type, format and structure) before processing
  • Step 2. Build Mapper: Write a mapper program that keeps only the position messages and extracts each flight's id, clock, ident, latitude and longitude
  • Step 3. Build Reducer: Write a reducer program that takes the last position of each flight and calculates its distance to Beijing
  • Step 4. Create CSV List: From the reducer output file, produce a CSV list of all flights (ident, id, and distance to Beijing) sorted from closest to furthest from Beijing
  • Step 5. Data Analysis: Analyze the output file of sorted flight distances and summarize the results

[Diagram: mapper]

[Diagram: reducer]
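As a rough illustration of Step 2, the following is a minimal sketch of a mapper, not the code in mapper.ipynb. The field names `type`, `lat` and `lon`, the value `"position"`, and the sample records are assumptions made for this example; only `id`, `ident`, `clock`, latitude and longitude are named in this README.

```python
import json


def map_positions(lines):
    """Yield (flight_id, record) pairs for position messages only."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON strings
        # Assumption: position messages are marked with type == "position".
        if msg.get("type") != "position":
            continue
        try:
            record = {
                "clock": int(msg["clock"]),
                "ident": msg["ident"],
                "lat": float(msg["lat"]),  # assumed field name
                "lon": float(msg["lon"]),  # assumed field name
            }
            flight_id = msg["id"]
        except (KeyError, TypeError, ValueError):
            continue  # skip messages with missing or non-numeric fields
        yield flight_id, record


# Hypothetical input lines, made up for this example.
sample = [
    '{"type": "position", "id": "F1", "ident": "ABC123", '
    '"clock": "1500000100", "lat": "40.1", "lon": "116.6"}',
    '{"type": "flightplan", "id": "F2", "ident": "XYZ9"}',
    '{"type": "position", "id": "F1", "ident": "ABC123", '
    '"clock": "1500000000", "lat": "41.0", "lon": "117.0"}',
]

# Sort by flight id, then by clock, so the reducer sees each flight's
# positions together and in time order.
mapped = sorted(map_positions(sample), key=lambda kv: (kv[0], kv[1]["clock"]))
for flight_id, record in mapped:
    print(json.dumps({"key": flight_id, "value": record}))
```

Sorting by key between the map and reduce phases is what lets the reducer process one flight at a time.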

Data Requirements

The mapper program was developed against an input dataset of 19,404 JSON strings in a text file. The reducer program was developed against the sorted output file of the mapper program, which was also a text file of JSON strings.

The details of the input dataset were as follows:

[Image: input dataset details]

Distance Calculation

Our team used the Haversine formula to calculate the distance between Beijing and each data point in the dataset, where positions were given as latitudes and longitudes: [Image: formula]
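For reference, the standard form of the Haversine formula can be written as below. The symbols are chosen for this note and may differ from those in the project's image: φ is latitude and λ is longitude (both in radians), Δφ and Δλ are the differences between the two points, and R is the Earth's radius (about 6,371 km on average).

```latex
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right)

c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)

d = R\,c
```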

The Haversine formula was broken down into three parts and implemented in the reducer program to calculate the distances as follows:

[Image: haversine_details]
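A three-part Haversine calculation could look like the sketch below. This is not necessarily how reducer.ipynb splits it. The Beijing coordinates (city centre, roughly 39.9042° N, 116.4074° E) and the mean Earth radius of 6,371 km are assumptions of this example; the project may have used different reference values.

```python
import math

EARTH_RADIUS_KM = 6371.0       # mean Earth radius (assumed value)
BEIJING = (39.9042, 116.4074)  # approximate city-centre coordinates (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    # Part 1: square of half the chord length between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Part 2: angular distance in radians.
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    # Part 3: scale by the Earth's radius.
    return EARTH_RADIUS_KM * c


# One degree of latitude along a meridian is a handy sanity check.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

Using `atan2` rather than `asin` keeps the calculation well behaved when the two points are nearly antipodal.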

Key Outputs

The mapper program produced the flight data as JSON objects, mapped to key-value pairs and sorted by key (text file: 'sorted_mapped_flight_data.txt'). This file was then passed to the reducer program as its input. The reducer program produced 9,747 JSON strings in a text file (text file: 'reduced_flight_data.txt'), each reduced to the 'id', 'ident' and 'distance' fields, where 'distance' was calculated from the 'latitude' and 'longitude' data by the Haversine function within the reducer program.

The final output was a CSV file (CSV file: 'flight_list_sorted_by_distance.csv') with the flight data sorted by distance in ascending order. To produce it, the JSON objects in the reducer output text file were loaded into a Pandas dataframe, sorted, and exported to CSV.

Additional data analysis was conducted in Jupyter Notebook (Jupyter Notebook file: 'reducer.ipynb'), with the following results:
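The reduce step described above can be sketched as follows. This is an illustration under assumptions, not the code in reducer.ipynb: the record layout (`clock`, `ident`, `lat`, `lon`), the Beijing coordinates, the rounding to two decimals, and the sample pairs are all hypothetical.

```python
import math
from itertools import groupby
from operator import itemgetter

EARTH_RADIUS_KM = 6371.0                    # assumed mean Earth radius
BEIJING_LAT, BEIJING_LON = 39.9042, 116.4074  # assumed reference point


def distance_to_beijing_km(lat, lon):
    """Haversine distance in kilometres from (lat, lon) to Beijing."""
    phi1, phi2 = math.radians(lat), math.radians(BEIJING_LAT)
    dphi = math.radians(BEIJING_LAT - lat)
    dlambda = math.radians(BEIJING_LON - lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def reduce_last_position(sorted_pairs):
    """For each flight id, keep the latest position and compute its distance.

    `sorted_pairs` must already be sorted by flight id, as groupby only
    groups consecutive items with the same key.
    """
    for flight_id, group in groupby(sorted_pairs, key=itemgetter(0)):
        last = max((record for _, record in group), key=itemgetter("clock"))
        yield {
            "id": flight_id,
            "ident": last["ident"],
            "distance": round(distance_to_beijing_km(last["lat"], last["lon"]), 2),
        }


# Hypothetical mapper output, already sorted by flight id.
pairs = [
    ("F1", {"clock": 100, "ident": "ABC123", "lat": 41.0, "lon": 117.0}),
    ("F1", {"clock": 200, "ident": "ABC123", "lat": 39.9042, "lon": 116.4074}),
    ("F2", {"clock": 150, "ident": "XYZ9", "lat": 31.2, "lon": 121.5}),
]
reduced = list(reduce_last_position(pairs))
for row in reduced:
    print(row)
```

Taking the maximum `clock` within each group means the result does not depend on the positions being in time order inside a flight's group.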
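The sort-and-export step could be sketched as below. The project used a Pandas dataframe for this; the sketch swaps in the standard library's `json` and `csv` modules so that it runs without dependencies. The function name and the sample records are hypothetical.

```python
import csv
import io
import json


def write_sorted_csv(json_lines, out):
    """Write reducer output to CSV, sorted by ascending distance."""
    rows = [json.loads(line) for line in json_lines if line.strip()]
    rows.sort(key=lambda row: row["distance"])
    writer = csv.DictWriter(
        out,
        fieldnames=["ident", "id", "distance"],
        extrasaction="ignore",  # drop any fields not listed above
    )
    writer.writeheader()
    writer.writerows(rows)
    return rows


# Hypothetical reducer output lines, made up for this example.
reduced_lines = [
    '{"id": "F2", "ident": "XYZ9", "distance": 1050.4}',
    '{"id": "F1", "ident": "ABC123", "distance": 12.5}',
]

buffer = io.StringIO()  # stands in for an open output file
sorted_rows = write_sorted_csv(reduced_lines, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open(path, "w", newline="")`.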

[Image: summary of analysis results]
