Planning To Work On

Data Management

Between the AVAS and APC datasets, we have a vast pool of information. The AVAS data consists of Bus State History at the timepoint level. We have over five years of data (from 2007 - 2012) totalling 20GB. The APC dataset is at the bus-stop level and one quarter is approximately 16GB (we have data for one whole year). With such large files, data management is highly important.

S3 - Data Storage

S3 is the Amazon Simple Storage Service. It is designed to make web-scale computing easier for developers. Each object (from 1 byte to 5 terabytes in size) is stored in a bucket and retrieved via a unique, developer-assigned key. The major benefits of S3 include (1) security, (2) reliability, (3) speed, (4) scalability, and (5) simplicity, all of which we require. Our plan is to set up S3 for our data and use Amazon's EMR for processing.

EMR - Data Processing

EMR is Amazon's Elastic MapReduce. We hope it will allow us to quickly process our vast amount of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). One concern addressed below is whether we can utilize MapReduce in settings where there may be correlation between observations in different buckets.
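To make the map and reduce steps concrete, here is a minimal in-process sketch in plain Python. It does not use EMR or Hadoop. The record fields `route` and `value` and the sample records are hypothetical, chosen only to show how records are grouped into buckets by key and how each bucket is then summarised on its own.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (bucket key, value) pair. The key here is the route."""
    return record["route"], record["value"]


def reduce_bucket(values):
    """Reduce step: summarise one bucket without looking at any other bucket."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def run_map_reduce(records):
    buckets = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        buckets[key].append(value)  # shuffle: group values by key
    return {key: reduce_bucket(values) for key, values in buckets.items()}


# Made-up records, for illustration only.
records = [
    {"route": "A", "value": 2.0},
    {"route": "A", "value": 4.0},
    {"route": "B", "value": -1.0},
]
summary = run_map_reduce(records)
```

Because `reduce_bucket` only ever sees one bucket, any dependence between records that land in different buckets is invisible to it. That is the source of the independence concern raised above.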

Model Complexity vs. Running Time Tradeoffs

As we make the model more complex, we are better able to capture the intricate dependencies that may exist, at the cost of additional computation. We have simulated passenger count data for a month, and on this dataset we find that:

Including a new factor level increases run time by 20 seconds.
Choosing the sum-to-zero constraint on factor levels is considerably more expensive than the corner constraint (it increases run time by ~50 seconds).
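For readers unfamiliar with the two constraints, the sketch below shows how one factor with K levels could be coded as a design-matrix row under each. The function names are hypothetical and this is not the project's model code. It only illustrates the two parameterisations and says nothing about why one runs slower.

```python
def corner_coding(level, n_levels):
    """Corner constraint: level 0 is the baseline and is coded as all zeros."""
    return [1 if level == j else 0 for j in range(1, n_levels)]


def sum_to_zero_coding(level, n_levels):
    """Sum-to-zero constraint: the last level's effect is minus the sum of the others."""
    if level == n_levels - 1:
        return [-1] * (n_levels - 1)
    return [1 if level == j else 0 for j in range(n_levels - 1)]


# Rows for a three-level factor under each coding.
corner_rows = [corner_coding(level, 3) for level in range(3)]
sum_zero_rows = [sum_to_zero_coding(level, 3) for level in range(3)]
```

Both codings use K - 1 columns. Under the corner constraint each coefficient is a contrast with the baseline level, while under the sum-to-zero constraint each coefficient is a deviation from the average over levels.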

Statistics

Censored Observations

The APC data consists of stop-level information about the number of people on the bus, getting on, and getting off (IN, ON, and OFF respectively). We hope to use this information to infer the frequency with which people arrive at each stop. As discussed in the Planning Summary, we plan on using Poisson Regression with random effects to estimate the inhomogeneous rate parameters.

One major difficulty is that the observations do not coincide directly with the passenger arrivals. For example, if a bus has 40 passengers IN and a maximum capacity of 60, then the observed count must fall in {0,...,20}. Thus, we have censored observations in which the maximum possible observed count is a function of bus capacity and IN. Moreover, if a bus is at capacity, then the passengers it leaves behind wait for the next bus, which will likely record a higher count than normal. So censoring also biases the counts for the following buses.
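One standard way to treat the first problem is a right-censored Poisson likelihood. The sketch below is an assumption about how that could look, not the method the project settled on. A count that reaches the remaining room on the bus contributes P(Y >= room), and any other count contributes the usual Poisson probability. It uses a single constant rate, with no regression terms or random effects, and it does not address the spillover onto following buses. All function names are hypothetical.

```python
import math


def poisson_logpmf(k, rate):
    """log P(Y = k) for Y ~ Poisson(rate), with rate > 0."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def poisson_log_tail(k, rate):
    """log P(Y >= k), computed as log(1 - P(Y <= k - 1))."""
    if k <= 0:
        return 0.0
    below = sum(math.exp(poisson_logpmf(j, rate)) for j in range(k))
    return math.log(max(1.0 - below, 1e-300))  # guard against rounding to <= 0


def censored_loglik(rate, observations):
    """Log-likelihood of (ON, IN, capacity) triples under right censoring."""
    total = 0.0
    for on, passengers_in, capacity in observations:
        room = capacity - passengers_in
        if on >= room:
            # Bus filled up: we only know that at least `room` people wanted to board.
            total += poisson_log_tail(room, rate)
        else:
            total += poisson_logpmf(on, rate)
    return total
```

With the example from the text (40 passengers IN, capacity 60, 20 boardings), the observation is treated as "at least 20 arrivals" rather than "exactly 20", so it no longer pulls the estimated rate down toward the capacity limit.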

Unfortunately, we cannot simply discard these observations and work from the uncensored data, because the censored observations occur during the peak hours, which are the hours we are most interested in. We are currently working on extending the Poisson Regression framework to handle both of these problems.

Overlapping Buses

Our main input on the supply side is the set of DELTA times at specific timepoints on a route. These correspond to how many minutes a bus is ahead of or behind schedule at a specific location on the route. DELTA = ACTUAL TIME - SCHEDULED TIME, so a positive DELTA implies behind schedule. However, we see a spike in the distribution at DELTA = 10. We have been told this is a consequence of buses coming into the terminal and switching to an outward trip. This switch has an inward timepoint and an outward timepoint. The thought is that the switch to the outward trip is being delayed to right before the start of the inward trip and thus appears as if the bus is behind schedule. One of our goals is to figure out what is causing this spike around DELTA = 10, and to address it in the estimation and simulation steps.
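A first diagnostic for the spike could be as simple as tabulating DELTA by whole minute and comparing the count at 10 with its neighbours. The sketch below takes already-computed DELTA values in minutes. The function names and the sample values are made up for illustration and are not taken from the AVAS data.

```python
import math
from collections import Counter


def delta_histogram(deltas):
    """Count DELTA values per whole minute (halves round up)."""
    return Counter(math.floor(d + 0.5) for d in deltas)


def spike_ratio(hist, minute):
    """Count at `minute` relative to the mean count of its two neighbours."""
    baseline = (hist.get(minute - 1, 0) + hist.get(minute + 1, 0)) / 2
    if baseline == 0:
        return math.inf
    return hist.get(minute, 0) / baseline


# Made-up DELTA values with an artificial spike at 10 minutes.
deltas = [9.0] * 2 + [10.0] * 10 + [11.0] * 2 + [0.2, -1.4, 3.6]
hist = delta_histogram(deltas)
ratio_at_10 = spike_ratio(hist, 10)
```

Running the same tabulation separately for terminal and non-terminal timepoints would be one way to check whether the spike is confined to the trip switch described above.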

Incorporating Correlations and Assumptions of Independence and Map-Reduce Methods

Currently we plan on using EMR to segment the data into manageable pieces (called buckets) for analysis. If we analyze the buckets separately, then we are implicitly assuming independence across the different buckets. Therefore, we must be careful at the data segmentation phase not to unknowingly break up observations that are highly dependent. For the AVAS data, the observed DELTA times at a specific timepoint will be roughly independent of observations at timepoints far away in both space and time. Therefore, one strategy would be to bucket the data by Route or Pattern. This would assume that the DELTA times are independent between buses that don't share a pattern but go through the same section of the city. Given traffic patterns as a

