Database schema notes

Introduction

As of 2015-09-15, the gtfs-archive uses flat-files and directories to track archived GTFS feeds. This is a sketch of a database layout which I plan to use instead.

Feeds

create table feed_lists
( id SERIAL            -- Surrogate key
, url text             -- Where to find new copies of this feed_list. 
                       -- Remote if begins with "http(s)://", local file if "file://".
, description text     -- Description of feed_list, for example "Oregon Public Feeds".
);

An open question, how do we deal with feeds which have existed in previous versions of the feed_list, but not in the current version? Ought we to obsolete all feeds from a particular feed_list each time we fetch a new copy of the feed_list?

create table feeds -- one for each entry of feed_lists
( id SERIAL
, feed_list integer         -- FK to feed_lists
, import_time timestamptz   -- Time feed_list was last fetched, which contained this feed.
, feed_name text            -- Copied from feed_list CSV.
, feed_description text     --  As above.
, gtfs_zip_url text         --  As above.
, license_url text          --  As above.
);

Should we set a constraint on the maximum size of a gtfs_file?

create table gtfs_file
( feed integer                -- FK feeds
, last_modified timestamptz   -- What time did the server report this file was last modified?
, zip_data bytea              -- Full copy of the zip file. Often multiple megabytes.
);

Download_hostname seems important in case we ever run multiple copies of the archive from separate hosts, yet imcomplete. Since in theory, there could be multiple gtfs-archive instances running on a host. Perhaps a gtfs_archive_instance table which maps accordingly? The goal is for Clojure run-times to be able to determine which process is downloading a file, so that they can stop, restart, or monitor it. Honestly, should we be using CURL instead of the Java download code? It seems much better, the only downside being we would drop Windows server compatibility which, so far as I am aware, no-one is using.

create table download_status
( feed integer                -- FK to feeds
, download_began timestamptz  -- When did we begin the download?
, download_hostname text      -- From which gtfs-archive host is the download happening?
, file_name                   -- What's the Unix path to the file name we're downloading to?
, status text                 -- IN ('started', 'finished', 'error', ...)
);

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database schema notes

Introduction

Feeds

Clone this wiki locally