-
Notifications
You must be signed in to change notification settings - Fork 0
/
thesis.tex
479 lines (357 loc) · 66.4 KB
/
thesis.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
%###
% \RequirePackage[l2tabu, orthodox]{nag}
\documentclass[a4paper,10pt,twoside]{book}
% \usepackage[width=127mm,height=191mm,includehead,hcentering,vcentering]{geometry}
% \includeonly{_init,basic-concepts,raster}
% \includeonly{__init,k3tree}
\input{___header}
\makeglossaries
\begin{document}
%\epstopdfsetup{outdir=./}
\pagenumbering{roman}
\pagestyle{fancy}\fancyfoot{}\fancyhead{}
\fancyhead[LO]{\slshape\nouppercase{\rightmark}}
\fancyhead[RE]{\slshape\nouppercase{\leftmark}}
\fancyhead[RO,LE]{\slshape \thepage}
\frontmatter
\input{_acronyms.tex}
\include{_init}
\mainmatter
\chapter{Introduction}
\section{Motivation}
The last years have seen a widespread adoption of technological methods focused in registering the movements of individuals, most notably the \textit{smartphone} apps for navigation and map visualization, which often collect the followed GPS trajectories. Similar applications have also been adopted by companies that manage fleets of vehicles, such as taxi and emergency services. When the data about a large amount of these individual trajectories is collected and aggregated, it can be used to infer mobility patterns \cite{liu2012understanding} or, if the collection is complete enough, to model a traffic scenario from the collected sample, as for example was shown in \cite{jiang2009street} with the streets of Hong Kong.
Moreover, in the context of public transportation networks, transportation companies have adopted numerous advances in wireless technologies, sensor networks (especially those related to RFID) and ubiquitous computing, leading to a widespread adoption of passenger tracking technology by public transportation services, making the collection of large amounts of data about the travel habits of these passengers\footnote{Alternatively called ``users'' in the context of transportation companies.} easier than ever before.
This in turn has opened the door for the exploitation of this kind of information to study the demand (usage) of a network, as opposed to the well-known techniques to analyze the offer (routes, timetables, etc...).
%For these applications, it is not the data about individual trajectories but measures of the use of the network what matters for traffic monitoring and road planning tasks.
Examples of useful analysis tasks can be the measures of accessibility and centrality indicators, referred to how easy is to reach different locations or how important certain stops are within a network~\cite{Morency2007193, El-Geneidy2011, Wang2015335}. All these measures are based on some kind of counting queries that determine the number of \textit{distinct} trips that occur within a spatial and/or temporal window.
To enable these new kinds of \textit{demand} studies, it is imperative to develop mechanisms to efficiently persist and manage these vast (and always increasing) collections of data. When we also take into account that efficient query patterns need to be supported for this data to be ``useful'', the solution clearly constitutes an emerging technological challenge that is being approached from several different domains, and hundreds of ad-hoc solutions have been implemented by all the \textit{Smart Cities} around the globe.
%In transportation systems, new technologies such as automatic fare collection (e.g., smartcards) and automatic passenger counting have made possible to generate a huge amount of highly detailed,
%real-time data useful to define measures that characterize a transportation network. This data is particularly useful because it actually consists of real trips, combining implicitly the service offered by a public transportation system with the demand for the system. When having this data, it is not the data about individual trajectories but measures of the use of the network what matters for traffic monitoring and road planning tasks. Examples of useful measures are accessibility and centrality indicators, referred to how easy is to reach locations or how important certain stops are within a network~\cite{Morency2007193, El-Geneidy2011, Wang2015335}. All these measures are based on some kind of counting queries that determine the number of \textit{distinct} trips that occur within a spatial and/or temporal window.
Therefore, it follows that a practical representation that supports efficient indexing for not only collected GPS trajectories, but also collections of trips over a transportation network, would have numerous possible applications.
In \cite{tu2018spatial} we can see how it is possible to combine GPS trajectories with \gls{afc} data to recreate complete trips and study the ridership by area. Alternatively, in \cite{weng2018mining} the complete trips are inferred from the \gls{afc} data, to later analyze behaviour patterns and preferences of the travelers with the goal of improving the efficiency of the network. Another application that is enabled by such analysis is targeted advertising \cite{zhang2017targeted}, as the interests of a user can be profiled by their travel patterns. Other works focus on analyzing the usage of individual stops or stations, such as \cite{ceapa2012avoiding}, where the authors determine that congestion times in the metro network of London are predictable and occur in narrow time intervals. Armed with such information, an user may choose a different travel pattern to avoid the crowd and enhance their overall experience.
When we consider practical studies focused on trajectories over street networks, we can find works centered around studying taxi ridership. One notable example is \cite{yuan2013t}, that discusses a two-way taxi recommendation system, where taxi drivers are pointed to the most profitable parking spaces while passengers are directed to the street segments with a high probability of finding a vacant taxi.
One key observation from all the works referenced above is that a mere collection of trajectories or time-stamped points over a two-dimensional space of latitude and longitude would not be rich enough to perform these studies. They are therefore required to work with a representation that allows for some degree of \textit{semantic} information. At the very least, that information must include references to network elements (stops, lines or streets), and sometimes even some (anonymized) user identifier. Therefore, we require a representation that differs from the traditional spatial indexes and databases, as it must support efficient access methods based on network elements.
\section{Problem definition}
\label{sec:pd}
Considering the nature of the discussed problems, and also the different sources of information, we have identified two distinguishable contexts for transportation analysis:
%Considering the underlying network, we have identified two distinguishable contexts for transportation networks:
\begin{figure*}[t]
\begin{center}
\begin{tabular}{cc}
\includegraphics[width=0.4\textwidth]{figures/road-network.png} &
\includegraphics[width=0.4\textwidth]{figures/mtl-metro-cut.png} \\
\end{tabular}
\end{center}
\caption{Example of a typical urban street layout (left) and a subway network (right).}
\caption*{\small Sources: \cite{boeing2017osmnx} (left), \url{http://www.stm.info/fr/infos/reseaux/metro} (right)}
\label{fig:networks}
\end{figure*}
\subsection{Trajectories over urban streets}
In this context, a trajectory can start anywhere, at any time, and follow any arbitrary path of segments along of streets, which can have either of two alternative graph representations: assign a vertex for each intersection and edges for street segments that connect these intersections, or alternatively, assign a vertex for each street segment with no intersections and edges that connect navigable street segments.
In any case, the geolocated trajectory must be mapped to a path of street segments in order to enable analyzing the movement patterns between locations (which are always bound to streets in urban contexts) and average traffic load at a given time of the day.
Trajectories of taxis, bicycles, or vehicle fleets fall into this context. For these systems, queries of interest may involve retrieving the points of interest where these trajectories could end, or street segments that could be part of a given path.
Our definition also requires to be able to speak for a concept of time.
For these trajectories where an object moves freely over the network of streets, a representation of the time intervals when each street segment was traversed may be considered. In order to achieve a compact representation, time would be expressed in discrete intervals ranging from one to thirty minutes. Larger sizes for these time intervals may not be practical since many of the trajectories would be completed in less than thirty minutes, thus fitting completely in the same interval, thus defeating the purpose of representing a time interval for every segment of the path.
\subsection{Trips over public transportation networks}
In this case, instead of individual trajectories we consider \textbf{trips}, which must start at predefined points (stops or stations) at set times that are defined by the vehicles that stop at those points. These vehicles follow predefined paths along these points, following lines.
Therefore, a trip follows a path along a network of stops and lines, which can be formed over a street network as in the case of buses, or hold very little relation to the streets, as in the case of a subway. However, even when the transportation network is related to a street network, it is more interesting to define the trips using the stops and lines from the transportation network, as enumerating the individual street segments that are traversed would have a high redundancy with no clear benefit.
Note that defining user trips over these network elements will still produce redundant collections, as users that travel in the same vehicle would produce identical parts of trips. This property is one of the keys to achieve a compact representation, since the analysis tasks revolve around the usage and trip patterns of network elements (stops and lines) instead of individual trips.
This context applies to bus and metro systems, along train lines and periurban buses.
%In this context, most of the queries of interest will revolve around the main network elements, which are routes and stops.
In this working context, we will operate with a \textit{network model}.
%which tends to be quite simple in urban street networks, merely consisting of a directed graph with street segments as nodes, where connection to other street segments indicate that direct navigation is possible.
For a public transportation network, it could be pertinent to consider a more rich representation than a mere graph of stops and lines, thus also taking into account the routes formed by transportation vehicles, such as buses or trains, visiting stops at set times and allowing commuters to board or alight at them. A well-known model that includes these network elements, among others less interesting for our problem, is the \gls{gtfs}\footnote{\url{https://developers.google.com/transit/gtfs/}}, which is widely adopted by open data platforms in numerous \textit{Smart Cities}.
Therefore, a \textit{trip} will be defined as a path formed by a sequence of stops, that was traversed by a single passenger/commuter in one trip, with an origin and a final destination. In this definition, we must consider some practical limitations to the nature of a trip, as one could argue whether commuters that take more than one hour to switch a line are actually switching or just finalized their initial trip and are starting a second one with some new destination. These cases are complicated to unambiguously decide in practice, and therefore our approach will tend to set hard limits on waiting times and walking distances between stops for a single trip.
In a network model where the routes are formed by transport vehicles that follow lines, there would be no need for representing the exact time at which each user has boarded on a stop. We will only have to asign a route identifier, as the stopping times would be already available in our modelled network, thus avoiding some redundancy in the representation of trips.
Massive data collection techniques exist for both of the contexts discussed above, as will be later seen in Section~\ref{sec:dm}, leading to the problem of efficiently handling these vasts amounts of information that both contexts produce. Apart from the usual well-known \textit{Big Data} solutions (Hadoop, Spark, Druid\ldots), there is an ongoing research line on the application of succint data structures for some of these goals. In particular, it is possible to apply many of the techniques from the field of \textit{Compact Data Structures} to build autoindexed representations that support efficient query patterns tailored for specific information needs, while offering some sort of compression with respect to a more traditional representation.
A usable solution would also require an user interface that enables the exploitation of this information by researchers, transportation companies, city administrations and any other kind of end users. This interface must, at the very least, allow visualizing the network elements on a map, also granting the ability to make queries over these elements in an intuitive and responsive way, while respecting the usual quality principles of any user-oriented software of this kind.
\section{Contributions}
As the two contexts from Section~\ref{sec:pd}~lead to different ways of structuring and querying the information, it is natural to expect at least two different representations, one for each context. For this reason, in this work we have applied compact data structures and techniques to design novel representations that are able to handle massive collections of data related to user moving and transportation habits. Our proposed representations cover both contexts, as well as offer a fair trade-off for different query needs, while at the same time ensuring that our proposals can be implemented in a real-world scenario, for which we evaluate the performance of our representations with realistic query cases over real datasets.\footnote{When needed, the real data was augmented or mixed with synthetic information.}
Furthermore, as a proof of the practicability of our approach, we have developed an end-user application that is based on our representations, and allows a user to perform queries through a \gls{gis} web interface. Unlike traditional \gls{gis} interfaces, ours is focused on offering convenient methods to query historical user trips by the network elements instead of purely spatial relationships, and achieves far superior performance when compared to systems backed by traditional database systems.
In conclusion, we present an end-to-end platform around representations based on compact data structures to process, store, query and visualize mined public transportation usage data. To the best of our knowledge, this is the first work to accomplish building such integrated platform, although other works exist that contemplate the use of compact data structures for trajectories or moving objects (see Section~\ref{sec:ti}).
\section{Outline}
The rest of this thesis is structured as follows: In the following Chapter~\ref{ch:state}, we review the literature on existing methods for trajectory extraction and indexing. The former is interesting to our work as it studies different approaches to obtain the trips over public transportation systems, which our work focuses on, while the latter discusses alternatives for supporting some of the traditional queries over trajectories, that are often based on secondary memory.
After that, our contributions are grouped into two parts:
\begin{itemize}
\item In \textbf{Part~\ref{part:cds}}, we propose several representations based on compact data structures to efficiently handle the analysis of trips for both transportation contexts discussed in Section~\ref{sec:pd}. It is divided in three chapters:
\begin{itemize}
\item In Chapter~\ref{ch:cds:prev}, we describe the underlying structures that are used by our representations, with a brief description of the memory requirements of each one of them, as well as the temporal complexities of their main algorithms.
\item Chapter~\ref{sec:ctr} is devoted to \gls{ctr}, our representation for trajectories in the context of urban streets. We use separate structures for the spatial and temporal representations, and combine them so that they can be used to solve spatio-temporal queries.
\item In Chapter~\ref{sec:newctr}, we discuss the problems that are specific for trips over public transportation networks, and provide two alternative representations, \gls{ttctr} and \gls{xctr}, based on a common model, along with an additional complementary structure that can be used to accelerate some of the aggregation queries.
\end{itemize}
\item In \textbf{Part~\ref{part:gis}}, we present an interface designed to aid on the decision making process of a public transportation company, that relies on the representations from Part~\ref{part:cds}. This part consists of two chapters:
\begin{itemize}
\item Chapter~\ref{ch:gis:prev} introduces the reader to some of the basic concepts of \gls{gis}, as well as the technologies used in our developed interface.
\item Chapter~\ref{sec:gis} contains the detailed description of our application, discussing its architecture and our user interface to analyze historical information of user trips over public transportation networks.
\end{itemize}
\end{itemize}
After these parts, \textbf{Part~\ref{part:final}} summarizes our contributions in one single concluding Chapter~\ref{ch:concl}, where we also discuss the future developments planned for the work exposed in this thesis.
Finally, we also include two appendices: Appendix~\ref{ch:appendix-publications} enumerates the relevant research works that have derived from this thesis, while Appendix~\ref{ch:appendix-spanishsummary} provides a summary of the overall thesis in the Spanish language, as required by the current regulations in the PhD program that this thesis is submitted for.
\chapter{State of the art in trajectory extraction and representation}
\label{ch:state}
This chapter is devoted to provide a context to our contributions with a literature review of tangential works. We start by discussing methods of collecting useful trajectory data, followed by a review of mobile objects and models for the trajectories they generate, as well as types of queries that are often found in mobile objects research. Finally, we look into the most relevant works in trajectory indexing, which contrast with our contributions as the latter are based on auto-indexed representations, and are focused on problems specific to the study of transportation demand and travel patterns on transportation networks.
\section{Trajectory extraction}
\label{sec:dm}
In order to make significant conclusions about travel patterns, it is essential to be able to collect a large enough collection of trips so that it would become representative of the overall usage within a time span. For this problem, crowd-sourcing can be a viable approach to record the trajectories of public transportation users, as done in \cite{zimmerman2011field}, where the users were rewarded with real information on the bus location, estimation of arrival time and fullness in exchange of recording the GPS trace on their own bus trips.
Currently there are several known techniques that would allow to collect data regarding users' trips over a public transportation network. Numerous works exist where those trajectories are mined from the transactions of smart cards \cite{bhaskar2015passenger,wang2014aggregated}. This can be complemented with information derived from GPS devices, as shown in \cite{ma2014development}. Alternatively, reliable trajectories may be extracted relying on positioning obtained from cellular networks, as proven by \cite{liu2017exploring}.
Because smart card methods usually provide only information about boarding stops, there are works that study the challenge of inferring alighting stops from passengers \cite{wang2011review}. In addition, the authors of \cite{tao2014exploring} have specifically tackled the challenge of reconstructing full trajectories, accounting for trip-chaining, by using data obtained from smart cards. Furthermore, in \cite{alsger2016validating} it is also proven that not only the alighting stops, but also the (last) destination stop of a trip can be estimated from boarding data gathered by a smart card, within a reasonable accuracy. This is possible because a transportation network user typically makes a return trip at another time of the day, as happens often with people that commute from their homes to work and back. We find these methods particularly interesting, as a significant portion of the query patterns we propose in order to analyze passenger movements rely on knowing about line switches (trip-chaining) and the ultimate destination of a trajectory.
However, there is little research on methods for exploitation of this extracted information, in order to analyze and improve the efficiency of a transportation network, which constitutes the scope of our work. This summarized review proves that, although we were not able to access real data from a public transportation company for this work, such curated information about users' trips can indeed be obtained and used for our proposed representations.
\section{Models of trajectory and types of queries}
A good place to begin searching for practical models of trajectory representation is the vast amount of work on designing data models for moving objects~\cite{DBLP:conf/ssdbm/WolfsonXCJ98,DBLP:conf/icde/SistlaWCD97,DBLP:journals/tods/GutingBEJLSV00,DBLP:conf/chorochronos/GutingBEJLNSV03,DBLP:journals/geoinformatica/Spaccapietra01,DBLP:conf/sigmod/ForlizziGNS00,DBLP:journals/geoinformatica/ErwigGSV99,DBLP:books/mk/GutingS2005}. Basically, a data model for moving objects represents the continuous change of the location of an object over time, this way defining a trajectory.
Handling moving objects can be seen as a big data problem, where solutions may typically differ in the representation of location, contextual or environmental information where the movement takes place, the time dimension (which can be continuous or discrete) and the level of abstraction or granularity on which the trajectories are described~\cite{DBLP:journals/sigspatial/DamianiIGV15}. A common classification of trajectories distinguishes free from network-based trajectories. \textit{Free trajectories} or Euclidean trajectories are a sequence of GPS points represented by an ad-hoc data type of moving points~\cite{DBLP:conf/ssdbm/WolfsonXCJ98,DBLP:conf/icde/SistlaWCD97,DBLP:journals/tods/GutingBEJLSV00}. \textit{Network-constrained} trajectories are a temporal-ordered sequence of locations on networks. This trajectory model includes a data type for representing networks and for representing the relative location of static and moving points on the network~\cite{guting2006modeling}. When we want to represent trajectories over an urban street network, it is often useful to deal with \textit{\mbox{network-matched} trajectories}, as this will allow to represent large collections of trajectories more effectively. The process of obtaining these \mbox{network-matched} trajectories from GPS points is called \textit{\mbox{map-matching}} \cite{brakatsoulas2005map}. This process can also be done online, as recently shown in~\cite{DBLP:journals/tits/Ding0GL15}, where all the processing is done in the server, eliminating the need for any map or network representation at the moving object side.
The definition of trajectories at an abstract level must be materialized in an internal representation with access methods for query processing. An early and broad classification of spatio-temporal queries for \textit{historical positions} of moving objects \cite{DBLP:conf/vldb/PfoserJT00} identifies coordinate- and \mbox{trajectory-based} queries. \mbox{Coordinate-based} queries include the well-known {\it time-slice}, {\it time-interval} and \textit{nearest-neighbor queries}. Typical examples are \textit{``find objects or trajectories in a region at a particular time instant or during some time interval''} or also \textit{``find the k-closest objects with respect to a given point at a given time instant''}. \mbox{Trajectory-based} queries involve topology of trajectories (e.g., overlap and disjoint) and related information (e.g., speed, area, and heading) that can be derived from the combination of time and space. An example of such queries would be \textit{``find objects or trajectories that satisfy a spatial predicate (eg., leave or enter a region) at a particular time instant or time interval''}. There also exist combined queries addressing information of particular objects: \textit{``Where was object X at a particular time instant or time interval?''}. In all previous queries, the results are the individual trajectories that satisfy the query constraints.
%When dealing with large datasets of trajectories, and due to privacy issues, the individual trajectory cannot be revealed, then anonymized and aggregated trajectories are of concern.
In many applications, aggregated trajectories must be analyzed in the collection, as an individual trajectory does not represent any representative information.
In this context, we can further distinguish range- from trajectory-based queries. Range queries impose constraints in terms of a spatial location and temporal interval. Examples of these queries are \textit{``to retrieve the number of distinct trajectories that intersect a spatial region or spatial location (stop) at a given time instant or time interval''}, \textit{``retrieve the number of distinct trajectories that start at a particular location (stop) or in a region and/or end in another particular location of region''}, \textit{``retrieve the number of trajectories that follow a path''},
and \textit{``retrieve the top-k locations (stops) or regions with the larger number of trajectories that intersect at a given time instant or time interval''}. Trajectory-based queries require not only to use the spatio-temporal points of trajectories but also the sequence of these points. Examples of such queries are to \textit{``find the number of trajectories that are heading (not necessarily ending at) to a spatial location during a time interval''}, \textit{``find the destination of trajectories that are passing through a region during a time interval''} or also \textit{``find the number of starting locations of trajectories that go or pass through a region during a time interval''}.
\section{Trajectory indexing}
\label{sec:ti}
Literature on spatial trajectory indexing can be categorized by the nature of the trajectories: they can be either constrained to a network or in free space (often called moving objects). While there are well-known queries for indexes that work on moving objects in free space \cite{DBLP:conf/vldb/PfoserJT00}, the \mbox{network-constrained} trajectory indexes cover more diverse querying needs, as different networks involve different kinds of challenges (as in vehicular road network vs public transportation network), and also because the intended application may vary (analyzing trip-chaining patterns vs number of passengers within a time window). A comprehensive review on indexing methods can be found in \cite[Chapter 4]{DBLP:books/sp/PelekisT14}, to which we shall expand in the rest of this section to mention the most remarkable indexes and some new developments.
\subsection{Free trajectory indexing}
Several adaptations of the {\em R-Tree} \cite{DBLP:conf/sigmod/Guttman84} are widely used for the indexing of moving objects. The most common approach is to integrate the temporal dimension in the R-Tree itself, as found in the {\em STR-Tree} and the {\em TB-Tree} from \cite{DBLP:conf/vldb/PfoserJT00}.
Another common approach is to complement the R-Tree with another similar structure, as has been done for the {\em MV3R-tree}~\cite{DBLP:conf/vldb/PapadiasT01},
where an Historical {\em R-Tree}~\cite{nascimento1998towards} was used to partition on the temporal dimension.
Generally, even for very large collections of trajectories, the spatial dimensions are more bounded than the temporal dimension, which can grow indefinitely. For this reason, even for free trajectories, temporal dimension is generally more selective than spatial dimensions. This observation was heavily exploited in the subsequent works, such as {\em SETI} from \cite{chakka2003indexing}, where the space is partitioned statically, while trajectories are indexed by their temporal dimension using a one dimensional R-Tree, allowing it to grow dynamically.
Recently a framework based on Apache Spark was developed called {\em UlTraMan}~ \cite{ding2018ultraman}, that supports different kinds of partitioning schemes for large collections of trajectories to answer range, kNN, or aggregation queries, allowing to repartition the dataset to maximize the query efficiency for a given query type. Although UlTraMan has been tested over \mbox{network-constrained} trajectories, it does not appear to be exploiting network information in any way, hence its inclusion in this category.
In the field of compact data structures, an index called {\em GraCT}~\cite{brisaboa2019gract} has been developed. It is based on representing snapshots at regular time intervals using {\em \mbox{k$^2$-Trees}}~\cite{brisaboa2009k} and keeping movement logs for individual trajectories. Such movement logs are grammar compressed by using {\em RePair}~\cite{larsson2000off}. Because of this, GraCT is a self-indexed compact representation that supports spatio-temporal range and nearest-neighbor queries, as well as allowing for direct access to the trajectory information.
\subsection{Network-constrained trajectory indexing}
There are also numerous approaches that use \mbox{R-Tree-based} indexes for trajectories that are constrained to an underlying network, aiming to decrease the redundancy in the representation by separating the representation into levels. Examples of such indexes include the {\em FNR-Tree}~\cite{DBLP:conf/ssd/Frentzos03}, where the network elements are indexed with a 2D R-Tree. Every network element at a leaf node of this R-Tree points to a 1D R-Tree that is used to index the start and end of the time intervals at which moving objects pass through that edge of the network. As such, the FNR-Tree is only capable of recording simple movements where the edges are assumed to be traversed at constant speed. These limitations are addressed in \cite{DBLP:journals/geoinformatica/AlmeidaG05}, where the authors propose the {\em MON-Tree}, where the moving objects were indexed in two dimensional R-Trees (the dimensions being edge position vs time). More recent alternatives, such as the {\em TMN-Tree}~\cite{chang2010tmn}, integrate {\em B$^+$-Trees}, which have proven to be more space and time efficient for indexing the temporal dimension. Alternatively, in \cite{rivera2018faster} a compact representation of time intervals is proposed using two bitvectors, that can be combined with those \mbox{R-Tree-based} indexes to increase the efficiency of the temporal filtering. Refer to \cite{john2017performance} for a quick comparison of these \mbox{R-Tree-based} indexes.
As a competitive alternative to these \mbox{R-Tree-based} indexes, {\em PARINET}~\cite{DBLP:journals/vldb/PopaZOBV11} builds spatial partitions from the trajectories based on their distribution and network topology, and uses a B$^+$-Tree to index the trajectories in each partition by time intervals. Although candidate trajectories must be filtered in memory during queries, PARINET is highly efficient in practice as its partitioning strategy minimizes the number of disk accesses needed.
%The same ideas were used in {\em TRIFL}~\cite{that2015trifl}, where the cost model is adapted for flash storage.
Another relevant representation is described in \cite{DBLP:conf/gis/KroghPTT14}. Their proposed index, {\em NETTRA}, was designed to efficiently solve a specific kind of queries called {\em Strict Path Queries (SPQ)}, built on a traditional RDBMS with $B^+$-Tree indexes. Another distinctive feature of NETTRA is its treatment of \mbox{network-constrained} trajectories as textual information, which allows to apply string matching techniques such as fingerprinting functions to determine what trajectories have similar paths on their traversed edges. This close equivalence between trajectories and strings has been further exploited by \textit{Geodabs}~\cite{chapuis2018geodabs}, where both the spatial distribution and sequence information are taken into account for finding trajectories by similarity with fingerprinting. These text-based approaches are sometimes tangled with works on the topic of semantic trajectories such as \cite{al2017semantictraj}.
%For vehicles we also have this \cite{cai2018vector} and \cite{lovell2018lossless}.
A recent compact data structure named {\em CiNCT} has been proposed in \cite{koide2018cinct}, where trajectories are encoded into a string, that is used to build an FM-index \cite{DBLP:conf/focs/FerraginaM00} with a Huffman-shaped Wavelet Tree \cite{ferragina2009compressed}. To further save space, the string is constructed with relative movement labels instead of absolute edges, with an auxiliary structure that represents a network graph built from the input trajectories themselves. While this structure excels at pattern-matching and path extraction in vehicular networks (such as the streets of a city), it cannot be applied to our problems in public transportation networks, where network demand and aggregated travel patterns have to be analyzed. Note that finding out which trajectories traversed on a specific path of sequential street segments does not provide us much relevant information for our needs.
\part{Compact representation for trajectories over transportation networks} \label{part:cds}
\chapter{Previous concepts}
\label{ch:cds:prev}
This chapter presents a brief discussion of the compact data structures used in our proposed representations, and it gives a background of the underlying concepts that we will be working with for the rest of this part. For a more in-depth guide on compact data structures, the reader may refer to \cite{Nav16}.
%, where they are introduced from basic information theory concepts such as entropy.
A reader who is already familiar with compact data structures is still encouraged to read Section~\ref{sec:sat}, where the \gls{sat} is introduced, which is initially unrelated to compact data structures, and also Section~\ref{sec:wt}, where we discuss some less known variants and operations of the \gls{wt}, that we make use of in our contributions.
\section{Summed Area Table}
\label{sec:sat}
The {\em \acrfull{sat}} was first introduced in computer graphics \cite{crow1984summed} to speed up the mipmapping process, where given a texture image represented as a series of bidimensional matrices of numbers (usually three matrices of integers, one for each color channel) we are interested in finding the average color of any arbitrary rectangle within the image. This operation is most often used to reduce the rendering time for distant polygons where a pixel on the target screen may correspond to several texture pixels (texels), and also for anisotropic filtering, in order to improve the visual quality of polygons that are projected in an oblique angle.
With the most direct representation of a matrix as an array of values, we are required to compute the average value of a rectangle $[(a,b),(a+h,b+w)]$ as:
\[
\overline{M}[a..a+h,b..b+w]\leftarrow
\frac{\displaystyle\sum^h_{i=0}\displaystyle\sum^w_{j=0}M[i+a,j+b]}
{(h+1)(w+1)}
\]
Note that the summation has a time complexity of $O(hw)$, which would make this operation quite expensive for \mbox{real-time} rendering applications. In order to decrease the complexity of these calculations, the \gls{sat} precomputes the summations of $M$ in a matrix $I$, of the same size, where $I[a,b]\leftarrow\displaystyle\sum^a_{i=0}\displaystyle\sum^b_{j=0}M[i,j]$. That is, each cell of $I$ is the sum of the values within the rectangle spanning from the origin of the matrix to the position of the cell. An example of a \gls{sat} is shown in Figure~\ref{fig:sat}.
\begin{figure}[ht]
\begin{center}
{\includegraphics[width=0.8\textwidth]{figures/example_sat.eps}}
\end{center}
\caption{An example of a \acrlong{sat} (right) built over a matrix (left).}
\label{fig:sat}
\end{figure}
Having $I$, we can compute the average value of a rectangle in $O(1)$ operations as:
\[
\overline{M}[a..a+h,b..b+w]\leftarrow
\frac{I[a+h,b+w] - I[a+h,b-1] - I[a-1,b+w] + I[a-1,b-1]}
{(h+1)(w+1)}
\]
In our example from Figure~\ref{fig:sat}, to calculate the sum of the delimited $3x3$ region in the matrix $M$, we simply operate over the four terms in \textbf{bold} from $I$, obtaining
$\displaystyle\sum^4_{i=2}\displaystyle\sum^4_{j=2}M[i,j]~=~I[4,4]-I[4,1]-I[1,4]+I[1,1]~=~84-27-18+8~=~47$.
While with this representation we do not need to keep the original matrix,\footnote{As accessing a single cell can be viewed as a special case of the computation of an average where $h=0,w=0$.} the improved computational efficiency comes at the expense of having to represent larger numbers than the original values.
\section{Entropy coding}
Given an information source (such as a text) that provides symbols from an alphabet $[1..\sigma]$ with a probability of $0 \leq p_i \leq 1$ for each symbol, where $\displaystyle\sum_ip_i=1$, the goal of an entropy coder is to exploit these probabilities in order to achieve compression by assigning shorter codes to the most frequent symbols, and longer codes to the less frequent ones. In the work that is considered as the foundation of information theory \cite{shannon1948mathematical}, Claude Shannon defined a concept called \textit{entropy}, which is closely related to the probabilities we are discussing, and used it to prove that when encoding the symbols in binary, the optimal length in bits for each code is of $l_i = \frac{1}{\log_2p_i} = -\log_2p_i$ bits, and the entropy of an information source $S$ is calculated as $H_0(S) = -\displaystyle\sum^\sigma_i p_il_i = -\displaystyle\sum^\sigma_i p_i\log_2p_i$.
For any compression technique relying on using variable-length codes, it is necessary for the codes to be unambiguous: there can be no two codes $C_i, C_j$ that, when concatenated, could be interpreted as another code
% esto esta mal porque no contempla el caso de concatenar tres o mas codigos ambiguos
%($C_iC_j \neq C_k, \forall i,j,k \in [1..\sigma]$)
. Additionally, it is computationally very useful for those codes to be also prefix-free, meaning that there can be no code that is the prefix part of another code ($C_i \neq C_j\{0|1\}^a,~\forall i,j \in [1..\sigma], i\neq j, a \in [0..\infty)$). This will allow us to unambiguously interpret the symbol right after the bits of $C_i$, without having to determine if it could be the prefix part of some other longer code $C_j$.
The Huffman coding, introduced in \cite{huffman1952method}, is a coding algorithm that produces optimal\footnote{That is, with lengths as close to $-\log_2p_i$ as possible with a whole number of bits per symbol.} prefix-free codes based on the frequencies of each symbol. The ideas of the Huffman coding have been widely implemented in numerous compression algorithms and codecs since its inception, where the most notable examples are DEFLATE (PKZIP, GZIP), JPEG and MPEG.
One less known variation of the Huffman codes are the Hu-Tucker codes \cite{hu1971optimal}, which aim to provide codes that preserve the same lexical order as the original symbols, meaning that for any two symbols $s_i,s_j$, it holds that $s_i \prec s_j \iff C_i \prec C_j$. This comes at the expense of at most one extra bit per code over Huffman on average.\footnote{Being $L_h$ and $L_{ht}$ the average codeword
length of Huffman coding and Hu-Tucker codes for $S$ respectively, it holds: $H_0(S) \leq L_h \leq H_0(S)+1$ and $H_0(S) \leq L_{ht} \leq H_0+2(S)$
(see \cite{Cover:2006:EIT:1146355} (pages 122-123), or \cite{HORIBE1977148, GilbertandMore1959}).}
\section{Bitvectors}
\label{sec:bit}
A vast amount of works in compact data structures involves the use of bitvectors, both compressed and uncompressed. A bitvector $B[1..n]$ is a sequence of $n$ bits, for which the following operations are expected to be supported:
\begin{itemize}
\item $\rank_1(B,i)$ is the number of set bits in $B[1..i]$. Alternatively, $\rank_0(B,i)\leftarrow i - \rank_1(B,i)$. Consequently, it also holds that the bit from the position $i$ can be retrieved as $B[i] = \rank_1(B,i) - \rank_1(B,i-1)$, with a special case of $\rank_1(B,0) = 0$.
\item $\select_1(B,i)$ is the position in $[1..n]$ where the $i$-th 1 occurs. Therefore, $\rank_1(B,\select_1(B,i)) = i$. An equivalent version for 0 bits may be defined as $\select_0(B,i)$, although, unlike $\rank_0$, there does not exist a direct way of obtaining it from the previously defined operations.
\end{itemize}
\begin{example}
Given a bitvector $B = 011001$, it holds that $\rank_1(B,1) = 0$, $\rank_1(B,2) = 1$, $\rank_1(B,3) = 2, \rank_1(B,4) = 2$. Furthermore, $B[3] = \rank_1(B,3) - \rank_1(B,2) = 1$ and $\rank_0(B,3) = 3 - \rank_1(B,3) = 1$.
We can also say that $\select_1(B,2) = 3$ and $\select_1(B,3) = 6$, thus holding that $\rank_1(B,\select_1(B,2)) = \rank_1(B,3) = 2$ and $\rank_1(B,\select_1(B,3)) = \rank_1(B,6) = 3$. Additionally, $\select_0(B,2) = 4$ as $\rank_0(B,\select_0(B,2)) = \rank_0(B,4) = 2$.
\qed
\end{example}
All these operations can be supported in $O(1)$ time with $o(n)$ extra bits \cite{Jac89,Mun96}.
Additionally, there exist several techniques for compressing these bitvectors based on their statistical properties or the arrangement of the bits. In this work we will use a representation that excels in efficiently solving $\select_1$ operations for sparse bitvectors (with a low number of set bits in proportion to the bitvector size) introduced in \cite{okanohara2007practical}, and also another representation that works particularly well with non uniformly distributed bitvectors, thus exploiting their entropy to require only $nH_0(B) + o(n)$ bits\footnote{Being $n_1$ the number of set bits in $B$ and $n_0 = n - n_1$, then $H_0(B) = \dfrac{n_1\log\dfrac{n}{n_1} + n_0\log\dfrac{n}{n_0}}{n} \leq 1$.} \cite{Raman:2002:SID:545381.545411}.
\section{Wavelet Tree and Wavelet Matrix}
\label{sec:wt}
When working with a sequence $S[1..n]$ built over an alphabet $[1..\sigma]$, we can extend the definitions of $\rank$ and $\select$ from Section~\ref{sec:bit} to work over any symbol $a \in [1..\sigma]$ instead of bits, resulting in the following operations:
\begin{itemize}
\item $\rank_a(S,i)$ gives the number of occurrences of the symbol $a$ in $S[1..i]$.
\item $\select_a(S,i)$ gives the position in $[1..n]$ where the $i$-th $a$ occurs.
\end{itemize}
The \acrfull{wt} \cite{WT03} represents $S$ with a balanced binary tree, with $\sigma$ leaves, one for each symbol. Every internal node $v$ represents the range of symbols $[\alpha_v..\omega_v] \subseteq [1..\sigma]$ in the same order as they appear in $S$, where the root node corresponds to all the $n$ symbols from the alphabet $[1..\sigma]$ appearing in $S$. Its left child $v_l$ will only represent the $n_l$ symbols that fall in the range $[1..\dfrac{\sigma}{2})$, preserving the order that they had in $v_0$, while the right child will represent the other $n_r$ symbols from $[\dfrac{\sigma}{2}..\sigma]$, holding that $n_l+n_r=n$. This is built recursively until a leaf $v_a$ is reached, with will correspond to only one symbol $a \in [1..\sigma]$, with $n_a = \rank_a(S,n)$ times. An example \gls{wt} can be found in Figure~\ref{fig:wt}.
\begin{figure}[ht]
\begin{center}
{\includegraphics[width=0.65\textwidth]{figures/wt1.eps}}
\end{center}
\caption{An example \acrlong{wt} of four levels.}
\label{fig:wt}
\end{figure}
For each node $v$, these symbols from the alphabet $[\alpha_v..\omega_v]$ are represented implicitly using a bitvector $B_v[1..n_v]$, where $B[i] = 0$ when the symbol $a$ represented in the position $i$ should belong to the left child ($a < \dfrac{\alpha_v + \omega_v}{2}$), while $B[i] = 1$ means it should belong to the right child ($a \geq \dfrac{\alpha_v + \omega_v}{2}$). By making use of $\rank$ and $\select$ operations over these bitvectors, we are able to recover the value of any $S[i]$, and also to answer $\rank_a$ and $\select_a$ for any $a \in [1..\sigma]$ in $O(\log\sigma)$ without the need of storing the original sequence. As there are $n$ bits per level and $\log\sigma$ internal levels (the nodes from the leaf level do not have bitvectors), we could represent all the bitvectors in $n\log\sigma$ bits, which is equivalent to the space it would take to represent $S$ as an array of fixed length integers.\footnote{When $\sigma$ is not a power of two, $\log\sigma$ would need to be rounded up for fixed length integers, while for the \gls{wt} there will be some leaves at level $h-1$, being $h$ the height of the tree, and its total size will depend on the frequency of those symbols ($\lfloor n\log\sigma\rfloor < \sum_i n_i < \lceil n\log\sigma\rceil$).} However, we also need some auxiliary structures for $\rank_1$ and $\select_1$, which will require $o(n\log\sigma)$ bits and also store a pointer for every one of the $2\sigma-1$ nodes. Therefore, the total size of this representation amounts to $n\log\sigma + o(n\log\sigma) + O(\sigma\log n)$.
In order to access the value of $S[i]$, we must start by looking into $B_v[i]$ from the root node. If $B_v[i]=0$, we traverse to the position $B_{v_0}[\rank_0(B_v,i)]$ of the bitvector of the left child. Conversely, when $B_v[i]=1$, we traverse to the bitvector of the right child at $B_{v_1}[\rank_1(B_v,i)]$. In both cases, we recurse until a leaf is reached, which will unambiguously correspond to the symbol $a$, thus we determine that $S[i]=a$.
\medskip
\begin{example} \label{exp:wtaccess}
In the \gls{wt} from Figure~\ref{fig:wt}, if we wanted to retrieve the value of $S[2]$, we would start by looking into the bit $B_v[2]=0$, meaning that the we have to traverse to the node $v_0$ and look into the bit $B_{v_0}[\rank_0(B_v,2)] = B_{v_0}[2] = 1$, leading us to the node $v_{01}$ at $B_{v_{01}}[\rank_1(B_{v_0},2)] = B_{v_{01}}[2] = 0$. The left node is a leaf node belonging to the symbol $2$, so $S[2]=2$ (namely, the first occurrence of $2$ in $S$, as $\rank_0(B_{v_{01}}, 2) = 1$).
\qed
\end{example}
We can also solve $\rank_a$ by traversing the tree from top to bottom with the $\rank$ operation on the bitvectors: knowing that the binary representation (and thus also the path in the \gls{wt}) of a symbol will allow us to use $\rank_1$ or $\rank_0$ in each level $i$ according to the $i$-th bit of our code. In the Example~\ref{exp:wtaccess}, knowing that the binary representation of $2$ is $010$, we can calculate $\rank_2(S,2)$ (or any other position) following the same order of operations: a $\rank_0$, a $\rank_1$ and one final $\rank_0$ to determine the position of the leaf node, which will give away $\rank_2$. It is also possible to calculate $\select_a(S,i)$ with a bottom-up traversal of the tree, by starting at the $i$-th position from the leaf belonging to $a$, and applying $\select_0$ and $\select_1$ on the bitvectors of the parent nodes, following the reversed binary code of $a$. Eventually, a position $S[j]=a$ will be reached such that it will contain the $i$-th occurrence of $a$ in $S$, consequently obtaining that $j=\select(S,i)$.
A more complex operation that we have found very useful in our work is the operation $\cnt_{a,b}(S,i,j)$, first described in \cite{gagie2012new}, which counts the number of occurrences of the symbols in $[a..b]$ within $S[i..j]$. While it is equivalent to $\displaystyle\sum^b_{k=a}\rank_k(S,j) - \rank_k(S,i)$, it can be solved more efficiently by doing two simultaneous top-down traversals, as described for $\rank_a$. Starting at the root node $v$ we calculate $\rank_0(B_v,i)$ and $\rank_0(B_v,j)$ to find out the limits in the left node for the codes within $[a..b]$ that start with 0, and also $\rank_1(B_v,i)$ and $\rank_1(B_v,j)$ for those codes starting by 1. If all the codes in $[a..b]$ start with either a zero or a one, we will only compute the corresponding $\rank$. After that, we recurse for each node, where on the left we set $i\leftarrow\rank_0(B_v,i)$, $j\leftarrow\rank_0(B_v,j)$ and we only consider the range of symbols $[a..b']$ that can be represented by that node ($b' \leq \omega_{v_0})$). The same is true for the recursion on the right, where $i\leftarrow\rank_1(B_v,i)$, $j\leftarrow\rank_1(B_v,j)$ and $\alpha_{v_1} \leq a'$. Therefore, we compute recursively:
\begin{align*}
\cnt_{a,b}(v,i,j) = &\cnt_{a,b'}(v_0,\rank_0(B_v,i),\rank_0(B_v,j))\\
&+~\cnt_{a',b}(v_1,\rank_1(B_v,i),\rank_1(B_v,j))
\end{align*}
The recursion on each node $v$ stops when $a \leq \alpha_v$ and $\omega_v \leq b$, in which case the count for that node is reported as $j-i+1$. Finally, all the counts are summed and the total amount of occurrences is obtained. While it may look as if the worst case would imply traversing all $\sigma$ internal nodes, it is avoided by stopping the recursion when $a \leq \alpha_v$ and $\omega_v \leq b$, thus requiring to visit only $O(\log\sigma)$ nodes.\footnote{In fact, the best case is $cnt_{1,\sigma}(S,i,j) = j-i+1$, which is solved without traversing at all.} In the \gls{wt} from Figure~\ref{fig:wt}, we have marked in red the ranges considered for solving $\cnt_{3,3}(S,5,10)$. Additionally, when the subsequence $S[i..j]$ is sorted, we can define $[l..r]\leftarrow\cnt^{LR}_{a,b}(S,i,j)$, which returns the upper and lower limits $S[l..r]$ of the occurrences of the symbols $[a..b]$. Naturally, $S[l..r] \subseteq S[i..j]$.
\subsection{Hu-Tucker Wavelet Tree}
A straightforward way of reducing the size of a \gls{wt} is to use compressed bitvectors, as discussed in \cite{CNspire08.1}, allowing to represent a \gls{wt} in $nH_0(S) + o(n\log\sigma) + O(\sigma\log n)$ bits \cite{WT03}. There is, however, a different approach to achieve similar space requirements is to use a prefix-free variable-length encoding for the symbols.
For example, Huffman code \cite{huffman1952method} can be used to build a
Huffman-Shaped \gls{wt} \cite{ferragina2009compressed}, where the tree is not balanced
anymore, as the level of each leaf $v_a$ will be the number of bits for the Huffman code of $a$, which will depend on the frequency of $a$ in $S$. The size reduces to $n(H_0(S)+1)+ o(n(H_0(S)+1)) + O(\sigma \log n)$,\footnote{$O(\sigma \log n)$ term
includes both the tree pointers and the size of the Huffman model.} while average time becomes
$O(H_0(S))$ for $\rank_a$ and $\select_a$ (the worst-case time is still $O(\log\sigma)$
\cite{Barbay:2013:CPA:2562345.2562626}). By using compressed
bitvectors \cite{CNspire08.1} space can be even further reduced to $nH_0(S) + o(n(H_0(S)+1)) + O(\sigma \log n)$.
Unfortunately, the Huffman codes (including \textit{canonical Huffman}) assigned to
lexicographically adjacent symbols do not maintain that lexicographic order, and it is not possible to have a $O(\log\sigma)$ bound for $\cnt_{a,b}(S,i,j)$ anymore.
In order to support $\cnt_{a,b}(S,i,j)$ more efficiently, Hu-Tucker codes \cite{hu1971optimal} can be used instead. While the compression achieved by a \gls{htwt} \cite{barbay2009compressed} degreades slightly with respect to using Huffman coding, yielding a bound of $n(H_0(S)+2) + o(n(H_0(S)+1)) + O(\sigma \log n)$,\footnote{This can be reduced to $nH_0(S) + o(n(H_0(S)+1)) + O(\sigma \log n)$ by using compressed bitvectors as well.} the codes for adjacent symbols are lexicographically contiguous. Therefore, we can guarantee a bound of $O(\log\sigma)$ for $\cnt_{a,b}(S,i,j)$ again. An example of a \gls{htwt} in practice can be found in Figure~\ref{fig:wtwm}.
\subsection{Wavelet Matrix}
\label{sec:wm}
For large alphabets, the size of the \gls{wt} is affected by the term $ O(\sigma \log n)$. A {pointerless}
\gls{wt} \cite{CNspire08.1} permits to remove\footnote{In a pointerless Huffman-shaped \gls{wt} a
term $O(\sigma \log\log n)$ still remains due to the need for storing the canonical Huffman model.}
that term by concatenating all the bitvectors level-wise
and computing the values of the pointers during the \gls{wt} traversals.
The operations on a pointerless \gls{wt} have the same time complexity but become slower in practice.
By reorganizing the nodes in each level of a pointerless \gls{wt}, the \acrfull{wm} \cite{CNO15} obtains the same space requirements, yet its performance is very close
to that of the regular \gls{wt} with pointers. Figure~\ref{fig:wm} contains an example of a \gls{wm}, representing the same sequence as in Figure~\ref{fig:wt}.
\begin{figure}[ht]
\begin{center}
{\includegraphics[width=0.65\textwidth]{figures/wm1.eps}}
\end{center}
\caption{An example of a \acrlong{wm} with three levels. A conceptual fourth level was omitted since it does not contain a bitvector.}
\label{fig:wm}
\end{figure}
As in the \gls{wt}, the $i$-th level stores the $i$-th bits of the encoded symbols.
A single bitvector $B_i$ is kept for each level. In the first level, $B_1$ stores the
$1$-st bit of the encoding of the symbols in the order of the original sequence $S$.
From there on, at level $i$, symbols are reordered according to the $(i-1)$-th bit of their encoding;
that is, according to the bit they had in the previous level.
The symbols whose encoding had a zero at position $i-1$ must be arranged before those that
had a one. After that, the relative order from the previous level is maintained. That is, if
a symbol $a$ occurred before some other symbol $b$, and the $(i-1)$-th bit of their encoding
coincides, then $a$ will precede $b$ at level $i$. %Note that, following this rule,
%the symbols in the last level shall be ordered by their reversed bit encoding.
If we simply keep track of the number of zeros at each level $z_l\leftarrow \rank_0(B_l,n)$, we can easily see that the symbol with the $k$-th zero
at level $i-1$ is mapped at position $k$ within $B_i$, whereas the symbol with the $j$-th one at level $i-1$ is
mapped at position $z_l +j$ within $B_i$. This avoids the need for pointers, enabling to retain
the same time complexity of the \gls{wt} operations, including $\cnt_{a,b}(S,i,j)$. For
implementation details see \cite{CNO15,ordonez2015statistical}.
\medskip
\begin{example}
To find out the symbol at $S[8]$, we start by observing that $B_1[8]=0$ and $\rank_0(B_1,8)=5$. We move to the next level where we check position $5$; we see that $B_2[5]=1$ and $\rank_1(B_2,5)=3$. We move to next level and check position $3+z_2 = 3+5 = 8$,
where we finally see $B_3[8]=1$. Therefore, we have decoded the bits $\mathbf{011}$ that correspond to the symbol $S[8] = 3$.
\qed
\end{example}
To reduce the space needs of the \gls{wm} we could use compressed bitvectors as for the \gls{wt}. Yet,
compressing the \gls{wm} by giving it either a Huffman or Hu-Tucker shape is not possible as the reordering of the \gls{wm} would lead to the existence of %holes in the structure
gaps in the bitvectors
that would ruin the process of
tracking symbols during traversals. To overcome this issue, an optimal Huffman-based coding was
specifically developed for wavelet matrices \cite{CNO15, Farina2016}. This allows to obtain space
similar to that of a pointerless Huffman-shaped \gls{wt} but with faster $\rank_a$ and $\select_a$ operations.
Unfortunately, since the encodings of consecutive symbols do not keep the same order, $\cnt_{a,b}(S,i,j)$ is no longer supported in $O(\log\sigma)$ time.
%and computing $\rank_a(S,j)-rank_c(S,i)+1$ is required for each $c$ in $[\alpha,\beta]$.
\section{Compressed Suffix Array}
\label{sec:csa}
Given a sequence $S[1..n]$\footnote{For convention, we establish that $S[n]$ must contain a terminator symbol $\$$ that must be lexicographically smaller than any of the other symbols in $S[1..n-1]$.} built over an alphabet $\Sigma$ of length
$\sigma$, the {\em suffix array} $A[1..n]$ is built over $S$ \cite{MM93}
as a permutation of the positions $i \in [1..n]$ of all the suffixes $S[i..n]$, so that
$S[A[i]..n] \prec S[A[i+1]..n]$ for all $1 \le i < n$
%, being $\prec$ the lexicographic ordering.
Because $A$ contains all the suffixes of $S$ in lexicographic order,
we can use this structure to search for any pattern $P[1..m]$ in time
$O(m \log n)$ by simply performing binary searches for the range $A[l..r]$ that
contains pointers to all the positions in $S$ where $P$ occurs.
%For the construction of the $A$ a linear time construction
%algorithm such as the SA-IS~\cite{nong2011two} could be used.
%which is based on induction sorting.
We can find an example of an suffix array $A$ in Figure~\ref{fig:csa}. Any pattern $P[1..m]$ (such as $ana$) can be delimited by a range $A[l..r]$ (for ``$ana$'' it is $A[3..4]$), where the limits $l$ and $r$ can be found with binary searches due to the suffixes being sorted.
\begin{figure}[ht]
\begin{center}
{\includegraphics[width=0.5\textwidth]{figures/csa.eps}}
\end{center}
\caption{All the structures involved in constructing a \acrlong{csa} over the sequence $S = banana\$$. Note that $S$ and $A$ do not need to be stored.}
\label{fig:csa}
\end{figure}
A straightforward enhancement to avoid storing the original string $S$ is to set up a vocabulary array $V[1..\sigma']$, with all the different symbols from $\Sigma$ appearing in $S$,\footnote{Note that $\sigma' \leq \sigma)$ since some of the symbols from $\Sigma$ may never occur in $S$.} and a bitvector $D[1..n]$ aligned to $A$ so that $D[1]=1$ and $D[i]=1~\iff~S[A[i-1]] \neq S[A[i]]$ for all the other $i \in [2..n]$. This means that $D$ marks with a $1$ the beginning of a range of suffixes pointed from $A$ such that the first symbol of those suffixes coincides. With $D$, keeping $S$ is no longer needed since $S[A[i]] = V[\rank_1(D,i)]$.
We can also replace $A$ as described in Sadakane's \acrfull{csa} \cite{Sad03}, using
another permutation $\Psi[1..n]$ defined in \cite{GV00}, where $\Psi[i] = A^{-1}[A[i]+1]$. That is, for every $A[i]=k$ and $A[j]=k+1$, $\Psi[i]=j$.
%For each position $j$ in $S$ pointed from $A[i]=j$, $\Psi[i]$ gives the position $z$ such that $A[z]$ points to $j+1$.
The special case when $A[i]=n$\footnote{$i=1$ if we use the unique terminator $\$$.} is handled as $\Psi[i]=A^{-1}[1]$, making a cycle.
Therefore, the \gls{csa} is formed by $\Psi$, $D$, and $V$, which are sufficient to simulate binary searches of the interval $A[l..r]$ for the occurrences of $P$ without the need of accessing $A$ nor $S$. In this work, we define that operation as $[l..r]\leftarrow\bsearch(\Psi,P)$.
The symbol $S[A[i]]$ pointed by $A[i]$ can be obtained as
$V[\rank_1(D,i)]$. We can also easily obtain the following symbol from the source sequence $S[A[i]+1]$ as
$V[\rank_1(D, \Psi[i])]$, $S[A[i]+2]$ can be obtained as $V[\rank_1(D, \Psi[\Psi[i]])]$, and so on.
%Therefore, the \gls{csa} replaces $S$, and it does not need $A$ anymore to perform searches.
Although an uncompressed $\Psi$ would have the same space requirements
as $A$, it is highly compressible, since it is formed by $\sigma$ strictly increasing subsequences. By using $\delta$-codes of the gaps (differences of each value with respect to the previous one) it is possible to compress $\Psi$ to around the zero-order entropy of $S$~\cite{Sad03}, with $nH_0(S)+O(n\log\log\sigma)$ bits.
In \cite{NM07} it has been further proved that $\Psi$ can be split into
at most $nH_k+\sigma^k$ (for any $k$) {\em runs} of consecutive values so
that the differences within those runs are always $1$. This allows for a combination of $\delta$-coding of gaps with run-length
encoding (of $1$-$runs$) to achieve a higher-order compression of $\Psi$ without further intervention.
In addition, to maintain fast random access to $\Psi$, absolute
samples at regular intervals (every $t_{\Psi}$ entries) are kept. Parameter
$t_{\Psi}$ implies a space/time trade-off. Larger values lead to better compression of $\Psi$ but slow down access time to non-sampled $\Psi[i]$ values.
In \cite{FBNCPR12}, the authors have adapted Sadakane's \gls{csa} to deal with large (integer-based) alphabets
and created the {\em integer-based CSA} ($iCSA$). They also showed that, in their scenario (natural language text indexing), the best
compression of $\Psi$ was obtained by combining differential encoding of runs with Huffman and
run-length encoding.
\include{1-ctr}
\include{2-xctr}
\include{3-gis}
\part{Thesis summary} \label{part:final}
\chapter{Conclusions and future works}
\label{ch:concl}
In transportation systems from \textit{smart cities}, new technologies such as traffic monitoring, automatic fare collection (e.g., smartcards) and automatic passenger counting have made possible to generate a huge amount of highly detailed,
real-time data useful to define measures that characterize a transportation network. This data is particularly useful because it actually consists of real trips, combining implicitly the demand of the system with either the street network or the service offered by a public transportation system.
At the same time, analyzing these historic records about the demand of the network is within the interest of transportation companies, as well as other entities that may be interested in studying the movements of crowds in an urban context. To make this kind of analysis possible, we have proposed powerful representations based on compact data structures that can easily be adapted for both urban transportation contexts (streets and most public transportation systems), and can handle diverse use cases with considerable memory and time efficiency.
In addition, we have developed a \gls{gis} application that integrates the compact representations developed, and includes a simple \mbox{web-based} user interface that makes it possible to analyze the stored information by an end user in a simple way, not requiring them to have a previous knowledge of the underlying technologies. Furthermore, this user interface further validates the choice of an approach based on compact data structures, thus proving that it is possible to develop a competitive product that may be adopted by an interested organization.
Finally, in the context of compact data structures, we believe that this research work may open a new field of practical applications for these structures and algorithms, as have been previously done for the fields of \textit{information retrieval} and \textit{bioinformatics}.
\section{Contributions}
The following list summarizes the main contributions in our work:
\begin{enumerate}
\item In the context of trips over urban street networks, we have developed a representation based on a modified \gls{csa} and different alternatives of the \gls{wt} called \gls{ctr}, which requires as little as a 36\% of the size of an uncompressed baseline, while handling spatio-temporal queries in the order of several microseconds, while offering configurable trade-offs.
\item We have proposed an extensible model for the representation of trips over a public transportation network, that may be adapted for most transportation systems around the world.
\item We have developed two alternative representations for the context of trips over public transportation networks, \gls{ttctr} and \gls{xctr}, based on the same structures of \gls{ctr}, which can solve most of our proposed queries about network-aware trip patterns in the order of several mmicroseconds, while needing only about 50\% of space compared to a plain (unindexed) representation.
\item We have presented a scheme to compress \acrlong{sat} without affecting the temporal complexity of its operations, and we applied it in \gls{tm} as a structure to accelerate network load queries in the public transportation context.
\item We have developed a \gls{gis} prototype including an user interface to analyze the demand of public transportation networks based on our proposed representations, as well as on \gls{gis} technologies.
\end{enumerate}
\section{Future work}
While there are many possible future developments for our proposed representations for trips over public transportation networks, we are mostly concerned with finding a single representation that can efficiently handle both kinds of proposed queries in Section~\ref{sec:newctr:desc}: those focusing in the network load and those focusing in the trip pattern. Finding such a representation is rather challenging, as most solutions that allow to efficiently aggregate multidimensional data (such as stops and times or journeys) cannot support most of our trip pattern queries.
\medskip
Regarding out {\em Trippy} prototype, which is still under development, our most urgent goals are to support an intuitive visualization for lines that may enable the user to perform queries over them, as well as the addition of means to query the histogram endpoint, to display the usage of a stop or a line over time with dynamic charts. Finally, this interface may be improved to support different interactive visualizations for the total network usage over all the lines within a time range, in addition to reachability queries, where given a stop in the network we are interested in obtaining the average time taken to arrive to any other stop.
\include{appendix}
% \nocite{*}
% \bibliographystyle{apalike}
\bibliographystyle{alpha}
\addcontentsline{toc}{chapter}{\bibname}
\bibliography{thesis}
\vfill \pagebreak \thispagestyle{empty} \mbox{}
\vfill \pagebreak \mbox{} \thispagestyle{empty}
\end{document}
%###
% List of frogs:
% use greek letters (Lambda, etc) for sets of lines and stops, so they are not confused with query subindices
% theorems and corollaries for part I!
% TTCTRq and XCTRq
% remembah: you don't always have a factor of log(\bar{J^l}) for obtaining jcodes because it would be everywhere, and have little to do with the repr themselves. You did not want to use a bitvector because "space".
% remove leaflet duplicated url, and maybe others
% \em vs \textit
% trip vs trajectory
% consistency in algorithm style
% consider including the new spanish paragraphs into main thesis
% footnote after dot