Interlib is a data processing library for working with interaction data. In their raw form, user interactions tell us little and need to be processed into a workable and descriptive format. Using this library, you can extract session statistics and sequences, and perform various common processing tasks that I have found useful in the past.
Currently, the package is not available on PyPI (it will be once the library has been open sourced alongside some data), so installation means installing directly from GitHub. Please ensure that you have an appropriate Python 3.6+ virtual environment installed and activated before installing this package; I would recommend Anaconda (Miniconda: https://docs.conda.io/en/latest/miniconda.html).
$ pip install -e git+https://github.com/JonoCX/interaction-lib.git#egg=interlib
The library assumes that you have interaction data in a particular format, stored as a JSON file:
{"<user_1>": [
{"id": 1,
"user": "<user_1>",
"timestamp": "datetime.datetime(2019, 8, 5, 16, 26, 36, 940000)",
"action_type": "STORY_NAVIGATION",
"action_name": "NARRATIVE_ELEMENT_CHANGE",
"data": {
"romper_type": "STORY_NAVIGATION",
"romper_name": "NARRATIVE_ELEMENT_CHANGE",
"romper_id": "",
"romper_from_state": "null",
"romper_to_state": "Intro Message",
"current_narrative_element": "null",
"current_representation": ""
}}, {}],
"<user_N>": []}
The data snippet above shows a single event, in parsed format, recorded in an interactive experience. The overall structure is a dictionary that maps each user to a list of their events. If you're working with raw data extracted directly from an experience, it will not be in the above format. As such, the library includes a utility function to convert raw data (from an SQL dump) into a usable format:
from interlib.util import to_dict
user_events = to_dict('path/to/json.json')
The to_dict function has additional parameters that can be passed. For example, if you're dealing with a large amount of data, you may want to split the data into chunks (useful for parallel processing):
user_events = to_dict('path/to/json.json', split = True) # splits into two, or;
user_events = to_dict('path/to/json.json', split = 4) # splits into four
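To make the idea of chunking concrete, here is a minimal sketch in plain Python of splitting a user-to-events dictionary into a fixed number of smaller dictionaries. The helper `split_users` is hypothetical and not part of interlib; how to_dict actually divides the data internally is not documented here and may differ.

```python
def split_users(user_events, n_chunks):
    """Split a user -> events dictionary into n_chunks smaller dictionaries.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    users = list(user_events)
    # Deal users out round-robin so the chunks end up roughly equal in size.
    return [
        {user: user_events[user] for user in users[offset::n_chunks]}
        for offset in range(n_chunks)
    ]


example = {"u1": [], "u2": [], "u3": [], "u4": [], "u5": []}
chunks = split_users(example, 2)
```

Each chunk is itself a user-to-events dictionary, so it can be handed to a separate worker without further reshaping.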
In other cases, you may want to process only a select group of users:
users = {"<user_1>", "<user_2>"} # a set of user id strings
user_events = to_dict('path/to/json.json', users_to_include = users)
A range of other parameters can be set; see the to_dict function for details.
There are a range of statistics that can be extracted using the library:
- Time: hidden time, session length, time to completion, raw session length, reach end;
- Pauses: short (1 to 5 seconds), medium (6 to 15), long (16 to 30), very long (30+), as counts;
- Events: counts and relative frequencies of interaction events;
- Event Frequencies: given a set of time thresholds, the frequency of each event within each threshold window.
Again, the statistics package works under the assumption that you have parsed the raw events into the format described above.
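As an illustration of the pause categories listed above, the sketch below counts the gaps between consecutive event timestamps for a single user. It is an assumption-laden example rather than the library's implementation: the function name `count_pauses` is hypothetical, gaps under one second are ignored, and fractional gaps that fall between the stated ranges (for example 5.5 seconds) are assigned to the next category up.

```python
from datetime import datetime, timedelta


def count_pauses(timestamps):
    """Count short, medium, long and very long pauses between events.

    Hypothetical helper; the boundary handling is an assumption.
    """
    counts = {"SP": 0, "MP": 0, "LP": 0, "VLP": 0}
    ordered = sorted(timestamps)
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).total_seconds()
        if gap < 1:
            continue  # too brief to count as a pause
        if gap <= 5:
            counts["SP"] += 1
        elif gap <= 15:
            counts["MP"] += 1
        elif gap <= 30:
            counts["LP"] += 1
        else:
            counts["VLP"] += 1
    return counts


start = datetime(2019, 8, 5, 16, 26, 0)
# Gaps between these offsets are 3, 10, 20 and 60 seconds.
stamps = [start + timedelta(seconds=s) for s in (0, 3, 13, 33, 93)]
pause_counts = count_pauses(stamps)
```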
If you want to calculate all statistics -- excluding event frequencies -- then do the following:
from interlib.preprocessing.statistics import Statistics
interaction_events = set(['NARRATIVE_ELEMENT_CHANGE', 'NEXT_BUTTON_CLICKED', ...]) # set of all user events you want to consider
stats = Statistics(
user_events, # the user -> list of events dictionary
completion_point = 'Make step 25', # the point in the experience determined to be the end
n_jobs = -1 # the number of cores to run on (-1, the default, runs on all available cores)
)
user_statistics = stats.calculate_statistics(interaction_events)
print(user_statistics)
{"<user_1>": {
"hidden_time": 0.0, "time_to_completion": 1722.157, "reach_end": True, "raw_session_length": 1847.197,
"session_length": 1847.197, "SP": 4, "MP": 0, "LP": 4, "VLP": 25,
"NEXT_BUTTON_CLICKED": 56, "BACK_BUTTON_CLICKED": 0, "VIDEO_SCRUBBED": 0, "VOLUME_CHANGED": 0,
"REPEAT_BUTTON_CLICKED": 1, "BROWSER_VISIBILITY_CHANGE": 0, "SWITCH_VIEW_BUTTON_CLICKED": 2,
"NARRATIVE_ELEMENT_CHANGE": 30, "FULLSCREEN_BUTTON_CLICKED": 0, "OVERLAY_BUTTON_CLICKED": 2,
"SUBTITLES_BUTTON_CLICKED": 0, "PLAY_PAUSE_BUTTON_CLICKED": 0, "LINK_CHOICE_CLICKED": 0,
"USER_SET_VARIABLE": 0, "total_events": 91},
"<user_N>": {}, }
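To show what the event counts and relative frequencies amount to, here is a minimal sketch for a single user's event list. It assumes each event carries an `action_name` key, as in the parsed format above; the function `count_events` is hypothetical and is not the library's implementation.

```python
from collections import Counter


def count_events(events, interaction_events):
    """Return (counts, relative_frequencies) for one user's events.

    Hypothetical helper for illustration only.
    """
    # Start every requested event at zero so absent events still appear.
    counts = Counter({name: 0 for name in interaction_events})
    for event in events:
        name = event["action_name"]
        if name in interaction_events:
            counts[name] += 1
    total = sum(counts.values())
    relative = {
        name: (count / total if total else 0.0)
        for name, count in counts.items()
    }
    return dict(counts), relative


events = [
    {"action_name": "NEXT_BUTTON_CLICKED"},
    {"action_name": "NEXT_BUTTON_CLICKED"},
    {"action_name": "BACK_BUTTON_CLICKED"},
    {"action_name": "VOLUME_CHANGED"},  # not in the set below, so ignored
    {"action_name": "NEXT_BUTTON_CLICKED"},
]
wanted = {"NEXT_BUTTON_CLICKED", "BACK_BUTTON_CLICKED", "VIDEO_SCRUBBED"}
counts, relative = count_events(events, wanted)
```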
If you want to calculate specific statistics, e.g. the time statistics, the library provides that option:
from interlib.preprocessing.statistics import Statistics
# create a Statistics object
stats = Statistics(
user_events,
completion_point = 'Make step 25', # without this, some statistics cannot be calculated (reached end and time to completion)
n_jobs = -1
)
interaction_events = set(['NARRATIVE_ELEMENT_CHANGE', 'NEXT_BUTTON_CLICKED', ...]) # set of all user events you want to consider
# calculate the time statistics
time_statistics = stats.time_statistics()
# calculate pause statistics
pause_statistics = stats.pause_statistics()
# calculate event statistics
event_statistics = stats.event_statistics(
interaction_events = interaction_events
)
# calculate event frequencies
event_frequencies = stats.event_frequencies(
frequencies = [0, 60, 120, 180], # indicates that you want frequencies for minutes 0 to 1, 1 to 2, and 2 to 3.
interaction_events = interaction_events
)
# You can also fetch just the session lengths from the user events
session_lengths = stats.calculate_session_length()
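The thresholds passed as frequencies define consecutive time windows. The sketch below shows one way such windowed counts could be computed, assuming each event has already been reduced to a (seconds since session start, event name) pair. The function `bucket_event_counts`, the half-open windows and the `"0_60"` style keys are all assumptions made for illustration; the library's own output format may differ.

```python
from bisect import bisect_right
from collections import Counter


def bucket_event_counts(events, thresholds):
    """Count events per time window.

    events: iterable of (seconds_since_start, event_name) pairs.
    thresholds: sorted list of window boundaries in seconds.
    Hypothetical helper; windows are treated as [lower, upper).
    """
    windows = list(zip(thresholds, thresholds[1:]))
    buckets = {f"{lo}_{hi}": Counter() for lo, hi in windows}
    for seconds, name in events:
        # Index of the window whose lower bound is <= seconds.
        idx = bisect_right(thresholds, seconds) - 1
        if 0 <= idx < len(windows):
            lo, hi = windows[idx]
            buckets[f"{lo}_{hi}"][name] += 1
    return buckets


timed_events = [
    (5, "NEXT_BUTTON_CLICKED"),
    (42, "NEXT_BUTTON_CLICKED"),
    (61, "BACK_BUTTON_CLICKED"),
    (200, "NEXT_BUTTON_CLICKED"),  # beyond the last threshold, so dropped
]
frequencies = bucket_event_counts(timed_events, [0, 60, 120, 180])
```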
Note regarding n_jobs: if you have a small dataset, set n_jobs to a single (1) core; otherwise the default will result in slow performance. I would recommend incrementally increasing the parameter when the data size is over 2GB (i.e., 2 cores for 2 to 4GB, 3 cores for 4 to 6GB, 4+ cores for 6GB+). The parameter also sets how computation is performed throughout the extractor you're working with, i.e. if n_jobs = -1 in Statistics then all functions in that object will use all available cores.
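The rule of thumb above can be written down as a small helper. This is only a sketch of that guideline: `suggest_n_jobs` is a hypothetical function, not part of interlib, and it returns 4 as the floor for the "4+ cores" tier without checking how many cores the machine actually has.

```python
GB = 10 ** 9  # decimal gigabyte, in bytes


def suggest_n_jobs(data_size_bytes):
    """Map a dataset size to a core count, following the guideline above.

    Hypothetical helper; the tier boundaries are taken from the note.
    """
    if data_size_bytes < 2 * GB:
        return 1
    if data_size_bytes < 4 * GB:
        return 2
    if data_size_bytes < 6 * GB:
        return 3
    return 4  # "4+ cores": treat 4 as a starting point


small = suggest_n_jobs(500 * 10 ** 6)
large = suggest_n_jobs(7 * GB)
```

The size of a file on disk can be obtained with `os.path.getsize(path)` and passed straight in.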
An alternative data representation is sequences, where the events are processed into a common format and their temporal ordering is preserved. Before starting, you need to define both the interaction events that you want to include in the sequences and their aliases (short-hand names):
interaction_events = set(["PLAY_PAUSE_BUTTON_CLICKED", "BACK_BUTTON_CLICKED", ...])
aliases = dict({"PLAY_PAUSE_BUTTON_CLICKED": "PP", "BACK_BUTTON_CLICKED": "BB", ...})
This extractor works in much the same way as Statistics, and it assumes that the events are in the same format as before.
To extract the sequences (this presumes you have already loaded a user_events object):
from interlib.preprocessing.sequences import Sequences
seq = Sequences(user_events) # set-up the Sequences object
user_sequences = seq.get_sequences(interaction_events, aliases)
print(user_sequences)
{
"<user_1>": ["NEC", "MP", "PP", "VLP", "NB", "NB", "NEC", "VLP", ...],
"<user_2>": [],
"<user_N>": []
}
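To illustrate the role of the aliases, the sketch below turns one user's events into a time-ordered list of short-hand tokens. It is a simplification under stated assumptions: `to_alias_sequence` is a hypothetical function, events are assumed to carry `timestamp` and `action_name` keys, and the pause tokens (such as MP and VLP) that appear in the library's output are not reproduced here.

```python
from datetime import datetime


def to_alias_sequence(events, interaction_events, aliases):
    """Return the alias of each selected event, in temporal order.

    Hypothetical helper; events without an alias keep their full name.
    """
    ordered = sorted(events, key=lambda event: event["timestamp"])
    return [
        aliases.get(event["action_name"], event["action_name"])
        for event in ordered
        if event["action_name"] in interaction_events
    ]


sample_events = [
    {"timestamp": datetime(2019, 8, 5, 16, 27, 10), "action_name": "BACK_BUTTON_CLICKED"},
    {"timestamp": datetime(2019, 8, 5, 16, 26, 36), "action_name": "PLAY_PAUSE_BUTTON_CLICKED"},
    {"timestamp": datetime(2019, 8, 5, 16, 26, 50), "action_name": "VOLUME_CHANGED"},
]
selected = {"PLAY_PAUSE_BUTTON_CLICKED", "BACK_BUTTON_CLICKED"}
short_names = {"PLAY_PAUSE_BUTTON_CLICKED": "PP", "BACK_BUTTON_CLICKED": "BB"}
sequence = to_alias_sequence(sample_events, selected, short_names)
```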
In some analysis cases, n-grams can prove a useful tool to represent sequences. As such, the library provides this option:
n_grams = seq.get_ngrams(n = 3) # extract tri-grams
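For readers unfamiliar with n-grams, the following sketch shows the general sliding-window technique on a single sequence of tokens. The function `ngrams` is a generic illustration and is not the library's get_ngrams; the return type of the library method may differ.

```python
def ngrams(sequence, n):
    """Return every run of n consecutive items in the sequence, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the sequence, one step at a time.
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tokens = ["NEC", "MP", "PP", "VLP", "NB"]
tri_grams = ngrams(tokens, 3)
```

A sequence shorter than n yields an empty list, since no complete window fits.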
While the above deals with extracting data representations and features from the data, the utility package provides some common functions that may come in handy while working with this type of data. It is by no means exhaustive; it is essentially a collection of functions that I have found useful when processing the data in the past. The main function in util, to_dict, has already been covered.
from interlib.util import parse_raw_data, parse_timestamp, to_dataframe
Parsing Raw Data
If you have a list of raw events, you can parse them into a format that the library recognises:
parsed_events = parse_raw_data(
raw_data,
datetime_format = "%Y-%m-%d %H:%M:%S.%f",
include_narrative_element_id = False
)
Parsing timestamps
When handling the raw data, a common task was to parse the timestamps from strings into datetime objects. This function, given a list of events in their raw format, each with a timestamp element, parses the timestamps into datetime objects:
parsed_events = parse_timestamp(
raw_data,
datetime_format = "%Y-%m-%d %H:%M:%S.%f"
)
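The datetime_format string uses the standard library's strptime directives. As a rough sketch of the conversion involved (not the library's implementation), the example below parses the timestamp of each event with `datetime.strptime`; the function `with_parsed_timestamps` and the sample events are hypothetical.

```python
from datetime import datetime

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def with_parsed_timestamps(events, datetime_format=DATETIME_FORMAT):
    """Return copies of the events with string timestamps parsed to datetime."""
    return [
        {**event, "timestamp": datetime.strptime(event["timestamp"], datetime_format)}
        for event in events
    ]


raw_events = [
    {"id": 1, "timestamp": "2019-08-05 16:26:36.940000"},
    {"id": 2, "timestamp": "2019-08-05 16:27:01.120000"},
]
parsed = with_parsed_timestamps(raw_events)
```

The input list is left untouched, so the raw strings remain available if the format later turns out to be wrong.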
To DataFrame
To start analysing the data, I recommend using pandas. To help, there is a utility function that converts the output from the Statistics object into a usable dataframe.
df = to_dataframe(user_statistics)
Publish: TODO :)