InterLib


InterLib is a data processing library for working with interaction data. In their raw form, user interactions tell us little and need to be processed into a workable, descriptive format. Using this library, you can extract session statistics and sequences, and perform various common processing tasks that I have found useful in the past.

Installation

Currently, the package is not available on PyPI -- it will be once the library has been open sourced alongside some data -- so it has to be installed directly from GitHub. Please ensure that you have a Python 3.6+ virtual environment installed and activated before installing this package; I would recommend Anaconda (Miniconda: https://docs.conda.io/en/latest/miniconda.html).

$ pip install -e git+https://github.com/JonoCX/interaction-lib.git#egg=interlib
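After installing, a quick import makes a reasonable sanity check (to_dict is used throughout the examples below):

$ python -c "from interlib.util import to_dict"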

Usage

The library assumes that you have interaction data in a particular format, stored as a JSON file:

{"<user_1>": [
    {"id": 1,
    "user": "<user_1>",
    "timestamp": "datetime.datetime(2019, 8, 5, 16, 26, 36, 940000)",
    "action_type": "STORY_NAVIGATION",
    "action_name": "NARRATIVE_ELEMENT_CHANGE",
    "data": {
        "romper_type": "STORY_NAVIGATION",
        "romper_name": "NARRATIVE_ELEMENT_CHANGE",
        "romper_id": "",
        "romper_from_state": "null",
        "romper_to_state": "Intro Message",
        "current_narrative_element": "null",
        "current_representation": ""
    }}, {}],
"<user_N>": []}

The snippet above shows a single event, in parsed format, recorded in an interactive experience; the overall structure is a dictionary mapping each user to a list of their events. If you're working with raw data extracted directly from an experience, it will not be in this format. As such, the library includes a utility function to convert raw data (from an SQL dump) into a usable format:

from interlib.util import to_dict
user_events = to_dict('path/to/json.json')

The to_dict function accepts additional parameters. For example, if you're dealing with a large amount of data, you may want to split it into chunks (useful for parallel processing):

user_events = to_dict('path/to/json.json', split = True) # splits into two chunks
user_events = to_dict('path/to/json.json', split = 4) # splits into four chunks
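As a sketch of how the chunks might then be used -- assuming, as I do here, that to_dict returns a list of user -> events dictionaries when split is set (check the source to confirm) -- you could farm them out to a process pool:

from multiprocessing import Pool
from interlib.util import to_dict

def count_events(chunk):
    # count the events in one chunk; stands in for any heavier processing
    return sum(len(events) for events in chunk.values())

if __name__ == '__main__':
    # assumption: with split set, to_dict returns a list of user -> events dicts
    chunks = to_dict('path/to/json.json', split = 4)
    with Pool(processes = 4) as pool:
        print(sum(pool.map(count_events, chunks)))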

In some cases, you may want to process only a select group of users:

users = set(["<user_1>", "<user_2>"]) # a set of user id strings
user_events = to_dict('path/to/json.json', users_to_include = users)

There is a range of other parameters that can be set -- please explore the to_dict function in the source for the full list.

Statistics

A range of statistics can be extracted using the library:

  • Time: hidden time, session length, time to completion, raw session length, reach end;
  • Pauses: counts of short (1 to 5 seconds), medium (6 to 15), long (16 to 30), and very long (30+) pauses;
  • Events: counts and relative frequencies of interaction events;
  • Event Frequencies: given a set of time thresholds, the frequency of each event within those thresholds.

Again, the statistics package works under the assumption that you have parsed the raw events into the format described above.
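To make the pause thresholds above concrete, here is a minimal sketch of how the gap between two consecutive events might be bucketed. This is an illustration only, not the library's implementation, and the exact boundary handling is an assumption; the SP/MP/LP/VLP keys match those in the statistics output below.

from datetime import datetime

def pause_category(gap_seconds):
    # thresholds from the list above; boundary handling is assumed
    if gap_seconds < 1:
        return None # sub-second gaps are not counted as pauses
    if gap_seconds <= 5:
        return 'SP' # short pause
    if gap_seconds <= 15:
        return 'MP' # medium pause
    if gap_seconds <= 30:
        return 'LP' # long pause
    return 'VLP' # very long pause

first = datetime(2019, 8, 5, 16, 26, 36)
second = datetime(2019, 8, 5, 16, 27, 40)
print(pause_category((second - first).total_seconds())) # VLP -- a 64 second gap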

If you want to calculate all statistics -- excluding event frequencies -- then do the following:

from interlib.preprocessing.statistics import Statistics

interaction_events = set(['NARRATIVE_ELEMENT_CHANGE', 'NEXT_BUTTON_CLICKED', ...]) # set of all user events you want to consider
stats = Statistics(
    user_events, # the user -> list of events dictionary
    completion_point = 'Make step 25', # the point in the experience determined to be the end
    n_jobs = -1 # the number of cores to use (-1, the default, runs on all available cores)
)
user_statistics = stats.calculate_statistics(interaction_events)
print(user_statistics)
{"<user_1>": {
        "hidden_time": 0.0, "time_to_completion": 1722.157, "reach_end": True, "raw_session_length": 1847.197,
        "session_length": 1847.197,  "SP": 4, "MP": 0, "LP": 4, "VLP": 25,
        "NEXT_BUTTON_CLICKED": 56, "BACK_BUTTON_CLICKED": 0, "VIDEO_SCRUBBED": 0, "VOLUME_CHANGED": 0,
        "REPEAT_BUTTON_CLICKED": 1, "BROWSER_VISIBILITY_CHANGE": 0, "SWITCH_VIEW_BUTTON_CLICKED": 2,
        "NARRATIVE_ELEMENT_CHANGE": 30, "FULLSCREEN_BUTTON_CLICKED": 0, "OVERLAY_BUTTON_CLICKED": 2,
        "SUBTITLES_BUTTON_CLICKED": 0, "PLAY_PAUSE_BUTTON_CLICKED": 0, "LINK_CHOICE_CLICKED": 0,
        "USER_SET_VARIABLE": 0, "total_events": 91},
"<user_N>": {}, }

If you want to calculate specific statistics, e.g. the time statistics, the library provides that option:

from interlib.preprocessing.statistics import Statistics

# create a Statistics object
stats = Statistics(
    user_events,
    completion_point = 'Make step 25', # without this, some statistics (reach_end and time_to_completion) cannot be calculated
    n_jobs = -1
)
interaction_events = set(['NARRATIVE_ELEMENT_CHANGE', 'NEXT_BUTTON_CLICKED', ...]) # set of all user events you want to consider

# calculate the time statistics
time_statistics = stats.time_statistics()

# calculate pause statistics
pause_statistics = stats.pause_statistics()

# calculate event statistics
event_statistics = stats.event_statistics(
    interaction_events = interaction_events
)

# calculate event frequencies
event_frequencies = stats.event_frequencies(
    frequencies = [0, 60, 120, 180], # thresholds in seconds: frequencies for minutes 0 to 1, 1 to 2, and 2 to 3
    interaction_events = interaction_events
)

# You can also fetch just the session lengths from the user events
session_lengths = stats.calculate_session_length()

Note regarding n_jobs: if you have a small dataset, use a single core (n_jobs = 1); otherwise the default will result in slow performance. I would recommend increasing the parameter incrementally once the data size exceeds 2GB (i.e., 2 cores for 2 to 4GB, 3 cores for 4 to 6GB, 4+ cores for 6GB+). The parameter also governs how computation is performed throughout the extractor you're working with, i.e. if n_jobs = -1 in Statistics, then all functions on that object will use all available cores.

Sequences

An alternative data representation is sequences, in which events are processed into a common format and their temporal ordering is preserved. Before starting, you need to define both the interaction events that you want to include in the sequences and their aliases (short-hand names):

interaction_events = set(["PLAY_PAUSE_BUTTON_CLICKED", "BACK_BUTTON_CLICKED", ...])
aliases = dict({"PLAY_PAUSE_BUTTON_CLICKED": "PP", "BACK_BUTTON_CLICKED": "BB", ...})

This extractor works in much the same way as Statistics and presumes that the events are in the same format as before.

To extract the sequences (this presumes you have already loaded a user_events object):

from interlib.preprocessing.sequences import Sequences

seq = Sequences(user_events) # set-up the Sequences object
user_sequences = seq.get_sequences(interaction_events, aliases)
print(user_sequences)
{
    "<user_1>": ["NEC", "MP", "PP", "VLP", "NB", "NB", "NEC", "VLP", ...],
    "<user_2>": [],
    "<user_N>": []
}
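Since the sequences are expressed in alias form, a small dictionary inversion recovers the full event names when reading them back; pause tokens such as VLP are not in the alias map and pass through unchanged:

inverse_aliases = {alias: event for event, alias in aliases.items()}
readable = [inverse_aliases.get(token, token) for token in user_sequences["<user_1>"]]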

For some analyses, n-grams are a useful way to represent sequences. As such, the library provides this option:

n_grams = seq.get_ngrams(n = 3) # extract tri-grams
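From there, a standard Counter gives you the most common n-grams across users. I'm assuming here that get_ngrams returns a user -> list of n-grams dictionary, so check the return shape in the source before relying on this:

from collections import Counter

counts = Counter()
for user, grams in n_grams.items():
    counts.update(tuple(gram) for gram in grams) # tuples, so n-grams are hashable
print(counts.most_common(5)) # the five most frequent tri-grams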

Utility

While the above deals with extracting data representations and features, the utility package provides some common functions that may come in handy when working with this type of data. It is by no means exhaustive; it is essentially a collection of functions that I have found useful when processing the data in the past. The main function in util, to_dict, has already been covered.

from interlib.util import parse_raw_data, parse_timestamp, to_dataframe

Parsing Raw Data If you have a list of raw events, you can parse them into a format recognisable by the library:

parsed_events = parse_raw_data(
    raw_data,
    datetime_format = "%Y-%m-%d %H:%M:%S.%f",
    include_narrative_element_id = False
)

Parsing timestamps When handling raw data, a common task is to parse timestamps from strings into datetime objects. Given a list of events in their raw format, each with a timestamp element, this function does exactly that:

parsed_events = parse_timestamp(
    raw_data,
    datetime_format = "%Y-%m-%d %H:%M:%S.%f"
)

To DataFrame To start analysing the data, I recommend using pandas. To help, there is a utility function that converts the output of the Statistics object into a usable DataFrame.

df = to_dataframe(user_statistics)
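From there, normal pandas operations apply -- assuming one row per user with the statistic names as columns (as in the output shown earlier):

print(df['session_length'].describe()) # distribution of session lengths
print(df.sort_values('total_events', ascending = False).head()) # most active users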

Reference

Publish: TODO :)
