Cloud technologies and the monetization of user tracking have brought big data upon us in an easy-to-access form, but the tools, practices, and general understanding of how to deal with it are lacking. In particular, such data are operational, so traditional statistical and machine learning techniques are not directly appropriate. Such data are also large: we cannot expect to analyze them on a laptop with our favorite tool or language, but need to build software to do it.
Hence this course: how do we move from software engineering and data analysis to evidence engineering?
Virtual cloud via Docker containers
- Allows a single application to transparently access more resources (see the sketch below)
- Much more general than Hadoop or Spark
- Allows mixing various technologies on a single host or across hosts
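For example, a container with the needed toolchain can be launched programmatically; a minimal sketch using the Docker SDK for Python (pip install docker), assuming a local Docker daemon, with an illustrative image and mount path:

    # Run a throwaway analysis container and capture its output.
    import docker

    client = docker.from_env()
    output = client.containers.run(
        "python:3.11-slim",                    # any image with the needed tools
        ["python", "-c", "print('hello from the container')"],
        volumes={"/data": {"bind": "/data", "mode": "ro"}},  # hypothetical host data dir
        remove=True,                           # clean up the container when done
    )
    print(output.decode())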
Spark
- Automates breaking data into pieces
- Needs the data stored in HDFS first: slow and tedious
- Can easily overwhelm the driver (e.g., with collect())
- Performance issues
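A minimal PySpark sketch of the points above, assuming data already copied into HDFS (the path and the author;file;... record format are illustrative); the aggregation happens before anything reaches the driver, since collect() on raw data would pull it all into driver RAM:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("commit-counts").getOrCreate()

    # Spark partitions the file across executors automatically.
    lines = spark.read.text("hdfs:///data/commits.txt")

    # Count records per author; only the small aggregate reaches the driver.
    counts = (lines.rdd
              .map(lambda row: (row.value.split(";")[0], 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))   # take() rather than collect() keeps the driver safe
    spark.stop()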
OpenMPI
- Allows access to more RAM by sharing memory across hosts
- Quite difficult to use
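A minimal mpi4py sketch (one way to drive OpenMPI from Python), run for example as mpirun -np 4 python sum_chunks.py; each rank holds its own slice of the data in its own RAM, so the combined working set can exceed any single host:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Hypothetical: each rank loads (here, generates) its own slice of the data.
    local_data = range(rank, 1_000_000, size)
    local_sum = sum(local_data)

    # Combine the per-rank partial results on rank 0.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("total =", total)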
Graph databases:
- Make relationship queries fast and convenient
- A lot of effort preparing and loading the data
- Immature, with performance problems
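Once the data are loaded, relationship queries are short; a sketch against Neo4j with the official Python driver, where the URI, credentials, labels, and relationship types are illustrative assumptions rather than a fixed schema:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Developers linked to a given developer through files both have modified.
    query = """
    MATCH (a:Developer)-[:MODIFIED]->(f:File)<-[:MODIFIED]-(b:Developer)
    WHERE a.name = $name AND a <> b
    RETURN b.name AS peer, count(f) AS shared_files
    ORDER BY shared_files DESC LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(query, name="alice"):
            print(record["peer"], record["shared_files"])

    driver.close()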
Graph libraries: Boost
- Best performance
- Some routines compatible with OpenMPI
- Need to prepare data
Basic Unix commands:
- sort - unbeatable performance with the right parameters
- Flexible
- Enhance with awk/perl/python for associative arrays (hash tables); see the sketch below
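A sketch of that sort-plus-hash-table pattern: GNU sort does the heavy lifting on disk, and a short Python filter aggregates with a dict; the flags and the author;file;... record format are illustrative, adjust to the machine and data:

    #   sort -t';' -k1,1 -S 8G --parallel=8 commits.txt | python count_by_author.py
    import sys
    from collections import defaultdict

    counts = defaultdict(int)            # the associative array / hash table
    for line in sys.stdin:
        author = line.split(";", 1)[0]   # hypothetical "author;file;..." records
        counts[author] += 1

    for author, n in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{n}\t{author}")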
Text analysis:
- Forget LSI/LDA: doc2vec/word2vec rule
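For instance, word2vec can be trained on tokenized commit messages; a minimal sketch with gensim (one possible library, version 4.x parameter names), where the toy corpus is purely illustrative:

    from gensim.models import Word2Vec

    corpus = [
        ["fix", "null", "pointer", "in", "parser"],
        ["add", "unit", "tests", "for", "parser"],
        ["refactor", "parser", "error", "handling"],
    ]

    model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("parser", topn=3))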
SQL databases:
- Don't work for large data
Understanding the limitations and use cases for various technologies is a must.
-
Planning data workflow
- Trade-offs among CPU/RAM/disk/network
- Processing steps
- Debugging steps (need to plan ahead)
- Various approaches to prototyping
- Resilience
-
Establishing data quality issues
- Identity of actors
- Digital signature of entities: file content, filenames, text messages, ...
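One common signature is a hash of the file content (git computes blob ids the same way), so identical files can be linked across projects regardless of path or author; a minimal sketch:

    import hashlib

    def blob_sha1(path):
        # Git-style content signature: sha1 over "blob <size>\0<content>".
        data = open(path, "rb").read()
        header = f"blob {len(data)}\0".encode()
        return hashlib.sha1(header + data).hexdigest()

    # Files with the same signature are the same artifact, wherever they occur.
    # print(blob_sha1("README.md"))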
-
Debugging noisy graphs
Unlike with small data, building a full application within one semester is a dream. A prototype, perhaps...
Many things need to be tried in parallel:
- Can the necessary data be processed?
- Would it be useful for an application? (Needs evaluation on a subset.)
- How will debugging be conducted?
Example motivations for applications based on the very large data from all public version control systems.
-
Relationship among developers based on commonly modified artifacts
- Different from traditional social metrics based on communication
- Can help find relationships among actors
- Can help link multiple identities of a single actor
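A minimal sketch of building that relationship from (developer, file) modification records: two developers are linked whenever they touched the same file, weighted by the number of shared files (the records are illustrative):

    from collections import defaultdict
    from itertools import combinations

    records = [
        ("alice", "src/parser.c"), ("bob", "src/parser.c"),
        ("alice", "README.md"),    ("carol", "src/parser.c"),
    ]

    devs_per_file = defaultdict(set)
    for dev, path in records:
        devs_per_file[path].add(dev)

    edge_weight = defaultdict(int)       # (dev_a, dev_b) -> shared-artifact count
    for devs in devs_per_file.values():
        for a, b in combinations(sorted(devs), 2):
            edge_weight[(a, b)] += 1

    print(dict(edge_weight))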
-
Relationship among developers based on similar language
- Can find developers working on similar topics or projects
- Can help link multiple identities of a single actor
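A minimal doc2vec sketch of this idea using gensim (again, just one option): all commit messages of a developer form one tagged document, and close vectors suggest similar topics, or possibly the same person behind two identities; the corpus and names are illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(["fix", "parser", "crash", "on", "empty", "input"], ["alice"]),
        TaggedDocument(["parser", "crash", "fixed", "for", "empty", "files"], ["a.smith"]),
        TaggedDocument(["update", "build", "scripts", "for", "windows"], ["bob"]),
    ]

    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
    print(model.dv.most_similar("alice", topn=2))   # developers using similar language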
-
Finding suitable recruits for an OSS project
- Use the code-based relationships to identify communities, and combine them with developer performance to make recommendations (see the sketch below).
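A minimal sketch, reusing the weighted co-modification graph idea from above: detect communities with networkx and rank the members of each community by a performance measure (the commit counts here are made up) to suggest candidates:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.Graph()
    G.add_weighted_edges_from([
        ("alice", "bob", 5), ("bob", "carol", 3), ("alice", "carol", 2),
        ("dave", "erin", 4),
    ])
    commits = {"alice": 120, "bob": 80, "carol": 200, "dave": 40, "erin": 60}

    for community in greedy_modularity_communities(G, weight="weight"):
        ranked = sorted(community, key=lambda d: commits.get(d, 0), reverse=True)
        print(ranked)   # most active members of each community first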
-
Orphanages: groups of no longer maintained projects
- To find maintainers before the owner leaves
- To identify alternatives in the supply chain
- To provide leads to volunteers willing to help
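A minimal sketch of the first step: flag projects whose latest commit is older than some cutoff (two years here) as candidate orphans; the project data are illustrative:

    from datetime import datetime, timedelta

    last_commit = {                       # project -> time of its latest commit
        "libfoo": datetime(2018, 3, 1),
        "barutils": datetime(2024, 11, 20),
    }

    cutoff = datetime.now() - timedelta(days=2 * 365)
    orphans = [p for p, t in last_commit.items() if t < cutoff]
    print(orphans)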