Cloud technologies and the monetization of user tracking have brought big data upon us in an easy-to-access form, but the tools, practices, and general understanding of how to deal with it are lacking. In particular, such data are operational, so traditional statistical and machine learning techniques are not directly appropriate. Such data are also large: we cannot expect to analyze them on a laptop with our favorite tool or language, but need to build software to do it.
Hence this course: how do we move from software engineering and data analysis to evidence engineering?
Virtual cloud via Docker containers
- Allows a single application to transparently access more resources (see the sketch below)
- Much more general than Hadoop or Spark
- Allows mixing various technologies on a single host or across hosts
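For example, a container with the needed toolchain can be launched programmatically; a minimal sketch using the Docker SDK for Python (pip install docker), assuming a local Docker daemon, with an illustrative image and mount path:

    # Run a throwaway analysis container and capture its output.
    import docker

    client = docker.from_env()
    output = client.containers.run(
        "python:3.11-slim",                    # any image with the needed tools
        ["python", "-c", "print('hello from the container')"],
        volumes={"/data": {"bind": "/data", "mode": "ro"}},  # hypothetical host data dir
        remove=True,                           # clean up the container when done
    )
    print(output.decode())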
Spark
- Automates breaking data into pieces
- Needs the data stored in HDFS first: slow and tedious
- Can easily overwhelm the driver (e.g., with collect())
- Performance issues
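A minimal PySpark sketch of the points above, assuming data already copied into HDFS (the path and the author;file;... record format are illustrative); the aggregation happens before anything reaches the driver, since collect() on raw data would pull it all into driver RAM:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("commit-counts").getOrCreate()

    # Spark partitions the file across executors automatically.
    lines = spark.read.text("hdfs:///data/commits.txt")

    # Count records per author; only the small aggregate reaches the driver.
    counts = (lines.rdd
              .map(lambda row: (row.value.split(";")[0], 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))   # take() rather than collect() keeps the driver safe
    spark.stop()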
OpenMPI
- Allows access to more RAM by sharing memory across hosts
- Quite difficult to use
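A minimal mpi4py sketch (one way to drive OpenMPI from Python), run for example as mpirun -np 4 python sum_chunks.py; each rank holds its own slice of the data in its own RAM, so the combined working set can exceed any single host:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Hypothetical: each rank loads (here, generates) its own slice of the data.
    local_data = range(rank, 1_000_000, size)
    local_sum = sum(local_data)

    # Combine the per-rank partial results on rank 0.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("total =", total)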
Graph databases:
- Make relationship queries fast and convenient
- A lot of effort preparing and loading the data
- Immature, with performance problems
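Once the data are loaded, relationship queries are short; a sketch against Neo4j with the official Python driver, where the URI, credentials, labels, and relationship types are illustrative assumptions rather than a fixed schema:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Developers linked to a given developer through files both have modified.
    query = """
    MATCH (a:Developer)-[:MODIFIED]->(f:File)<-[:MODIFIED]-(b:Developer)
    WHERE a.name = $name AND a <> b
    RETURN b.name AS peer, count(f) AS shared_files
    ORDER BY shared_files DESC LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(query, name="alice"):
            print(record["peer"], record["shared_files"])

    driver.close()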
Graph libraries: Boost
- Best performance
- Some routines compatible with OpenMPI
- Need to prepare data
Basic Unix commands:
- sort - unbeatable performance with the right parameters
- Flexible
- Enhance with awk/perl/python for associative arrays (hash tables); see the sketch below
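A sketch of that sort-plus-hash-table pattern: GNU sort does the heavy lifting on disk, and a short Python filter aggregates with a dict; the flags and the author;file;... record format are illustrative, adjust to the machine and data:

    #   sort -t';' -k1,1 -S 8G --parallel=8 commits.txt | python count_by_author.py
    import sys
    from collections import defaultdict

    counts = defaultdict(int)            # the associative array / hash table
    for line in sys.stdin:
        author = line.split(";", 1)[0]   # hypothetical "author;file;..." records
        counts[author] += 1

    for author, n in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{n}\t{author}")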
Text analysis:
- Forget LSI/LDA: doc2vec/word2vec rule
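For instance, word2vec can be trained on tokenized commit messages; a minimal sketch with gensim (one possible library, version 4.x parameter names), where the toy corpus is purely illustrative:

    from gensim.models import Word2Vec

    corpus = [
        ["fix", "null", "pointer", "in", "parser"],
        ["add", "unit", "tests", "for", "parser"],
        ["refactor", "parser", "error", "handling"],
    ]

    model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("parser", topn=3))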
SQL databases:
- Don't work for large data
Understanding the limitations and use cases for various technologies is a must.
-
Planning data workflow
- Trade-offs among CPU/RAM/disk/network
- Processing steps
- Debugging steps (need to plan ahead)
- Various approaches to prototyping
- Resilience
-
Establishing data quality issues
- Identity of actors
- Digital signature of entities: file content, filenames, text messages, ...
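One common signature is a hash of the file content (git computes blob ids the same way), so identical files can be linked across projects regardless of path or author; a minimal sketch:

    import hashlib

    def blob_sha1(path):
        # Git-style content signature: sha1 over "blob <size>\0<content>".
        data = open(path, "rb").read()
        header = f"blob {len(data)}\0".encode()
        return hashlib.sha1(header + data).hexdigest()

    # Files with the same signature are the same artifact, wherever they occur.
    # print(blob_sha1("README.md"))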
-
Debugging noisy graphs
Unlike with small data, building a full application within one semester is a dream. A prototype, perhaps...
Many things need to be tried in parallel:
- Can the necessary data be processed?
- Would it be useful for an application? (Needs evaluation on a subset.)
- How will debugging be conducted?
Example motivations for applications based on the very large data from all public version control systems.
-
Relationship among developers based on commonly modified artifacts
- Different from traditional social metrics based on communication
- Can help find relationships among actors
- Can help link multiple identities of a single actor
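A minimal sketch of building that relationship from (developer, file) modification records: two developers are linked whenever they touched the same file, weighted by the number of shared files (the records are illustrative):

    from collections import defaultdict
    from itertools import combinations

    records = [
        ("alice", "src/parser.c"), ("bob", "src/parser.c"),
        ("alice", "README.md"),    ("carol", "src/parser.c"),
    ]

    devs_per_file = defaultdict(set)
    for dev, path in records:
        devs_per_file[path].add(dev)

    edge_weight = defaultdict(int)       # (dev_a, dev_b) -> shared-artifact count
    for devs in devs_per_file.values():
        for a, b in combinations(sorted(devs), 2):
            edge_weight[(a, b)] += 1

    print(dict(edge_weight))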
-
Relationship among developers based on similar language
- Can find developers working on similar topics or projects
- Can help link multiple identities of a single actor
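A minimal doc2vec sketch of this idea using gensim (again, just one option): all commit messages of a developer form one tagged document, and close vectors suggest similar topics, or possibly the same person behind two identities; the corpus and names are illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(["fix", "parser", "crash", "on", "empty", "input"], ["alice"]),
        TaggedDocument(["parser", "crash", "fixed", "for", "empty", "files"], ["a.smith"]),
        TaggedDocument(["update", "build", "scripts", "for", "windows"], ["bob"]),
    ]

    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
    print(model.dv.most_similar("alice", topn=2))   # developers using similar language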
-
Finding suitable recruits for an OSS project
- Use the code-based relationships to identify communities, and combine them with developer performance to make recommendations (see the sketch below).
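A minimal sketch, reusing the weighted co-modification graph idea from above: detect communities with networkx and rank the members of each community by a performance measure (the commit counts here are made up) to suggest candidates:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.Graph()
    G.add_weighted_edges_from([
        ("alice", "bob", 5), ("bob", "carol", 3), ("alice", "carol", 2),
        ("dave", "erin", 4),
    ])
    commits = {"alice": 120, "bob": 80, "carol": 200, "dave": 40, "erin": 60}

    for community in greedy_modularity_communities(G, weight="weight"):
        ranked = sorted(community, key=lambda d: commits.get(d, 0), reverse=True)
        print(ranked)   # most active members of each community first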
-
Orphanages: groups of no longer maintained projects
- To find maintainers before the owner leaves
- To identify alternatives in the supply chain
- To provide leads to volunteers willing to help
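A minimal sketch of the first step: flag projects whose latest commit is older than some cutoff (two years here) as candidate orphans; the project data are illustrative:

    from datetime import datetime, timedelta

    last_commit = {                       # project -> time of its latest commit
        "libfoo": datetime(2018, 3, 1),
        "barutils": datetime(2024, 11, 20),
    }

    cutoff = datetime.now() - timedelta(days=2 * 365)
    orphans = [p for p, t in last_commit.items() if t < cutoff]
    print(orphans)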