12 weeks, 2 hours per week
20 min per episode, so six episodes per week.
This course will cover:
***** Spark MLlib
**** ML Pipeline and GraphX
*** Spark Core and Spark SQL
** Spark Streaming
* Scikit-learn for reference.
- Advanced Analytics with Spark
- Machine Learning with Spark
- The LION Way: Machine Learning plus Intelligent Optimization
- Others...
- Spark ABC
- Machine learning ABC
- Graph Computing ABC
- Demos for Spark, MLlib, and GraphX
- Logistic regression
- Linear regression
- SVM
- LASSO
- Ridge regression
- Applied demos such as handwritten digit recognition, etc.
- Recommendation ALS
- Singular Value Decomposition
- The implementation in both MLlib and Mahout
- Applied demo of recommendation with PredictionIO.
- k-means
- LDA
- Applied demo of geo-location clustering and topic modeling
- Lambda Architecture
- Parameter Server
- Several algorithms from the Freeman Lab
- Applied demo such as the zebrafish experiment
- Pipeline of Scikit-learn
- Pipeline of Spark (DataFrame, ML Pipeline, etc.)
- Applied demo (TBD)
- Scientific computing and notes from Matrix Computations
- Matrix libs (in C/Fortran and Java)
- Matrix in MLlib
- Applied demo (TBD)
- Graph computing and libs
- revisit LDA, ALS
- Applied demo such as community detection for food network/recommendation.
- Tree model
- Random forest
- Ensemble in Kaggle and practice
- Applied demo for ensemble
- Evaluation methods
- Implementations in MLlib
- Online / Offline evaluations
- Commonly used optimization algorithms
- Sequential gene of optimization algorithms
- BSP model to BSP+ model to SSP
- Future ways?
- One, two, three of practical ML
- Rethinking practical machine learning
- How to build a great machine learning system?
- Compare with Mahout / Oryx2 / VM / ...
| Chapter | Topic | Algorithms | Dataset | Source |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| 2 | Record Linkage | Entity resolution, record dedup, merge-and-purge, list washing | Some business data such as TCPDS | UCI ML repo |
| 3 | Recommending | ALS | Who plays what or who rates what | Audioscrobbler |
| 4 | Predicting Forest Cover | Decision Tree | The type of forest covering parcels of land in Colorado | UCI ML repo |
| 5 | Anomaly detection in network traffic | K-means | Network intrusion data | KDD Cup 1999 Dataset |
| 6 | Understanding Wikipedia | Latent Semantic Analysis, SVD, TF-IDF, etc. | Wikipedia texts | Wikipedia |
| 7 | Analyzing Co-occurrence Networks | Massive graph algorithms in GraphX | MEDLINE citation index | US National Library of Medicine |
| 8 | Geo and Temporal data analysis | Building sessions | New York Taxicab Data | New York City Taxi and Limousine Commission |
| 9 | Estimating Financial Risk | Monte Carlo Simulation | Stock Data | Yahoo! |
| 10 | Analyzing Genomic Data | Massive genome analysis algorithms | Genome data | NCBI |
| 11 | Analyzing Neuroimaging Data | Thunder | Images of zebrafish brains | Thunder repository |
/src/chapterx --> The code snippets of each chapter
/src/chapterx/{java, python, scala} --> Code snippets written with Mahout, Scikit-learn, and Spark
Type | Algorithm | Scikit-learn | Spark |
---|---|---|---|
Classification | Logistic Regression | YES | YES |
Classification | Perceptron | YES | |
Classification | Passive Aggressive Algorithms | YES | |
Classification | SVM | YES | YES |
Classification | Naive Bayes | YES | YES |
Classification | Decision Tree | YES | YES |
Classification | Ensemble methods | YES | YES |
Classification | Label Propagation | YES | YES (in GraphX) |
Classification | LDA and QDA | YES | |
Regression | Ordinary Least Squares | YES | YES |
Regression | Ridge Regression | YES | YES |
Regression | LASSO | YES | YES |
Regression | Elastic Net | YES | |
Regression | Multi-task LASSO | YES | |
Regression | Least Angle Regression | YES | |
Regression | LARS LASSO | YES | |
Regression | Orthogonal Matching Pursuit | YES | |
Regression | Bayesian Regression | YES | |
Regression | Polynomial Regression | YES | |
Regression | Nearest Neighbor | YES | YES |
Regression | Gaussian Process | YES | |
Regression | Isotonic Regression | YES | |
Clustering | K-means | YES | YES |
Clustering | Affinity Propagation | YES | |
Clustering | Mean shift | YES | |
Clustering | Spectral Clustering | YES | |
Clustering | Ward | YES | |
Clustering | Agglomerative clustering | YES | |
Clustering | DBSCAN | YES | |
Clustering | Gaussian Mixtures | YES | |
Dimension Reduction | PCA | YES | YES |
Dimension Reduction | SVD / LSA | YES | YES |
Dimension Reduction | Dictionary Learning | YES | |
Dimension Reduction | Factor Analysis | YES | |
Dimension Reduction | ICA | YES | |
Dimension Reduction | NMF | YES | |
Model Selection | Cross Validation | YES | YES |
Model Selection | Grid Search | YES | |
Model Selection | Pipeline | YES | YES |
Model Selection | Feature Union | YES | YES |
Model Selection | Model Evaluation | YES | YES |
Model Selection | Model Persistence | YES | |
Model Selection | Validation Curves | YES | |
Preprocessing | Standardization | YES | YES |
Preprocessing | Encoding categorical features | YES | YES (dependency) |
Preprocessing | Binarization | YES | |
Preprocessing | Normalization | YES | YES |
Preprocessing | Label preprocessing | YES | |
Preprocessing | Imputation of missing values | YES | |
Preprocessing | Unsupervised data reduction | YES | |
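
To make the Scikit-learn column of the table above more concrete, here is a minimal sketch that chains three of its rows (Standardization, Logistic Regression, and Pipeline) and scores the result with Cross Validation on the handwritten digits dataset bundled with Scikit-learn. The step names `scale` and `clf`, the `max_iter=1000` setting, and the 5-fold split are illustrative assumptions, not choices taken from the course material.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Handwritten digits: 8x8 grayscale images flattened to 64 features each.
X, y = load_digits(return_X_y=True)

# Step names ("scale", "clf") are arbitrary labels chosen for this sketch.
pipeline = Pipeline([
    ("scale", StandardScaler()),                 # Preprocessing: standardization
    ("clf", LogisticRegression(max_iter=1000)),  # Classification: logistic regression
])

# Model selection: 5-fold cross validation over the whole pipeline, so the
# scaler is re-fit on each training fold and never sees the held-out fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

Wrapping the scaler and the classifier in one `Pipeline` object is what lets cross validation treat them as a single estimator. Spark's ML Pipeline follows a similar idea, with an ordered list of stages passed to `pyspark.ml.Pipeline(stages=[...])`, although the stages operate on DataFrames rather than in-memory arrays.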