Deep Learning Project Template

This repository offers a lightweight yet functional template for deep learning projects. It assumes PyTorch as the deep learning framework, but the layout transfers easily to projects built with other frameworks.

Getting Started

You can fork this repo and use it as a template when creating a new repo on GitHub.

Or use the template directly from the forked template repo.

Alternatively, you can simply download this repo in zipped format and get started.

Next, set up the environment by running the following commands in the project root:

conda create -n project_name python=3.8
conda activate project_name  # so that poetry is installed into the new environment
conda install poetry  # or 'pip install poetry'
poetry new project_name

Finally, you can wrap up the setup by manually installing and updating any packages you'd like. Please refer to the Extra Packages section for some awesome packages.
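After installing extra packages by hand, it can help to confirm which ones the active environment actually sees. The following is a minimal sketch using only the standard library; the helper name `check_packages` and the example wish list are hypothetical and not part of this template.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Hypothetical wish list; replace with the packages your project needs.
    wanted = ["torch", "numpy", "optuna"]
    for name, found in check_packages(wanted).items():
        print(f"{name:<12} {'ok' if found else 'missing'}")
```

Note that the check uses import names, which sometimes differ from the names used to install a package.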

Template Layout

dl-project-template
.
│
├── LICENSE.md
├── README.md
├── makefile            # makefile for various commands (install, train, pytest, mypy, lint, etc.)
├── mypy.ini            # MyPy type checking configurations
├── pylint.rc           # Pylint code quality checking configurations
├── pyproject.toml      # Poetry project and environment configurations
│
├── data
│   ├── ...             # data reference files (index, readme, etc.)
│   ├── raw             # untreated data directly downloaded from source
│   ├── interim         # intermediate data processing results
│   └── processed       # processed data (features and targets) ready for learning
│
├── notebooks           # Jupyter Notebooks (mostly for data processing and visualization)
├── src
│   ├── data            # data processing classes, functions, and scripts
│   ├── evaluations     # evaluation classes and functions (metrics, visualization, etc.)
│   ├── experiments     # experiment configuration files
│   ├── modules         # activations, layers, modules, and networks (subclass of torch.nn.Module)
│   └── utilities       # other useful functions and classes
├── tests               # unit tests module for ./src
│
├── docs                # documentation files (*.txt, *.doc, *.jpeg, etc.)
├── logs                # logs for deep learning experiments
└── models              # saved models with optimizer states
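If you start from an empty directory rather than a fork, the directory skeleton above can be recreated with a short script. This is a sketch only: the `scaffold` function and the `LAYOUT` constant are illustrative names, not part of the template, and the script creates directories only (no makefile, configuration files, or license).

```python
from pathlib import Path
from typing import List

# Directories copied from the layout shown above.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "src/data",
    "src/evaluations",
    "src/experiments",
    "src/modules",
    "src/utilities",
    "tests",
    "docs",
    "logs",
    "models",
]


def scaffold(root: Path) -> List[Path]:
    """Create the template directories under `root` and return their paths."""
    created = []
    for relative in LAYOUT:
        path = root / relative
        # parents=True builds intermediate directories such as `data/`;
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold(Path("project_name"))` would build the tree under `project_name` in the current working directory; existing directories are left untouched.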

Extra Packages

Data Analysis, Augmentation, Validation and Cleaning

  • Great Expectations: data validation, documenting, and profiling
  • Cerberus: lightweight data validation functionality
  • PyJanitor: Pandas extension for data cleaning
  • PyDQC: automatic data quality checking
  • Feature-engine: transformer library for feature preparation and engineering
  • pydantic: data parsing and validation using Python type hints
  • Dora: exploratory data analysis toolkit for Python
  • datacleaner: automatically cleans data sets and readies them for analysis
  • whale: a lightweight data discovery, documentation, and quality engine for data warehouse
  • bamboolib: a tool for fast and easy data exploration & transformation of pandas DataFrames
  • pandas-summary: an extension to pandas dataframes describe function
  • AugLy: a data augmentations library for audio, image, text, and video

Performance and Caching

  • Numba: JIT compiler that translates Python and NumPy to fast machine code
  • CuPy: NumPy-like API accelerated with CUDA
  • Dask: parallel computing library
  • Ray: framework for distributed applications
  • Modin: parallelized Pandas with Dask or Ray
  • Vaex: lazy memory-mapping dataframe for big data
  • Joblib: disk-caching and parallelization
  • RAPIDS: GPU acceleration for data science
  • Polars: a blazingly fast DataFrames library implemented in Rust & Python

Data Version Control and Workflow

  • DVC: data version control system
  • Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
  • d6tflow: effective data workflow
  • Metaflow: end-to-end independent workflow
  • Dolt: relational database with version control
  • Airflow: platform to programmatically author, schedule and monitor workflows
  • Luigi: dependency resolution, workflow management, visualization, etc.

Visualization and Presentation

  • Seaborn: data visualization based on Matplotlib
  • HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
  • Plotly.py: interactive browser-based graphing library
  • Altair: declarative visualization based on Vega and Vega-Lite
  • TabPy: Tableau visualizations with Python
  • Chartify: easy and flexible charts
  • Pandas-Profiling: HTML profiling reports for Pandas DataFrames
  • missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
  • Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
  • FlashTorch: visualization toolkit for neural networks in PyTorch
  • Streamlit: turn data scripts into sharable web apps in minutes
  • python-tabulate: pretty-print tabular data in Python, a library and a command-line utility
  • Lux: Python API for intelligent visual data discovery
  • bokeh: interactive data visualization in the browser, from Python

Project Lifecycles and Hyperparameter Optimization

  • NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
  • Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
  • MLflow: platform for ML lifecycle, including experimentation, reproducibility and deployment
  • Optuna: automatic hyperparameter optimization framework
  • Hyperopt: serial and parallel optimization
  • Tune: scalable experiment execution and hyperparameter tuning
  • Determined: deep learning training platform
  • Aim: a super-easy way to record, search and compare 1000s of ML training runs
  • TPOT: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming

Distribution, Pipelining, and Sharding

  • torchgpipe: a scalable pipeline parallelism library, which allows efficient training of large, memory-consuming models
  • PipeDream: generalized pipeline parallelism for deep neural network training
  • DeepSpeed: a deep learning optimization library that makes distributed training easy, efficient, and effective
  • Horovod: a distributed deep learning training framework
  • RaySGD: lightweight wrappers for distributed deep learning
  • AdaptDL: a resource-adaptive deep learning training and scheduling framework

Other PyTorch Extensions

  • Ignite: high-level library based on PyTorch
  • PyTorch Lightning: lightweight wrapper for less boilerplate
  • fastai: out-of-the-box tools and models for vision, text, and other data
  • Skorch: Scikit-Learn interface for PyTorch models
  • Pyro: deep universal probabilistic programming with PyTorch
  • Kornia: differentiable computer vision library
  • DGL: package for deep learning on graphs
  • PyTorch Geometric: geometric deep learning extension library for PyTorch
  • PyTorch-BigGraph: a distributed system for learning graph embeddings for large graphs
  • Torchmeta: datasets and models for few-shot-learning/meta-learning
  • PyTorch3D: library for deep learning with 3D data
  • learn2learn: meta-learning model implementations
  • higher: higher-order (unrolled first-order) optimization
  • Captum: model interpretability and understanding
  • PyTorch summary: Keras style summary for PyTorch models
  • Catalyst: PyTorch framework for Deep Learning research and development
  • Poutyne: a simplified framework for PyTorch that handles much of the boilerplate code needed to train neural networks

Miscellaneous

  • Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
  • DoWhy: causal inference combining causal graphical models and potential outcomes
  • CausalML: a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research
  • NetworkX: creation, manipulation, and study of complex networks/graphs
  • Gym: toolkit for developing and comparing reinforcement learning algorithms
  • Polygames: a platform of zero learning with a library of games
  • Mlxtend: extensions and helper modules for data analysis and machine learning
  • NLTK: a leading platform for building Python programs to work with human language data
  • PyCaret: low-code machine learning library
  • dabl: baseline library for data analysis
  • OGB: benchmark datasets, data loaders and evaluators for graph machine learning
  • AI Explainability 360: a toolkit for interpretability and explainability of datasets and machine learning models
  • SDV: synthetic data generation for tabular, relational, time series data
  • SHAP: game theoretic approach to explain the output of any machine learning model
  • TextBlob: a Python (2 and 3) library for processing textual data

Resources

Datasets:

  • Google Datasets: high-demand public datasets
  • Google Dataset Search: a search engine for freely-available online data
  • OpenML: online platform for sharing data, ML algorithms and experiments
  • DoltHub: data collaboration with Dolt
  • OpenBlender: live-streamed open data sources
  • Data Portal: a comprehensive list of open data portals from around the world
  • Activeloop: unstructured dataset management for TensorFlow/PyTorch

Libraries:

Readings:

Other ML/DL Templates:

Authors

  • Xiaotian Duan (Email: xduan7 at gmail.com)

License

This project is licensed under the MIT License - see the LICENSE.md file for more details.