Matrix muliplication is a good first example of code optimization for three reasons:
- It is ubiquitous
- It looks trivial
- The naive approach is orders of magnitude slower than the tuned version
Part of the point of this exercise is that it often makes sense to build on existing high-performance libraries when they are available.
The rest of this README describes the reference code setup.
The main files are:
README.md
: This fileMakefile
: The build rulesMakefile.in.*
: Platform-specific flag and library specs used in the Makefiledgemm_*
: Different modules implementing thesquare_dgemm
routinefdgemm.f
: Reference Fortrandgemm
from Netlib (c.f.dgemm_f2c
)matmul.c
: Driver script for testing and timingsquare_dgemm
versionsplotter.py
: Python script for drawing performance plotsrunner.sh
: Helper script for running the timer on the instructional nodes
You will probably mostly be looking at Makefile.in
and dgemm_*.c
. Note that "dgemm" stands for "Double Precision GEneral Matrix Multiply".
I have built the reference code with GCC and CLang on GCP.
You can switch between these options by adding PLATFORM=clang
(for
example) to your make
command, or by changing the PLATFORM=gcc
line at the top of the Makefile. For example, to build all the
drivers on my laptop, I run
make PLATFORM=mac
from the terminal. If someone feels like adding an autoconf or CMake script for these configurations, I would welcome it!
For those who aren't familiar with the Makefile system and would like an overview, please consult these two links: tutorial more in-depth tutorial
We are going to install several packages to support this code. For development,
I recommend the e2-micro
instance type that we considered in the GCP
walkthrough on 9/10 (with Debian Bookwork as the recommended operating
system). For timing, we will use the e2-highcpu-2
configuration, but you
do not need to use this for regular development.
With this setup, you will need the following packages:
build-essential
: GCC, Make, and various other basic tools for working with compiled codes.pkg-config
: Used to semi-automatically figure out where to look for various packages installed in the system (like the BLAS)gfortran
: You are going to want a Fortran compiler to be able to test out the Fortran version of the DGEMM. You may write your optimized code in Fortran using this type of interface, if you want, but it is not the default.clang
: The CLang compilerlibomp-dev
: Support library for OpenMP with CLanggit
: So you can fetch this repositorylibopenblas-dev
: The OpenBLAS library is a fast BLAS implementation for modern Intel processors.- The Python stack:
python3-numpy
,python3-scipy
,python3-pandas
, andpython3-matplotlib
I recommend also installing the following packages (which you may have already done):
llvm
: To get access to thellvm-mca
toolgoogle-perftools
: The Google performance tools for profiling
Once you have installed the compiler and the OpenBlas libraries,
building with GCC is as simple as typing make
(or make PLATFORM=gcc
). But it is worth poking through the Makefile.
Some of you use a Mac for development. I have included mac-gcc
and mac-clang
for you to use if you want, but you have to have
things installed right first. Note that you are in no way obliged to
use these things; I provide it solely for your own edification.
By default, the gcc
program in MacOS is not GCC at all; rather, it
is an alias for Clang. The driver code (matmul.c
) uses the OpenMP
omp_get_wtime
routine for timing; but, the Apple version of Clang
looks like it does not support OpenMP. At least, I thought there was
no OpenMP support until recently! Then I read this
article and
learned better. So assuming you have Homebrew installed, you can
build with make PLATFORM=mac-clang
provided that you first run the
line
brew install libomp
If you want to use the "real" GCC, make sure you do
brew install gcc gfortran
and then you can build with make PLATFORM=mac-gcc
.
For GCP, the Makefile is configured to link against OpenBLAS, a high-performance open-source BLAS library based on the Goto BLAS (as an aside, there is an excellent NYTimes article about the history behind Goto BLAS)
On OS X, the Makefile is configured to link against the Accelerate
framework with the veclib
tag.
The reference implementation includes three files for calling a
Fortran dgemm
from the C driver:
dgemm_f2c.f
: An interface file for converting from C to Fortran conventionsdgemm_f2c_desc.c
: A stub file defining the globaldgemm_desc
variable used by the driverfdgemm.f
: The Fortrandgemm
routine taken from the Netlib reference BLAS
The "old-school" way of mixing Fortran and C involve figuring out how
types map between the two languages and how the compiler does name
mangling. It's feasible, but a pain to maintain. If you want, you
can find many examples of this approach to mixed language programming
online, including in the sources for old versions of this assignment.
But it has now been over a decade since the Fortran 2003 standard,
which includes explicit support for C-Fortran interoperability. So
we're going this route, using the iso_c_binding
module in
dgemm_f2c.f
to wrap a Fortran implementation of dgemm
from the
reference BLAS.
Apart from knowing how to create a "glue" layer with something like
iso_c_binding
, one of the main things to understand if trying to mix
Fortran and C++ is that Fortran has its own set of required support
libraries and its own way of handling the main
routine. If you want
to link C with Fortran, the easiest way to do so is usually to compile
the individual modules with C or Fortran compilers as appropriate, then
use the Fortran compiler for linking everything together.
The code writes a sequence of timings to a CSV (comma-separated value) text file that can be loaded into a spreadsheet or array for further processing. By default, the name of the CSV file is based on the executable name; for example, running
./matmul-blocked
with no arguments produces the output file timing-blocked.csv
. You can
also provide the file name as an argument, i.e.
./matmul-blocked timing-blocked.csv
To run all the timers, you will probably want to use
make run
To clear up some of your workspace, use
make clean
which will remove executables and compute node logs (but not your code or benchmarks). To remove everything except for code, use
make realclean
You can produce timing plots by running
make plot
The plotter assumes that all the relevant CSV files are already in place.
You can also directly use the plotter.py
script. The plotter.py
script loads a batch of timings and turns them into a plot which is
saved to timing.pdf
. For example, running
./plotter.py basic blocked blas
or
python plotter.py basic blocked blas
will compare the contents of timing-basic.csv
, timing-blocked.csv
,
and timing-blas.csv
.
Please refer to these notes to get started. The notes discuss blocking, buffering, SSE instructions, and auto-tuning, among other optimizations. The Roofline Paper is also worth looking at.