- ASPLOS'17-Locality-Aware CTA Clustering for Modern GPUs
- ASPLOS'17-Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
- HPCA'17-Dynamic GPGPU Power Management Using Adaptive Model Predictive Control
- ISCA'16-Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems
- HPCA'17-Controlled Kernel Launch for Dynamic Parallelism in GPUs
- ISCA'16-LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
- ISCA'16-Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
- Berkeley TechRpts'16-Understanding Latency Hiding on GPUs
- GTC'17-Cooperative Groups
- ISCA'16-APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
- SC'15-Adaptive and Transparent Cache Bypassing for GPUs
- HPCA'17-Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures
- ASPLOS'14-Paraprox: Pattern-Based Approximation for Data Parallel Applications
- GTC'18-Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- PLDI'18-GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
- CGO'18-CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
- CCGRID'18-Exposing Hidden Performance Opportunities in High Performance GPU Applications
- Euro-Par'15-Identifying Optimization Opportunities Within Kernel Execution in GPU Codes
- SC'13-Effective sampling-driven performance tools for GPU-accelerated supercomputers
- ISPASS'12-Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
- ICPP'11-Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
- ISPASS'10-Demystifying GPU Microarchitecture through Microbenchmarking
- ISPASS'10-Visualizing Complex Dynamics in Many-Core Accelerator Architectures
- ISPASS'09-Analyzing CUDA Workloads Using a Detailed GPU Simulator
- Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
- Monitoring Heterogeneous Applications with the OpenMP Tools Interface
- ECP'19-Performance Tuning of Scientific Codes with the Roofline Model
- GTC'18-Volta Architecture and Performance Optimization
- SC'10-Fundamental Optimizations
- Vampir|Score-P
- TAU
- PAPI
- Allinea MAP
- Open|SpeedShop
- HPCToolkit
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
- LLVM'17-Implementing implicit OpenMP data sharing on GPUs
- CGO'16-gpucc: An Open-Source GPGPU Compiler
- LLVM'16-Offloading Support for OpenMP in Clang and LLVM
- PMBS'15-Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application
- LLVM'15-Integrating GPU Support for OpenMP Offloading Directives into Clang
- LLVM'14-Coordinating GPU Threads for OpenMP 4.0 in LLVM
- Ampere-NVIDIA A100 Tensor Core GPU Architecture
- Turing-NVIDIA Turing GPU Architecture
- Volta-NVIDIA Tesla V100
- Pascal-NVIDIA Tesla P100
- Kepler-NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
- Fermi-NVIDIA’s Next Generation CUDA Compute Architecture: Fermi
- CUDA Toolkit Documentation