Thrust 1.16.0 #1616

alliepiper (Maintainer) · 2022-02-08T19:35:04Z

Summary

Thrust 1.16.0 provides a new “nosync” hint for the CUDA backend, as well as numerous bugfixes and stability improvements.

New `thrust::cuda::par_nosync` Execution Policy

Most of Thrust’s parallel algorithms are fully synchronous and will block the calling CPU thread until all work is completed. This design avoids many pitfalls associated with asynchronous GPU programming, resulting in simpler and less error-prone usage for new CUDA developers. Unfortunately, this improvement in user experience comes at a performance cost that often frustrates more experienced CUDA programmers.

Prior to this release, the only synchronous-to-asynchronous migration path for existing Thrust codebases involved significant refactoring, replacing calls to Thrust algorithms with a limited set of future-based thrust::async algorithms or lower-level CUB kernels. The new thrust::cuda::par_nosync execution policy provides a new, less invasive entry point for asynchronous computation.

par_nosync is a hint to the Thrust execution engine that any non-essential internal synchronizations should be skipped and that an explicit synchronization will be performed by the caller before accessing results.

While some Thrust algorithms require internal synchronization to safely compute their results, many do not. For example, multiple thrust::for_each invocations can be launched without waiting for earlier calls to complete:

// Queue three `for_each` kernels:
thrust::for_each(thrust::cuda::par_nosync, vec1.begin(), vec1.end(), Op{});
thrust::for_each(thrust::cuda::par_nosync, vec2.begin(), vec2.end(), Op{});
thrust::for_each(thrust::cuda::par_nosync, vec3.begin(), vec3.end(), Op{});

// Do other work while kernels execute:
do_something();

// Must explicitly synchronize before accessing `for_each` results:
cudaDeviceSynchronize();

Thanks to @fkallen for this contribution.

Deprecation Notices

CUDA Dynamic Parallelism Support

A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP).

This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launch a kernel; such calls will execute sequentially on the calling GPU thread instead of launching a device-wide kernel.
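To make the affected case concrete, here is a minimal sketch of a Thrust algorithm called from device code. The kernel name `row_sums` and the data layout are hypothetical. The sketch uses `thrust::seq`, which always runs on the calling thread; that is the sequential behavior this notice describes for device-side calls once CDP support is removed.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>

// Hypothetical kernel: each GPU thread reduces one row of a row-major matrix.
__global__ void row_sums(const int* data, int* sums, int rows, int cols)
{
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < rows)
  {
    const int* first = data + row * cols;
    // thrust::seq executes sequentially on the calling GPU thread
    // and never launches a child kernel.
    sums[row] = thrust::reduce(thrust::seq, first, first + cols);
  }
}

int main()
{
  const int rows = 4;
  const int cols = 8;
  thrust::device_vector<int> data(rows * cols, 1);
  thrust::device_vector<int> sums(rows, 0);

  row_sums<<<1, rows>>>(thrust::raw_pointer_cast(data.data()),
                        thrust::raw_pointer_cast(sums.data()),
                        rows, cols);
  cudaDeviceSynchronize();

  // Each row holds `cols` ones, so every row sum should equal `cols`.
  return sums[0] == cols ? 0 : 1;
}
```

Writing `thrust::seq` explicitly in device code is one way to keep behavior the same before and after the removal.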

Breaking Changes

Thrust 1.14.0 included a change that aliased the cub namespace to thrust::cub. This has caused ambiguous-namespace errors in projects that declare using namespace thrust; at global scope. We recommend against this practice.
Reduce header bloat #1572: Removed several unnecessary header includes. Downstream projects may need to update their includes if they were relying on this behavior.

New Features

Add execution policy thrust::cuda::par_nosync #1568: Add thrust::cuda::par_nosync policy. Thanks to @fkallen for this contribution.

Enhancements

Use CUB version of merge sort #1511: Use CUB’s new DeviceMergeSort API and remove Thrust’s internal implementation.
Updated thrust shuffle to use improved bijective function #1566: Improved performance of thrust::shuffle. Thanks to @djns99 for this contribution.
Support user defined CMAKE_INSTALL_INCLUDEDIR values #1584: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.

Bug Fixes

Fix some minor issues impacting the Intel compiler. #1496: Fix some issues affecting icc builds.
Fix some min/max macro collisions with windows.h #1552: Fix some collisions with the min/max macros defined in windows.h.
Fix 32-bit MSVC builds. #1582: Fix issue with function type alias on 32-bit MSVC builds.
Workaround nvcxx compiler error. #1591: Work around an issue affecting compilation with nvc++.
Add small check for header tests #1597: Fix some collisions with the small macro defined in windows.h.
Fix version checks in CMake packages. #1599, Ensure that the same version of CUB is found. #1603: Fix some issues with version handling in Thrust’s CMake packages.
Clarify scan non-determinism in the documentation #1614: Clarify that scan algorithm results are non-deterministic for pseudo-associative operators (e.g. floating-point addition).
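The non-determinism noted for scan comes from floating-point addition not being associative: regrouping the same operands can change the rounded result, and a parallel scan may group operands differently from a sequential one. A minimal sketch in plain C++ (no Thrust required; the function names are hypothetical):

```cpp
// Add three floats, grouping the first two operands together.
float sum_left_grouped(float a, float b, float c)
{
  return (a + b) + c;
}

// Add the same three floats, grouping the last two operands together.
float sum_right_grouped(float a, float b, float c)
{
  return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, and c = 1.0f, the left grouping returns 1.0f, while the right grouping returns 0.0f because 1.0f is rounded away when added to -1e20f.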

This discussion was created from the release Thrust 1.16.0.

neoblizz · 2022-02-11T06:18:35Z

> par_nosync is a hint to the Thrust execution engine that any non-essential internal synchronizations should be skipped and that an explicit synchronization will be performed by the caller before accessing results.

Great pre-release, thank you!

Is there a way to pass a stream to thrust::cuda::par_nosync, much like you can with thrust::cuda::par.on(stream)? The idea is to launch kernels asynchronously on different streams from different CPU threads in parallel, and then have the caller join them by synchronizing only the streams involved, rather than the whole device (as done in the for_each example above).

2 replies

alliepiper Feb 11, 2022
Maintainer Author

Absolutely -- I just omitted the streams for brevity in the release notes 🙂

par_nosync works with streams just like the regular CUDA exec policies, see this example.
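For readers without the link, here is a minimal sketch of the stream-based usage discussed in this thread; it is not the linked example. It assumes `par_nosync` accepts `.on(stream)` in the same way as `thrust::cuda::par`, as stated above. The `Increment` functor and the vector sizes are hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/system/cuda/execution_policy.h>

#include <cuda_runtime.h>

// Hypothetical functor used only for illustration.
struct Increment
{
  __host__ __device__ void operator()(int& x) const { x += 1; }
};

int main()
{
  thrust::device_vector<int> vec1(1 << 20, 0);
  thrust::device_vector<int> vec2(1 << 20, 0);

  cudaStream_t s1;
  cudaStream_t s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Queue one `for_each` per stream; with par_nosync these calls may return
  // before the kernels have finished.
  thrust::for_each(thrust::cuda::par_nosync.on(s1), vec1.begin(), vec1.end(), Increment{});
  thrust::for_each(thrust::cuda::par_nosync.on(s2), vec2.begin(), vec2.end(), Increment{});

  // Synchronize only the streams involved, not the whole device.
  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);

  // Results are safe to read after the explicit synchronization.
  const bool ok = (vec1[0] == 1) && (vec2[0] == 1);

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  return ok ? 0 : 1;
}
```

The same pattern should carry over to the multi-threaded case in the question: each CPU thread launches on its own stream, and whichever thread joins the work synchronizes only those streams.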

neoblizz Feb 11, 2022

Awesome! 😊 I think this feature resolves https://github.com/NVIDIA/thrust/issues/1416.
