@floop @reduce produces correct results for SequentialEx(), incorrect for ThreadedEx() and CUDAEx(). #58

Open
stillyslalom opened this issue Dec 20, 2020 · 0 comments

I've been investigating the Folds ecosystem while trying to find an efficient way to get the values and indices of the two largest elements in a GPU array. Base.foldl works like a charm on the CPU:

function findmax2(prev, curr)
    i, v = curr
    v > prev.first.v && return (first=(; i, v), second=prev.first)
    v > prev.second.v && return (first=prev.first, second=(; i, v))
    return prev
end

julia> A = zeros(Float32, 512, 512);

julia> A[1] = 1.0f0; A[end] = 0.5f0;

julia> foldl(findmax2, enumerate(A), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
(first = (i = 1, v = 1.0f0), second = (i = 262144, v = 0.5f0))

It's a bit trickier with FLoops.jl. While working toward a GPU version, I came up with an implementation that gives the right answer only with the sequential executor. I suspect the same underlying problem makes it fail with both the multithreaded CPU and GPU executors.

using CUDA 
using Transducers, FoldsCUDA, FLoops

function folds_findmax2(xs, ex = xs isa CuArray ? CUDAEx() : ThreadedEx())
    xtypemin = typemin(eltype(xs))
    @floop ex for (i, x) in zip(eachindex(xs), vec(xs))
        @reduce() do (xmax1=xtypemin; x), (imax1=-1; i)
            if isless(xmax1, x)
                imax2, xmax2 = imax1, xmax1
                imax1, xmax1 = i, x
            end
        end
        i == imax1 && continue
        @reduce() do (xmax2=xtypemin; x), (imax2=-1; i)
            if isless(xmax2, x)
                imax2, xmax2 = i, x
            end
        end
    end
    return ((imax1, xmax1), (imax2, xmax2))
end

julia> folds_findmax2(A, SequentialEx())
((1, 1.0f0), (262144, 0.5f0)) # correct!

julia> folds_findmax2(A, ThreadedEx())
((1, 1.0f0), (2, 0.0f0)) # wrong

julia> dA = CuArray(A);

julia> folds_findmax2(dA)
((1, 1.0f0), (2, 0.0f0)) # wrong
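
For comparison, here is a minimal Base-only sketch of the same top-two search written as a pairwise merge of partial results, which is the shape a parallel reduction generally needs (each chunk produces its own state, and states are then combined). The names `lift` and `merge_top2` are hypothetical helpers, not part of FLoops, Transducers, or FoldsCUDA, and ties may resolve to a different index than in the `foldl` version above. It is meant as a reference to check results against, not as a diagnosis of the behavior reported here.

```julia
# Hypothetical helpers (not from any package); only Base is used.

# Turn one (index, value) pair into a top-2 state.
# The second slot starts as a sentinel: index 0 and the smallest value of the type.
lift(i, v) = (first=(; i, v), second=(i=0, v=typemin(typeof(v))))

# Merge two top-2 states into the top-2 state of their union.
function merge_top2(a, b)
    if isless(a.first.v, b.first.v)
        a, b = b, a                      # make `a` the state holding the larger maximum
    end
    # The runner-up is either a's runner-up or b's maximum
    # (b's runner-up can never beat b's maximum).
    second = isless(a.second.v, b.first.v) ? b.first : a.second
    return (first=a.first, second=second)
end

A = zeros(Float32, 512, 512)
A[1] = 1.0f0
A[end] = 0.5f0

# Sequential reference run over (linear index, value) pairs.
result = mapreduce(((i, v),) -> lift(i, v), merge_top2, enumerate(A))
```

Because `merge_top2` takes two complete states rather than a state and a single element, the partial results can be grouped in any order, for example by reducing each half of the array separately and merging the two halves afterwards.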

This is on Julia 1.5.2 with Transducers, FLoops, and FoldsCUDA on the latest master.
