Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable multiple resolutions - first step #151

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

lukasm91
Copy link
Collaborator

@lukasm91 lukasm91 commented Sep 16, 2024

This PR is a first step towards enabling running multiple resolutions (required for transi).

  • properly implementing trans_end/trans_release.
  • remove all the flattened arrays (in a first step replace them with ASSOCIATE to make changes less intrusive, we can clean the associates up in a second step)
  • propagate ncur_resol to fft and gemms to be able to cache multiple plans
  • clean up graphs/plan caches for FFT and gemms

We have to revert the ectrans-benchmark program before merge.

@lukasm91 lukasm91 marked this pull request as draft September 16, 2024 13:41
@lukasm91
Copy link
Collaborator Author

Present table in my sample program (currently "ectrans-benchmark", to be removed again before merge) after trans_release and before trans_end:

host:0x73d2840 device:0x331a00000 size:1180032 presentcount:0+1 line:46 name:alloc%ptr(:)
allocated block device:0x331a00000 size:1433600 thread:1

and after trans_end
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 9.0, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
...empty...

@lukasm91 lukasm91 changed the title WIP: Enable multiple resolutions - first step Enable multiple resolutions - first step Sep 24, 2024
@lukasm91 lukasm91 marked this pull request as ready for review September 24, 2024 14:42
@samhatfield samhatfield added enhancement New feature or request gpu labels Oct 1, 2024
@samhatfield
Copy link
Collaborator

Nice work @lukasm91! This will probably take a while to review properly. In the mean time, has something gone wrong with a rebase? I see duplicated commits here from develop.

@lukasm91
Copy link
Collaborator Author

lukasm91 commented Oct 3, 2024

I think it is because I was merging with master. Do you prefer rebases? If yes, I might restructure the commits into three new commits or so. And yes, there are conflicts after the merge today, but I can easily resolve those.

@samhatfield
Copy link
Collaborator

Personally I have no preference of rebasing over merging, as long as commits aren't duplicated. It looks like that's the case, but perhaps I just don't understand how Github presents merging.

@lukasm91
Copy link
Collaborator Author

lukasm91 commented Oct 3, 2024

Good question, I am not sure what it was showing; maybe it was because I was starting from Willems branch (which was merged in the meantime), and then I merged develop later. In any case, I quickly restructured the commits into three commits (from which the last commit we will remove before merge; this is just for show-casing the usage). It is still correct that Willems transi branch can just be merged into this and it will work (except vordivtouv/uvtovordiv, which will we have to fix separately).

@wdeconinck
Copy link
Collaborator

I think it is because I was merging with master. Do you prefer rebases? If yes, I might restructure the commits into three new commits or so. And yes, there are conflicts after the merge today, but I can easily resolve those.

I do prefer rebases yes, and a cleaned up history :)

@lukasm91 lukasm91 force-pushed the cleanup-part1 branch 2 times, most recently from eb466a9 to 7ed5169 Compare October 14, 2024 08:43
Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge first step! It looks really nicely done!

Question on the ASSOCIATE statements. Are they required for OpenACC to work or is it to introduce minimal changes?

The ectrans-benchmark looks a little bit too inflated to my liking, though I understand the reasoning. Could we not create a smaller test that just tests that the resolutions are properly deallocated or uniquely using their own memory?

src/programs/ectrans-benchmark.F90 Outdated Show resolved Hide resolved
@lukasm91
Copy link
Collaborator Author

This is a huge first step! It looks really nicely done!

Question on the ASSOCIATE statements. Are they required for OpenACC to work or is it to introduce minimal changes?

The ectrans-benchmark looks a little bit too inflated to my liking, though I understand the reasoning. Could we not create a smaller test that just tests that the resolutions are properly deallocated or uniquely using their own memory?

Yes, I will revert the ectrans-benchmark. I just did that to test and show how it could be used. I think we will have to test it properly, through the transi tests, so that will happen. But I wanted to have you a look at how this can now be used.

The ASSOCIATE at this point are really to keep the changes to a minimum. In a second step, we can remove then, and we should then also check if they have a performance impact.

@wdeconinck
Copy link
Collaborator

What would be the performance cost of having arrays like ZAA in JPRD only, and reusable for single and double precision?
Then TPM_FIELDS_GPU could be in gpu_common.
I want to gauge how much of the memory can be reused (and avoided) between all CPU/GPU/single/double precision backends for the same grid-and-truncation.

@lukasm91
Copy link
Collaborator Author

lukasm91 commented Oct 16, 2024

The GEMMs can become very expensive, in terms of runtime but also memory footprint.Storage in the order of tenths of GB. I doubt that we want to store them in double precision on CPU, but even less on GPU. I think those arrays have to be stored with as little precision as possible.

@wdeconinck
Copy link
Collaborator

So then opposite approach, could it be in JPRM and still give desired results for double precision?

@lukasm91
Copy link
Collaborator Author

For the legendre coefficients, this could indeed be worth a test. We store one wave number in double precision for a reason, but I don't know if a "mixed precision" gemm would be sufficient in that case, too (Andreas or Sam might know the details). We do the single precision GEMM at a even lower precision, so who knows, it could be sufficient. For sure not something we should do in this PR, but I can only encourage experiments of doing certain things in lower precision.

@wdeconinck
Copy link
Collaborator

wdeconinck commented Oct 16, 2024

I am happy to hear you think it's an avenue worth exploring. That's what I wanted to get to. This is of course out of scope of this PR.
@samhatfield do you think you would have time to explore this avenue?
Certainly it won't be bit-identical for double precision anymore, but much accuracy is lost...?
My guess would be that it is quite a big drop compared to all in double precision, and essentially just be a single-precision transform.

@samhatfield
Copy link
Collaborator

I think the question being asked is: what would be the implications of doing DGEMM for m=0 and SGEMM for m>0 in the dp version of ecTrans, like is already done in the sp version? Is that right?

@wdeconinck
Copy link
Collaborator

wdeconinck commented Oct 29, 2024

https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf section 2.14.1 (p62) is about !$acc init

If CrayFortran does not support this, then probably following guard should be used?
#if defined(_OPENACC) && !defined(_CRAYFTN)

@samhatfield
Copy link
Collaborator

https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf section 2.14.1 (p62) is about !$acc init

If CrayFortran does not support this, then probably following guard should be used? #if defined(_OPENACC) && !defined(_CRAYFTN)

Yeah that should do the trick.

@reuterbal
Copy link
Contributor

https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf section 2.14.1 (p62) is about !$acc init
If CrayFortran does not support this, then probably following guard should be used? #if defined(_OPENACC) && !defined(_CRAYFTN)

Yeah that should do the trick.

According to man intro_openacc, Cray supports acc_init. Would that be an alternative?

@wdeconinck
Copy link
Collaborator

wdeconinck commented Oct 29, 2024

Cray supports acc_init. Would that be an alternative?

If it works, then yes :)

@samhatfield
Copy link
Collaborator

acc_init must take a device type argument. Can I just do

call acc_init(acc_get_device_type())

?

[ 97%] Building Fortran object ectrans/src/programs/CMakeFiles/ectrans-benchmark-gpu-dp.dir/ectrans-benchmark.F90.o
Global is external, but doesn't have external or weak linkage!
ptr @"str2int$ectrans_benchmark_"
Global is external, but doesn't have external or weak linkage!
ptr @"print_help$ectrans_benchmark_"
LLVM ERROR: Broken module found, compilation aborted!
ftn-2116 ftn: INTERNAL
@lukasm91
Copy link
Collaborator Author

lukasm91 commented Oct 30, 2024

The second error above only occurs when building the GPU version, and can be resolved by moving declarations of str2int and print_help above their first usage. Weird...

This sounds weird that it only shows up now, but we ICE can be caused as surprizing side effects ... I moved the code now. Same in the other program benchmark?

@lukasm91
Copy link
Collaborator Author

Following changes have been done now:

  • The suggestions are applied
  • !$acc init => acc_init is done (I am checking if this is equivalent but I am almost sure it is)
  • I went through the files @samhatfield suggested and added the D/R/... as I have seen usage, this needs another test of course.

One general suggestion: If you have more comments, even if they apply to multiple places, can you just add a comment in one case, then I can reply one-by-one and we have threads to discuss. This might be easier.

@samhatfield
Copy link
Collaborator

Just a couple more tweaks. PR #169 has slightly screwed up this branch - sorry about that @lukasm91. You'll have to rebase against develop again.

@samhatfield
Copy link
Collaborator

Unfortunately this branch (with the tweaks I just suggested) doesn't run on MI250x (LUMI-G) so we'll have to rectify that before we can merge this. Firstly I see many instances of this warning:

ACC: libcrayacc/acc_datatype.c:1799 CRAY_ACC_WARNING - Asynchronous transfer not supported for deep transfers - falling back to a synchronous transfer

and eventually the benchmark program crashes with HIPFFT_PARSE_ERROR. My suspicion is that the Cray compiler still doesn't like it when you put derived types in DATA statements. We encountered this before. Of course this branch does away with flattened derived types so it's natural this issue would come up again. I'll have a think if there's a minimally invasive workaround we can implement for Cray/AMD.

@lukasm91
Copy link
Collaborator Author

I will rebase, yes. But regarding the problem with Cray Compilers - isn't this just a warning, could we put ASYNC into ifdefs for the problematic instances? I can't reallz help with the parse error, but I don't think this can be related to derived types? Have you checked if it is actually using the not inplace FFT, because I think this caused the same error?

@samhatfield
Copy link
Collaborator

Yeah, I'll take some time this week to look closer and see what's going on.

@samhatfield
Copy link
Collaborator

And LUMI just announced a 4-day service break :)

@samhatfield samhatfield reopened this Nov 18, 2024
wdeconinck added a commit to wdeconinck/ectrans that referenced this pull request Nov 19, 2024
@samhatfield
Copy link
Collaborator

Sorry, I'm not making any progress understanding the issue on Cray systems. I'm using Frontier to debug this. If I build ecTrans and FIAT "manually" through CMake invocations etc., everything seems to work. If I build both using an ecBundle-based system, as I usually do, I get the warning mentioned above and either this error message:

ACC: libcrayacc/acc_present.c:868 CRAY_ACC_ERROR - Error placing variable 'f' (0x1a5bf80 - 0x1a5c480) from ectrans/src/trans/gpu/generated/ectrans_gpu_sp/external/setup_trans.F90:525 on the device. Another instance of the address range (0x1a5bf80 - 0x1a5c480) is already present.

OR it does "work" but gives spectral error norms of ~ 1.0, which looks like at some point the spectral fields are just being zeroed. I've seen both these behaviours while debugging but I don't understand fully the conditions under which each one occurs. Annoying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request gpu
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants