-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added NPB kernels: CG, EP, FT, IS, MG #13
base: master
Are you sure you want to change the base?
Conversation
NPB/config/make.def
Outdated
#--------------------------------------------------------------------------- | ||
# This is the C compiler used for OpenMP programs | ||
#--------------------------------------------------------------------------- | ||
CC = mpicxx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you're not using dash-mpicxx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, changed that.
- rm -r bin/* | ||
|
||
veryclean: clean | ||
- rm config/make.def config/suite.def Part* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you delete a file that is in git?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The veryclean
option was always part of the NPB I think, for people who want to delete even the config files... I don't use it, if you want, we can delete it and have only clean
and cleanall
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! Thanks for that effort, that had been on my list. Just two comments I made a second ago.
|
||
for (int unit_idx = 0; unit_idx < dash::size(); ++unit_idx) { | ||
local_sizes.push_back(my_elem_count[unit_idx]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
NPB/CG/cg.cpp
Outdated
//now setup a and colidx | ||
|
||
typedef dash::CSRPattern<1> pattern_t; | ||
typedef int index_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
NPB/CG/cg.cpp
Outdated
|
||
if (0 == dash::myid()) { | ||
|
||
printf("\n\n NAS Parallel Benchmarks 4.0 C++ DASH version"" - CG Benchmark\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation (we typically use 2 spaces)
NPB/CG/cg.cpp
Outdated
//now setup a and colidx | ||
|
||
typedef dash::CSRPattern<1> pattern_t; | ||
typedef int index_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please typedef
the pattern (and also the dash::Matrix
using it) somewhere centrally so that it is easy to change?
Also: is it guranteed that the matrix will never have more than 2G elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and instead of "typedef" i would prefer "using"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to using
. We are storing not the whole matrix, only the nonzero elements which are about 36M in Class C, which is the largest problem size implemented here.
NPB/FT/ft.cpp
Outdated
} | ||
} | ||
|
||
if(0 == dash::myid()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
NPB/FT/ft.cpp
Outdated
// } | ||
int offset = k*NX*NY+(myoffset+j)*NX; | ||
dash::copy(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii); | ||
// futs_w[k] = dash::copy_async(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use copy_async
? I'd guess that allows multiple concurrent data streams, which should be faster...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my guess as well, but on my computer at home, it didn't help, that's why i commented it out. This can be changed by whoever uses these benchmarks.
NPB/MG/mg.cpp
Outdated
fut_bot = dash::copy_async(r.begin()+n2*n1*bottomcoord, r.begin()+n2*n1*(bottomcoord+1), &bottomplane[0]); | ||
} | ||
|
||
for(int i3 = 1; i3 < z_ext-1; i3++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need performance .local(....) is may be not the best solution. Here you might benefit of plain pointers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I would say as well this is a topic which is left to be investigated when someone is using these benchmarks.
NPB/MG/mg.cpp
Outdated
if ( z(i3,0,0).is_local() ) { | ||
z_local[j3-start] = z.lbegin()+z_size*(i3-z_offset); | ||
is_local[j3-start] = true; | ||
futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this call to copy_async
copies zero elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data()); | |
futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*(i3+1), z_local_v[j3-start].data()); |
NPB/MG/mg.cpp
Outdated
} | ||
if ( z(i3+1,0,0).is_local() ) { | ||
z_local[j3-start+1] = z.lbegin()+z_size*(i3+1-z_offset); | ||
futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this call to copy_async
copies zero elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data()); | |
futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+2), z_local_v[j3-start+1].data()); |
NPB/MG/mg.cpp
Outdated
if ( r(i3,0,0).is_local() ) { | ||
r_local[r_idx] = r.lbegin()+r_psize*(i3-r_offset); | ||
is_local[r_idx] = true; | ||
futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data()); | |
futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*(i3+1), r_local_v[r_idx].data()); |
NPB/MG/mg.cpp
Outdated
if ( r(i3+1,0,0).is_local() ) { | ||
r_local[r_idx] = r.lbegin()+r_psize*(i3+1-r_offset); | ||
is_local[r_idx] = true; | ||
futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data()); | |
futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+2), r_local_v[r_idx].data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually the work-around for setting the futures to true without having to do any calculations. Unfortunately, if we leave them out and just use the is_local[], it would give an even more ugly looking code when those elements are accessed. We couldn't find a better way to do it without working in DART or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually the work-around for setting the futures to true without having to do any calculations.
I'm not sure I understand your point. Why do you need to a future in the first place then if there is no transfer happening? Such a quirk should at least be documented, preferably removed altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked it over and yes, you are right. I omitted the 0 element calls.
After browsing through some of the code, I have another major concern: some codes allocate global memory in the critical path, e.g., every timestep. Global memory allocation is costly (orders of magnitude slower than The reason I am concerned is this: if these benchmarks end up in the repo someone will eventually grab them and use them to compare their approach to DASH. They will not make an attempt to investigate why the performance of DASH seemingly sucks. We should be careful with putting out benchmarks where we cannot show that we are at least in the same ballpark as MPI. This would come back to haunt us... |
No, I did not test this on a real HPC system. Unfortunately, I'm working on different projects now and I don't have the time to test and work out the new global data-structures. If anybody wants to go ahead and do it, you're more than welcome. |
This is an implementation of the five original NPB kernels using DASH. We use several aspects of DASH, including DASH algorithms, CSR patterns and async_copy.