
added NPB kernels: CG, EP, FT, IS, MG #13

Open · wants to merge 6 commits into base: master

Conversation

Mietzsch (Author):

This is an implementation of the five original NPB kernels using DASH. We use several aspects of DASH, including DASH algorithms, CSR patterns, and copy_async.

#---------------------------------------------------------------------------
# This is the C compiler used for OpenMP programs
#---------------------------------------------------------------------------
CC = mpicxx
Member:

Any reason you're not using dash-mpicxx?

Author:

Right, changed that.
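
Presumably the corresponding line in config/make.def then uses the DASH compiler wrapper the reviewer mentions (a sketch of the change, not taken from the actual commit):

CC = dash-mpicxx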

- rm -r bin/*

veryclean: clean
- rm config/make.def config/suite.def Part*
Member:

Why do you delete a file that is in git?

Author:

The veryclean target was always part of the NPB, I think, for people who want to delete even the config files. I don't use it; if you want, we can delete it and keep only clean and cleanall.

devreal (Member) left a comment:

Yay! Thanks for that effort, that had been on my list. Just two comments I made a second ago.


for (int unit_idx = 0; unit_idx < dash::size(); ++unit_idx) {
local_sizes.push_back(my_elem_count[unit_idx]);
}
Member:

indentation

NPB/CG/cg.cpp Outdated
//now setup a and colidx

typedef dash::CSRPattern<1> pattern_t;
typedef int index_t;
Member:

indentation

NPB/CG/cg.cpp Outdated

if (0 == dash::myid()) {

printf("\n\n NAS Parallel Benchmarks 4.0 C++ DASH version"" - CG Benchmark\n");
Member:

indentation (we typically use 2 spaces)

NPB/CG/cg.cpp Outdated
//now setup a and colidx

typedef dash::CSRPattern<1> pattern_t;
typedef int index_t;
Member:

Could you please typedef the pattern (and also the dash::Matrix using it) somewhere centrally so that it is easy to change?

Also: is it guaranteed that the matrix will never have more than 2G elements?

Member:

And instead of "typedef" I would prefer "using".

Author:

I changed it to using. We are not storing the whole matrix, only the nonzero elements, which are about 36M in Class C, the largest problem size implemented here.
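
For reference, a minimal sketch of what such a central alias block could look like; the header name and the element type are assumptions, not part of this PR:

// npb-types.h (hypothetical): change the pattern and matrix type in one place.
#include <libdash.h>

using pattern_t = dash::CSRPattern<1>;
using index_t   = pattern_t::index_type;   // use the pattern's own index type
using value_t   = double;
using matrix_t  = dash::Matrix<value_t, 1, index_t, pattern_t>;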

NPB/FT/ft.cpp Outdated
}
}

if(0 == dash::myid()) {
Member:

indentation

NPB/FT/ft.cpp Outdated
// }
int offset = k*NX*NY+(myoffset+j)*NX;
dash::copy(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii);
// futs_w[k] = dash::copy_async(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii);
Member:

Why not use copy_async? I'd guess that allows multiple concurrent data streams, which should be faster...

Author:

That was my guess as well, but on my computer at home it didn't help; that's why I commented it out. This can be changed by whoever uses these benchmarks.
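
For anyone picking this up later, a minimal sketch of the copy_async variant the reviewer suggests, reusing the names from the snippet above and a hypothetical plane count nz; all transfers are issued first and the futures are waited on afterwards:

// Issue one asynchronous copy per plane, then wait for all of them,
// so the transfers can overlap instead of completing one by one.
using glob_it = decltype(xout.begin());
std::vector<dash::Future<glob_it>> futs_w;
futs_w.reserve(nz);
for (int k = 0; k < nz; ++k) {
  int offset = k*NX*NY + (myoffset + j)*NX;
  futs_w.push_back(
      dash::copy_async(&y0[k][0], &y0[k][fftblock],
                       xout.begin() + offset + ii));
}
for (auto & fut : futs_w) {
  fut.wait();   // every outstanding transfer must finish before xout is reused
}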

NPB/MG/mg.cpp Outdated
fut_bot = dash::copy_async(r.begin()+n2*n1*bottomcoord, r.begin()+n2*n1*(bottomcoord+1), &bottomplane[0]);
}

for(int i3 = 1; i3 < z_ext-1; i3++) {
Member:

If you need performance, .local(...) may not be the best solution. Here you might benefit from plain pointers.

Author:

Here as well, I would say this is a topic left to be investigated when someone uses these benchmarks.
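
A minimal sketch of the plain-pointer variant the reviewer hints at, using names from the snippet above; the local index arithmetic and loop bounds are assumptions, and the stencil body is left out:

// Take the raw pointer to this unit's local block once and index it
// directly, instead of going through element-wise .local(...) accesses.
double * r_loc = r.lbegin();   // plain pointer into local memory
for (int i3 = 1; i3 < z_ext - 1; ++i3) {
  for (int i2 = 0; i2 < n2; ++i2) {
    for (int i1 = 0; i1 < n1; ++i1) {
      double v = r_loc[(i3 * n2 + i2) * n1 + i1];  // local index arithmetic
      // ... stencil update using v ...
    }
  }
}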

NPB/MG/mg.cpp Outdated
if ( z(i3,0,0).is_local() ) {
z_local[j3-start] = z.lbegin()+z_size*(i3-z_offset);
is_local[j3-start] = true;
futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data());
Member:

Note: this call to copy_async copies zero elements

Member:

Suggested change:
- futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data());
+ futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*(i3+1), z_local_v[j3-start].data());

NPB/MG/mg.cpp Outdated
}
if ( z(i3+1,0,0).is_local() ) {
z_local[j3-start+1] = z.lbegin()+z_size*(i3+1-z_offset);
futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data());
Member:

Note: this call to copy_async copies zero elements

Member:

Suggested change:
- futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data());
+ futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+2), z_local_v[j3-start+1].data());

NPB/MG/mg.cpp Outdated
if ( r(i3,0,0).is_local() ) {
r_local[r_idx] = r.lbegin()+r_psize*(i3-r_offset);
is_local[r_idx] = true;
futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data());
Member:

Suggested change:
- futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data());
+ futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());

NPB/MG/mg.cpp Outdated
if ( r(i3+1,0,0).is_local() ) {
r_local[r_idx] = r.lbegin()+r_psize*(i3+1-r_offset);
is_local[r_idx] = true;
futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());
Member:

Suggested change:
- futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());
+ futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+2), r_local_v[r_idx].data());

Author:

This is actually the work-around for setting the futures to true without having to do any calculations. Unfortunately, if we leave them out and just use is_local[], the code gets even uglier where those elements are accessed. We couldn't find a better way to do it without working in DART or something similar.

Member:

> This is actually the work-around for setting the futures to true without having to do any calculations.

I'm not sure I understand your point. Why do you need a future in the first place if there is no transfer happening? Such a quirk should at least be documented, preferably removed altogether.

Author:

I looked it over and yes, you are right. I omitted the zero-element calls.
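
For context, a sketch of the resolved pattern (variable names taken from the snippets above): copy_async is issued only for remote planes, and local planes are referenced directly through lbegin(), so no dummy zero-element transfers are needed:

if ( z(i3,0,0).is_local() ) {
  z_local[j3-start]  = z.lbegin() + z_size*(i3 - z_offset);
  is_local[j3-start] = true;                       // no transfer needed
} else {
  is_local[j3-start] = false;
  futs[j3-start] = dash::copy_async(z.begin() + z_size*i3,
                                    z.begin() + z_size*(i3+1),
                                    z_local_v[j3-start].data());
  z_local[j3-start]  = z_local_v[j3-start].data();
}
// ... later, before plane j3 is accessed:
if ( !is_local[j3-start] ) {
  futs[j3-start].wait();                           // complete the remote copy
}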

devreal (Member) commented Dec 17, 2019:

After browsing through some of the code, I have another major concern: some codes allocate global memory in the critical path, e.g., every timestep. Global memory allocation is costly (orders of magnitude slower than malloc). Has anyone ever tested this on a real HPC system (ideally IB or Cray where global memory is pinned)? How does it compare against the MPI version of the NPB benchmarks? In many places these global data structures can probably be allocated once and reused, so the fix should be easy.

The reason I am concerned is this: if these benchmarks end up in the repo someone will eventually grab them and use them to compare their approach to DASH. They will not make an attempt to investigate why the performance of DASH seemingly sucks. We should be careful with putting out benchmarks where we cannot show that we are at least in the same ballpark as MPI. This would come back to haunt us...

Mietzsch (Author):

No, I did not test this on a real HPC system. Unfortunately, I'm working on different projects now and don't have the time to test and rework the global data structures. If anybody wants to go ahead and do it, you're more than welcome.
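
A minimal sketch of the reuse pattern devreal asks for, with hypothetical names (scratch, nelems, niter): the collective allocation is hoisted out of the time-step loop and the buffer is reused in every iteration:

// Allocate the global buffer once, outside the critical path, and reuse
// it in every timestep instead of constructing a new dash::Array per step.
dash::Array<double> scratch(nelems);                  // one collective allocation
for (int step = 0; step < niter; ++step) {
  dash::fill(scratch.begin(), scratch.end(), 0.0);    // reuse, don't reallocate
  // ... timestep computation using scratch ...
  scratch.barrier();                                  // sync before the next step
}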
