
added NPB kernels: CG, EP, FT, IS, MG #13

Open · wants to merge 6 commits into base: master

Conversation

Mietzsch (Author):

This is an implementation of the five original NPB kernels using DASH. We use several aspects of DASH, including DASH algorithms, CSR patterns, and copy_async.

#---------------------------------------------------------------------------
# This is the C compiler used for OpenMP programs
#---------------------------------------------------------------------------
CC = mpicxx
Member:

Any reason you're not using dash-mpicxx?

Author:

Right, changed that.
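
Presumably the corresponding line in config/make.def then uses the DASH compiler wrapper the reviewer mentions (a sketch of the change, not taken from the actual commit):

CC = dash-mpicxx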

- rm -r bin/*

veryclean: clean
- rm config/make.def config/suite.def Part*
Member:

Why do you delete a file that is in git?

Author:

The veryclean target was always part of the NPB, I think, for people who want to delete even the config files. I don't use it; if you want, we can delete it and keep only clean and cleanall.

devreal (Member) left a comment:

Yay! Thanks for that effort, that had been on my list. Just two comments I made a second ago.


for (int unit_idx = 0; unit_idx < dash::size(); ++unit_idx) {
local_sizes.push_back(my_elem_count[unit_idx]);
}
Member:

indentation

NPB/CG/cg.cpp Outdated
//now setup a and colidx

typedef dash::CSRPattern<1> pattern_t;
typedef int index_t;
Member:

indentation

NPB/CG/cg.cpp Outdated

if (0 == dash::myid()) {

printf("\n\n NAS Parallel Benchmarks 4.0 C++ DASH version"" - CG Benchmark\n");
Member:

indentation (we typically use 2 spaces)

NPB/CG/cg.cpp Outdated
//now setup a and colidx

typedef dash::CSRPattern<1> pattern_t;
typedef int index_t;
Member:

Could you please typedef the pattern (and also the dash::Matrix using it) somewhere centrally so that it is easy to change?

Also: is it guaranteed that the matrix will never have more than 2G elements?

Member:

And instead of "typedef" I would prefer "using".

Author:

I changed it to using. We are not storing the whole matrix, only the nonzero elements, which are about 36M in Class C, the largest problem size implemented here.
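
For reference, a minimal sketch of what such a central alias block could look like; the header name and the element type are assumptions, not part of this PR:

// npb-types.h (hypothetical): change the pattern and matrix type in one place.
#include <libdash.h>

using pattern_t = dash::CSRPattern<1>;
using index_t   = pattern_t::index_type;   // use the pattern's own index type
using value_t   = double;
using matrix_t  = dash::Matrix<value_t, 1, index_t, pattern_t>;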

NPB/FT/ft.cpp Outdated
}
}

if(0 == dash::myid()) {
Member:

indentation

NPB/FT/ft.cpp Outdated
// }
int offset = k*NX*NY+(myoffset+j)*NX;
dash::copy(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii);
// futs_w[k] = dash::copy_async(&y0[k][0], &y0[k][fftblock], xout.begin()+offset+ii);
Member:

Why not use copy_async? I'd guess that allows multiple concurrent data streams, which should be faster...

Author:

That was my guess as well, but on my computer at home it didn't help; that's why I commented it out. This can be changed by whoever uses these benchmarks.
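
For anyone picking this up later, a minimal sketch of the copy_async variant the reviewer suggests, reusing the names from the snippet above and a hypothetical plane count nz; all transfers are issued first and the futures are waited on afterwards:

// Issue one asynchronous copy per plane, then wait for all of them,
// so the transfers can overlap instead of completing one by one.
using glob_it = decltype(xout.begin());
std::vector<dash::Future<glob_it>> futs_w;
futs_w.reserve(nz);
for (int k = 0; k < nz; ++k) {
  int offset = k*NX*NY + (myoffset + j)*NX;
  futs_w.push_back(
      dash::copy_async(&y0[k][0], &y0[k][fftblock],
                       xout.begin() + offset + ii));
}
for (auto & fut : futs_w) {
  fut.wait();   // every outstanding transfer must finish before xout is reused
}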

NPB/MG/mg.cpp Outdated
fut_bot = dash::copy_async(r.begin()+n2*n1*bottomcoord, r.begin()+n2*n1*(bottomcoord+1), &bottomplane[0]);
}

for(int i3 = 1; i3 < z_ext-1; i3++) {
Member:

If you need performance, .local(...) may not be the best solution. Here you might benefit from plain pointers.

Author:

Here as well, I would say this is a topic left to be investigated when someone uses these benchmarks.
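
A minimal sketch of the plain-pointer variant the reviewer hints at, using names from the snippet above; the local index arithmetic and loop bounds are assumptions, and the stencil body is left out:

// Take the raw pointer to this unit's local block once and index it
// directly, instead of going through element-wise .local(...) accesses.
double * r_loc = r.lbegin();   // plain pointer into local memory
for (int i3 = 1; i3 < z_ext - 1; ++i3) {
  for (int i2 = 0; i2 < n2; ++i2) {
    for (int i1 = 0; i1 < n1; ++i1) {
      double v = r_loc[(i3 * n2 + i2) * n1 + i1];  // local index arithmetic
      // ... stencil update using v ...
    }
  }
}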

NPB/MG/mg.cpp Outdated
if ( z(i3,0,0).is_local() ) {
z_local[j3-start] = z.lbegin()+z_size*(i3-z_offset);
is_local[j3-start] = true;
futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data());
Member:

Note: this call to copy_async copies zero elements

Member:

Suggested change:
- futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*i3, z_local_v[j3-start].data());
+ futs[j3-start] = dash::copy_async(z.begin()+z_size*i3, z.begin()+z_size*(i3+1), z_local_v[j3-start].data());

NPB/MG/mg.cpp Outdated
}
if ( z(i3+1,0,0).is_local() ) {
z_local[j3-start+1] = z.lbegin()+z_size*(i3+1-z_offset);
futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data());
Member:

Note: this call to copy_async copies zero elements

Member:

Suggested change:
- futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+1), z_local_v[j3-start+1].data());
+ futs[j3-start+1] = dash::copy_async(z.begin()+z_size*(i3+1), z.begin()+z_size*(i3+2), z_local_v[j3-start+1].data());

NPB/MG/mg.cpp Outdated
if ( r(i3,0,0).is_local() ) {
r_local[r_idx] = r.lbegin()+r_psize*(i3-r_offset);
is_local[r_idx] = true;
futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data());
Member:

Suggested change:
- futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*i3, r_local_v[r_idx].data());
+ futs[r_idx] = dash::copy_async(r.begin()+r_psize*i3, r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());

NPB/MG/mg.cpp Outdated
if ( r(i3+1,0,0).is_local() ) {
r_local[r_idx] = r.lbegin()+r_psize*(i3+1-r_offset);
is_local[r_idx] = true;
futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());
Member:

Suggested change:
- futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+1), r_local_v[r_idx].data());
+ futs[r_idx] = dash::copy_async(r.begin()+r_psize*(i3+1), r.begin()+r_psize*(i3+2), r_local_v[r_idx].data());

Author:

This is actually the work-around for setting the futures to true without having to do any calculations. Unfortunately, if we leave them out and just use is_local[], the code gets even uglier where those elements are accessed. We couldn't find a better way to do it without working in DART or something similar.

Member:

> This is actually the work-around for setting the futures to true without having to do any calculations.

I'm not sure I understand your point. Why do you need a future in the first place if there is no transfer happening? Such a quirk should at least be documented, preferably removed altogether.

Author:

I looked it over and yes, you are right. I omitted the zero-element calls.
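
For context, a sketch of the resolved pattern (variable names taken from the snippets above): copy_async is issued only for remote planes, and local planes are referenced directly through lbegin(), so no dummy zero-element transfers are needed:

if ( z(i3,0,0).is_local() ) {
  z_local[j3-start]  = z.lbegin() + z_size*(i3 - z_offset);
  is_local[j3-start] = true;                       // no transfer needed
} else {
  is_local[j3-start] = false;
  futs[j3-start] = dash::copy_async(z.begin() + z_size*i3,
                                    z.begin() + z_size*(i3+1),
                                    z_local_v[j3-start].data());
  z_local[j3-start]  = z_local_v[j3-start].data();
}
// ... later, before plane j3 is accessed:
if ( !is_local[j3-start] ) {
  futs[j3-start].wait();                           // complete the remote copy
}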

devreal (Member) commented Dec 17, 2019:

After browsing through some of the code, I have another major concern: some codes allocate global memory in the critical path, e.g., every timestep. Global memory allocation is costly (orders of magnitude slower than malloc). Has anyone ever tested this on a real HPC system (ideally IB or Cray where global memory is pinned)? How does it compare against the MPI version of the NPB benchmarks? In many places these global data structures can probably be allocated once and reused, so the fix should be easy.

The reason I am concerned is this: if these benchmarks end up in the repo someone will eventually grab them and use them to compare their approach to DASH. They will not make an attempt to investigate why the performance of DASH seemingly sucks. We should be careful with putting out benchmarks where we cannot show that we are at least in the same ballpark as MPI. This would come back to haunt us...

Mietzsch (Author):

No, I did not test this on a real HPC system. Unfortunately, I'm working on different projects now and don't have the time to test and rework the global data structures. If anybody wants to go ahead and do it, you're more than welcome.
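
A minimal sketch of the reuse pattern devreal asks for, with hypothetical names (scratch, nelems, niter): the collective allocation is hoisted out of the time-step loop and the buffer is reused in every iteration:

// Allocate the global buffer once, outside the critical path, and reuse
// it in every timestep instead of constructing a new dash::Array per step.
dash::Array<double> scratch(nelems);                  // one collective allocation
for (int step = 0; step < niter; ++step) {
  dash::fill(scratch.begin(), scratch.end(), 0.0);    // reuse, don't reallocate
  // ... timestep computation using scratch ...
  scratch.barrier();                                  // sync before the next step
}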
