
Fixing bug in multiwrite with multiple MPI writes #37

Open — wants to merge 1 commit into base: master
Conversation

eganhila commented:

This fixes a bug in my use case where particles were not being output correctly to restart files for RHybrid+corsair. Previously, simulations would either quietly write particles to nonsense locations, leaving large empty gaps at the correct file offsets, or die with a segfault, depending on the MPI implementation (OpenMPI, Intel MPI). This was triggered any time a process held more particle data than the system's MaxBytesPerWrite limit.
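The failure mode described above comes from how a single oversized write gets split into several MPI writes once it exceeds MaxBytesPerWrite. A minimal sketch of the chunking arithmetic (hypothetical names, not the actual vlsv code), where the key invariant is that the file offset must advance by exactly the bytes written in each chunk — advancing by anything else leaves gaps or overlapping data in the file:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: split one large write of totalBytes into chunks that
// each stay at or under maxBytesPerWrite. Returns (offset, bytes) per chunk.
std::vector<std::pair<uint64_t, uint64_t>>
planChunks(uint64_t startOffset, uint64_t totalBytes, uint64_t maxBytesPerWrite) {
   std::vector<std::pair<uint64_t, uint64_t>> chunks;
   uint64_t offset = startOffset;
   uint64_t remaining = totalBytes;
   while (remaining > 0) {
      const uint64_t bytes =
         (remaining < maxBytesPerWrite) ? remaining : maxBytesPerWrite;
      chunks.push_back({offset, bytes});
      offset += bytes;      // advance by bytes actually written, nothing else
      remaining -= bytes;
   }
   return chunks;
}
```

Each returned (offset, bytes) pair would correspond to one MPI write call; if the per-chunk offset increment is computed from a mismatched element size instead of the byte count, every chunk after the first lands at the wrong position.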

@sandroos (Contributor) left a comment:

So basically you're saying that dataSize is different from datatypeByteSize in your case? If so, it might create some other issues too / there may be a bug elsewhere in hyb/corsair. Can you post an example of the data you're trying to write?

@eganhila (Author) replied:

So when using corsair+rhybrid, dataSize ends up being 56 while datatypeByteSize is 1. The true particle size is 56 bytes, but that factor is already folded into amount. I thought this was the intended behavior because of the discussion at lines 236-244 of corsair/src/kernel/restart_writer.cpp:

// NOTE: addMultiwriteUnit function call here binds to the template wrapper function
// which creates an MPI datatype for char pointer, when it should be a continuous
// array of chars with arrayByteSizes[i] elements. This is corrected by writing
// arraySizes[i][c]*arrayByteSizes[i] elements instead of arraySizes[i][c].

So amount contains 56 * (actual particle number) and datatypeByteSize is given as 1, but because the datatype vlsv receives is "unknown", it interprets dataSize as 56. It all seems to output correctly if I switch the increment to unitOffset (checked using analysator/pyVlsv), but I think there is still a problem with reading back the subsequent particle data, even when combined with PR #31.
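To make the double-counting concrete, here is a small illustrative sketch of the arithmetic described above. The variable names follow the discussion rather than the actual vlsv source, and the particle count is made up; the point is only that multiplying amount (which already includes the 56-byte factor) by dataSize = 56 applies the byte factor twice:

```cpp
#include <cstdint>

// Illustrative only; names follow the PR discussion, not the vlsv source.
// The writer receives a char array, so the 56-byte particle size is already
// folded into `amount`, and datatypeByteSize is sizeof(char) == 1.

// Correct per-write offset advance: bytes actually written (unitOffset).
uint64_t correctAdvance(uint64_t amount, uint64_t datatypeByteSize) {
   return amount * datatypeByteSize;
}

// Buggy advance: with an "unknown" datatype, dataSize is taken to be the
// particle size (56), so the byte factor is applied twice and the offset
// overshoots by a factor of 56.
uint64_t buggyAdvance(uint64_t amount, uint64_t dataSize) {
   return amount * dataSize;
}
```

For 1000 particles, amount = 56 * 1000 = 56000 char elements; the correct advance is 56000 bytes, while the buggy one advances 56 times as far, which matches the "large empty gaps" seen in the broken restart files.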

I'm guessing based on your reaction that this is not exactly how this is supposed to work though?

To test this, I've been running isolated test simulations where I inject 1000 particles per grid cell each time step with a small velocity, so they remain approximately in the same place, and watch when the outputs start to break down. My test script is here, and the modifications I've made to rhybrid's uniform injector are here. I've been working with corsair @ commit 67654ed398 and rhybrid @ commit 2b6c32585. I can share the resulting data files, but they're quite large (since the particle data has to approach maxBytesPerWrite to trigger this), so let me know what would work best for you.
