Potential bug in image driver handling of parameter file, and/or tonic #787
Comments
Another potential culprit is appearing: |
Another potential culprit: parallelization. I've noticed that, if I insert print statements into anything within the loop over cells in I'm not saying that it's wrong to have parallelization per se; I'm saying that if parallelization is occurring even when I tell it to use a single processor, perhaps that could be causing a mis-match between the |
Now, the primary suspect is OpenMPI (at least the version on my machine, which might not be the right version). I deleted the line in The questions now are:
|
@tbohn - thanks for the report.
|
To answer your questions:
|
@tbohn - how are you executing VIC? Do you get the odd behavior if you don't use MPI? Can you run a test with:
|
The above test reproduces the bug. That test is identical to one of the several ways that I ran vic and encountered the bug (i.e., I ran vic both with and without mpirun), except I wasn't explicitly setting |
I think in order to close this issue we should add a test that checks for bit-for-bit equivalence between the single-core image driver and the OpenMP-enabled image driver. That should probably fall on me, but if someone else is interested and gets to it first, go for it. |
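A minimal sketch of what such an equivalence check could look like, assuming the output variables from the two runs have already been read into flat sequences of floats. Reading the NetCDF output is left out, and all names here are hypothetical; none of this comes from VIC's test suite.

```python
import struct


def float_bits(value):
    """Return the 64-bit pattern of a Python float as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def first_bitwise_mismatch(reference, candidate):
    """Return the index of the first element whose bits differ, or None.

    Values are compared by bit pattern, so two NaNs match only if their
    bit patterns are identical, and 0.0 does not match -0.0.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if float_bits(a) != float_bits(b):
            return i
    return None


# Hypothetical flattened output fields from two runs of the same setup.
single_core = [0.0, 1.25, 273.15, 0.1 + 0.2]
openmp_same = [0.0, 1.25, 273.15, 0.1 + 0.2]
openmp_diff = [0.0, 1.25, 273.15, 0.3]  # last value differs only by rounding
```

Comparing bit patterns rather than using `==` or a tolerance means that NaNs appearing in only one of the two runs, or tiny rounding differences, are both reported as mismatches.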
@jhamman or @tbohn is there any update on this issue? I am running into a similar issue with NA values in my VIC output when running with OpenMPI, VIC5.1.0, and the calibrated VIC parameter file and domain from Currier et al. 2023. I get many NAs throughout the CRB domain.
I have tried a variety of the suggestions in this and other threads and in the VIC image driver documentation page (MX_RCACHE = 2, removing the -fopenmp flag, etc.), as well as trying different forcings, etc. We did not build against any conda libraries.
I am able to run all of the VIC sample datasets without this issue.
The issue presents as either large chunks of the domain as NA immediately:
image.png: https://github.com/UW-Hydro/VIC/assets/106703387/6a89df65-931a-4d05-943a-85cff140da9b
Or NAs becoming more prevalent over time:
image.png: https://github.com/UW-Hydro/VIC/assets/106703387/7a1b4b2a-6dd6-4b19-b600-30de405b4513
|
Dear dwoodson, I left academia 4 years ago and have not worked with VIC since. I can't help you with this issue.
Regards,
Ted Bohn
|
@tbohn thank you for your response! -David |
The NA values in my VIC output were caused by two issues:
|
Added description of workaround for conflicts between OpenMP and OpenMPI based on UW-Hydro#787
When using parameter files that I have generated from scratch, various states and fluxes get weird values that fairly quickly lead to `nans`. This doesn't happen with parameter files that I've generated with `tonic` from the Livneh et al. (2015) ASCII soil/veg/snowband parameter files.
I believe that the problem is caused by a combination of how I created my parameter files and some potentially missing/bad logic in VIC (and `tonic`). But when I try to fix my parameter files to make them more amenable to VIC, VIC still has trouble. I am not sure whether it's because I haven't succeeded in fixing my parameter files, or whether there's some other problem occurring. I could use some advice and/or help brainstorming for solutions here...
More detail:
My new parameter files generated from scratch differ from the `tonic`-generated files in a few ways:
- `tonic` sets the vegetation-class parameters to "valid" values (the same constant value) in all classes in all grid cells, regardless of whether those classes are actually present in the grid cell (i.e., `Cv[v] >= 0`). In my parameters, I only defined them where `Cv[v] > 0`, and set them to `missing` where the vegetation class is not present (`Cv[v] == 0`).
- `tonic` creates an extra bare soil tile in cells where the vegetated tile areas sum up to less than 1 (`Cv_sum < 1`), but does NOT increment `Nveg` to reflect the extra tile (i.e., `Nveg < Nnonzero`, where `Nnonzero` is the number of tiles for which `Cv[v] > 0`). In my parameters, the tile areas always reflect exactly what is present in the land cover map; `Cv_sum` always `== 1` (except for numerical error); and `Nveg == Nnonzero`.
- `tonic` parameter files define all floating point variables as type `double`. I defined mine as `float` to save space, since I didn't need the precision.

In `vic_run()`, I saw that bad values (they looked a lot like the `missing` value I'd given them in my parameter file) of veg parameters such as `trunk_ratio` were being passed to `CalcAerodynamic()`. This could certainly explain why the `tonic` parameter files didn't cause problems (since even invalid tiles have valid parameter values in `tonic`-generated parameter files). To verify that `missing` values from invalid tiles were getting passed, I replaced those values in the parameter file with an arbitrary value (again, only where `Cv[v] == 0`), and these values showed up being passed to `CalcAerodynamic()`.
As for why these garbage values were being used, I hypothesized that VIC was somehow mistaking zero-area tiles for valid tiles. Looking at the code, in `get_global_domain()`, VIC assigns `nv_active` to `Nveg+1`, where `Nveg` is the value from the parameter file; then in `vic_alloc()` it allocates `nv_active` tiles to `veg_con` and calls `make_all_vars()` with `nv_active-1`, which allocates `nv_active-1+1` tiles to `veg_var`, `cell`, etc. But then, `vic_init()` does its mapping between the all-classes-stored structure `veg_con_map` and the only-nonzero-classes structure `veg_con` by checking whether `Cv > 0`. `Nveg` is NOT involved in this mapping. My theory at the moment is that my single-precision `Cv` values might not look like 0 to VIC when VIC is interpreting them as doubles, thus causing the mapping to map invalid tiles to the `veg_con` structure.
However, I've tried changing `Cv` in my parameter file to `double`, and tried rounding all values to 6 digits of precision to ensure that 0s look like 0s. But it doesn't fix the problem. So, if I've not properly zeroed the 0s this way, I could definitely use help getting that to work correctly (my python skills are still not at the level of the UW lab).
But I don't know for sure if that's what happened - maybe I did succeed in zeroing out the 0s (it looked that way in a python session), and maybe something else is causing the problem in VIC.
Regardless, I can also see that VIC and tonic might benefit from improved code to counter these problems. Tonic's inconsistent incrementing of `Nveg` seems dangerous. In VIC, we could easily implement a check on whether the total number of tiles with `Cv[v] > 0` actually equals `Nveg` - if not, it could give the user a helpful error message so that the user knows exactly what the problem is.
Also, I think tonic's assignment of valid parameter values to cell/class combinations where the classes are not present is a bad policy, since if VIC for some other reason were to accidentally assign invalid tiles, the error would not be caught (the parameters would be physically reasonable, just wrong).
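The consistency check proposed above could look roughly like the following. This is a sketch in Python for brevity rather than in VIC's C, and the function name, arguments, and error text are all hypothetical.

```python
def check_nveg(cell_id, nveg, cv, tol=0.0):
    """Raise ValueError if Nveg disagrees with the number of tiles with Cv > tol.

    cell_id : identifier used only in the error message
    nveg    : Nveg as read from the parameter file for this cell
    cv      : per-class area fractions (Cv) for this cell
    """
    n_nonzero = sum(1 for frac in cv if frac > tol)
    if n_nonzero != nveg:
        raise ValueError(
            f"cell {cell_id}: Nveg = {nveg} but {n_nonzero} classes have "
            f"Cv > {tol}; check Nveg and Cv in the parameter file"
        )
    return n_nonzero
```

For example, `check_nveg(1, 2, [0.4, 0.6, 0.0])` passes, while a cell with an extra tile that is not counted in `Nveg`, such as `check_nveg(2, 2, [0.4, 0.3, 0.3])`, raises an error naming the cell and both counts.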
Any thoughts/advice on this would be greatly appreciated. I need to get my simulations going ASAP!