Add --end-year YYYY to postprocessing #210

cwhitlock-NOAA · 2024-10-18T18:18:25Z

Prior versions of fre relied on specifying the last (and sometimes first) years of post-processed data to control how many years of data were processed at once. This is how fre dealt with chunks of history files being copied over from wherever the model was running; the postprocessing syntax looked a bit like this:

for year in $(seq 1980 2009); do
  fre pp wrapper -e "$experiment" -p "$platform" -t "$target" --end-year "$year"
  sleep 4h  # wait for the next chunk of history files to transfer
done

where the long pause between successive calls to the wrapper equivalent in Bronx or earlier gave time for new files to be transferred to the pp nodes. fre was smart enough to know that prior years had already been post-processed, so only the data in the range ($year-1) to $year needed to be processed by each call.
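The bookkeeping described above (skip years that are already post-processed, handle only what is new up to the requested end year) can be sketched as follows. This is a minimal illustration, not fre's actual implementation; the function name `years_to_process` and its parameters are hypothetical.

```python
def years_to_process(first_year, end_year, already_done):
    """Return the years from first_year through end_year (inclusive)
    that have not been post-processed yet.

    All names here are hypothetical; this only illustrates the idea of
    skipping years that earlier calls already handled.
    """
    return [
        year
        for year in range(first_year, end_year + 1)
        if year not in already_done
    ]


# A call with an end year of 1982, after 1980 and 1981 were already handled,
# leaves only 1982 to process.
remaining = years_to_process(1980, 1982, already_done={1980, 1981})
```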

This functionality is not present in the fre-cli codebase, and if we want to maintain backwards compatibility for this particular command, we'd need to change our command-line options. However, we may NOT need it: canopy is capable of pausing jobs and resuming them when new data is present. The logic flow for that would look more like this:

while [ "$data_is_not_all_present" = true ]; do
  # fre pp wrapper is smart enough to check for new data and tell canopy it is present if data is there
  fre pp wrapper -e "$experiment" -p "$platform" -t "$target"
  sleep 4h
done

Whether or not we implement this will depend largely on whether users miss this functionality. For now, it's important to remember that this functionality is NOT present in fre-cli.

ceblanton · 2024-10-18T18:28:59Z

Thanks for organizing this!

The forward-looking approach, which does not need the "-t YYYY" clue for the workflow to know that year YYYY's output is ready for postprocessing, is even easier than how you had it. The Cylc workflow itself can know whether the history files are available, so strictly speaking it should not need the wrapper to tell it anything about when new history files arrive.

Custom (task) trigger functions are the relevant Cylc feature:

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/external-triggers.html#custom-trigger-functions

We don't yet use it in fre-workflows, but the pp-shield prototype cylc template does use it. e.g. (https://gitlab.gfdl.noaa.gov/fre2/workflows/pp-shield/-/blob/main/include/shield/shield.cylc?ref_type=heads#L32)

    [[xtriggers]]
        history_complete = history_complete(model={{ MODEL }}, \
                                            point=%(point)s, \
                                            env_script={{ NGGPS_DIR }}/parm/set_vars.sh, \
                                            state_file_template={{ NGGPS_DIR }}/do_pp/LOGS/NOTE_STATUS.log)
    [[graph]]
        T00,T06,T12,T18 = """
            @history_complete => make-pp-script => run-pp1 => run-pp2<history> => runtrak?
        """

The "history_complete" trigger there is run repeatedly by the Cylc scheduler, and when it passes for a certain cycle point (a year, in the PP context), the make-pp-script task will be triggered.

So in an ideal world, we can totally outsource the logic for detecting history files to Cylc. But until we master the Cylc triggers, we'll want the ability to have the wrapper tell the workflow to start a particular year (cycle point) of postprocessing.
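For orientation, a custom Cylc trigger function is a Python function that the scheduler calls repeatedly and that returns a (satisfied, results) tuple. The sketch below is not the pp-shield implementation; the function name `history_file_exists`, the `history_dir` argument, and the `<point>.nc.tar` file naming are all assumptions made for illustration.

```python
from pathlib import Path


def history_file_exists(history_dir, point):
    """Hypothetical custom xtrigger: is the history file for this cycle point present?

    Returns the (satisfied, results) tuple that Cylc expects from a custom
    trigger function. The "<point>.nc.tar" naming is an assumption for this
    sketch, not something the workflow is known to use.
    """
    candidate = Path(history_dir) / f"{point}.nc.tar"
    if candidate.is_file():
        # The results dict is made available to the tasks that depend on the trigger.
        return True, {"history_file": str(candidate)}
    return False, {}
```

In a workflow, a function like this would be referenced from the [[xtriggers]] section in the same way as the history_complete entry above, with %(point)s supplying the cycle point.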

cwhitlock-NOAA · 2024-10-18T18:43:03Z

I'd like to make clear that we don't need the history_complete trigger either. The current just-enough-functionality plan is for fre pp run to issue one of a few possible cylc commands:

cylc run (start the job)
cylc reload (restart the job)
cylc trigger (tell the job there is new data to use)

The minimal functionality to get postprocessing updated with new data is more like the following:

for n in $(seq 1 "$max_tries"); do  # max_tries: way longer than we think we need
  # fre pp wrapper is smart enough to check for new data and tell canopy it is present if data is there (cylc run, reload or trigger)
  fre pp wrapper -e "$experiment" -p "$platform" -t "$target"
  sleep 4h
done

see: https://github.com/NOAA-GFDL/fre-cli/tree/main/fre/pp#readme

cwhitlock-NOAA · 2024-10-18T18:55:55Z

After talking with Chris, this functionality seems to come in 3 stages:

1. cylc run calls cylc trigger
   - not an approved cylc solution - think kicking a stuck conveyor belt
   - dead simple, we've been discussing it for months
2. YYYY year date range specifications on the command-line
   - useful to have, we can get by without it
3. cylc external triggers (history_complete)
   - cylc-approved solution, does not require awkward moving parts
   - requires us to understand what cylc external triggers are

ceblanton · 2024-10-18T19:16:42Z

Yes, agreed. Cylc external triggers are the ultimate solution for workflows knowing when input data is available. In the meantime, regular interactive Cylc task control (cylc trigger WORKFLOW//CYCLE-POINT/TASK), whether typed by humans or issued by fre pp wrapper, is perfectly fine and we know it well.
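As a rough illustration of the interim approach, a wrapper could assemble the cylc trigger WORKFLOW//CYCLE-POINT/TASK command for a given year like this. The helper name `build_trigger_command` and the example workflow, cycle point, and task values are hypothetical; the sketch only builds the argument list and does not run anything.

```python
def build_trigger_command(workflow, cycle_point, task):
    """Build the argument list for `cylc trigger WORKFLOW//CYCLE-POINT/TASK`.

    Hypothetical helper: it only formats the task ID in the form quoted in
    the comment above and leaves execution to the caller.
    """
    return ["cylc", "trigger", f"{workflow}//{cycle_point}/{task}"]


# Example values are placeholders, not names from a real experiment.
cmd = build_trigger_command("my_experiment", "19800101T0000Z", "make-pp-script")
```

Returning a list rather than a single string means the caller can pass it straight to subprocess.run without shell quoting concerns.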

cwhitlock-NOAA assigned cwhitlock-NOAA, ceblanton and singhd789 Oct 18, 2024
