
intermittent segmentation faults caused by partitioner.data_partition #120

Open
samdporter opened this issue Sep 30, 2024 · 5 comments
@samdporter

Unfortunately far too late to do anything about it now...

I'm seeing intermittent segmentation faults caused by the partitioner.data_partition function. It only shows up when using the edge-gpu Docker image, and I hadn't seen it before today, though that may just have been luck: I can't see an obvious culprit in any recent commits.

I don't see this when I run locally.

@KrisThielemans
Member

@samdporter can you give some more detail? How did you call the data_partition function? Ideally a code snippet. Did you see GPU errors such as

cudaMalloc returned error no CUDA-capable device is detected (code 100), line(57)

@samdporter
Author

Hey Kris,
The partitioner was used in the same way as in the example files (in fact, I saw the same behaviour when using main_ISTA.py). The error was "segmentation fault (core dumped)", exactly the same as I've previously seen when using the partitioner without setting AcquisitionData.set_storage_scheme('memory'). It only ever occurred when using the partitioner in an edge-gpu Docker container.

class Submission(ISTA):

    def __init__(self, data: Dataset, update_objective_interval=10):
        """
        Initialisation function, setting up data & (hyper)parameters.
        """
        # Very simple heuristic to determine the number of subsets
        self.num_subsets = calculate_subsets(data.acquired_data,
                                             min_counts_per_subset=2**20,
                                             max_num_subsets=16)
        update_interval = self.num_subsets
        # 10% decay per update interval
        decay_perc = 0.1
        decay = (1 / (1 - decay_perc) - 1) / update_interval
        beta = 0.5

        # error only ever occurs here
        _, _, obj_funs = partitioner.data_partition(data.acquired_data,
                                                    data.additive_term,
                                                    data.mult_factors,
                                                    self.num_subsets,
                                                    mode='staggered',
                                                    initial_image=data.OSEM_image)
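(As an aside on the snippet above: the subset heuristic and the decay formula can be checked in plain Python, without SIRF. calculate_subsets below is a hypothetical stand-in for the author's helper — the real one takes SIRF acquisition data; this version takes a raw count total, and the 2**24 counts are an assumed example, purely for illustration.)

```python
# Hypothetical stand-in for the calculate_subsets helper used in the
# snippet above (assumed behaviour: cap the number of subsets so that
# each subset keeps at least min_counts_per_subset counts).
def calculate_subsets(total_counts, min_counts_per_subset=2**20,
                      max_num_subsets=16):
    return max(1, min(max_num_subsets, total_counts // min_counts_per_subset))

num_subsets = calculate_subsets(2**24)   # 2**24 counts -> 16 subsets
update_interval = num_subsets

# Worked check of the decay formula: a 10% step-size decay spread over
# one update interval (= one pass over all subsets).
decay_perc = 0.1
decay = (1 / (1 - decay_perc) - 1) / update_interval
print(num_subsets, round(decay, 6))
```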

@KrisThielemans
Member

AcquisitionData.set_storage_scheme('memory') is currently required for the subsets. I'd have hoped it would generate a warning rather than a crash.

Can you confirm you had crashes with "memory" on?

@KrisThielemans
Member

@samdporter can you please confirm here that

  • you did not see cudaMalloc errors
  • you saw this when running main_ISTA, both on your edge-gpu docker image and when you submitted it (if there's an explicit job/tag you could refer to, that'd be great)
  • you never saw this on your "native build"

@samdporter
Author

  • No cudaMalloc errors
  • I saw this error with main_ISTA on my edge-gpu Docker image, but I never attempted to submit main_ISTA itself. I did see the issue with my own algorithms, both on my edge-gpu Docker image and when submitting; here is the job tag for the most recent submission. (I have just resubmitted the job in a container running on my machine and it's working fine, which is a bit confusing.)
  • I can confirm that I never saw this on my native build.
