
intermittent segmentation faults caused by partitioner.data_partition #120

Open
samdporter opened this issue Sep 30, 2024 · 5 comments
@samdporter

Unfortunately far too late to do anything about it now...

I'm seeing intermittent segmentation faults caused by the partitioner.data_partition function. It only shows up when using the edge-gpu Docker image, and I hadn't seen it before today, though that may just have been luck: I can't see an obvious culprit in any recent commits.

I don't see this when I run locally.

@KrisThielemans
Member

@samdporter can you give some more detail? How did you call the data_partition function? Ideally a code snippet. Did you see GPU errors such as

cudaMalloc returned error no CUDA-capable device is detected (code 100), line(57)

@samdporter
Author

Hey Kris,
The partitioner was used in the same way as in the example files (in fact, I saw the same behaviour when using main_ISTA.py). The error was "segmentation fault (core dumped)", exactly the same as I've previously seen when using the partitioner without setting AcquisitionData.set_storage_scheme('memory'). It only ever occurred when using the partitioner in an edge-gpu Docker container.

class Submission(ISTA):

    def __init__(self, data: Dataset, update_objective_interval=10):
        """
        Initialisation function, setting up data & (hyper)parameters.
        """
        # Very simple heuristic to determine the number of subsets
        self.num_subsets = calculate_subsets(data.acquired_data,
                                             min_counts_per_subset=2**20,
                                             max_num_subsets=16)
        update_interval = self.num_subsets
        # 10% decay per update interval
        decay_perc = 0.1
        decay = (1 / (1 - decay_perc) - 1) / update_interval
        beta = 0.5

        # error only ever occurs here
        _, _, obj_funs = partitioner.data_partition(data.acquired_data,
                                                    data.additive_term,
                                                    data.mult_factors,
                                                    self.num_subsets,
                                                    mode='staggered',
                                                    initial_image=data.OSEM_image)
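(As an aside on the snippet above: the subset heuristic and the decay formula can be checked in plain Python, without SIRF. calculate_subsets below is a hypothetical stand-in for the author's helper — the real one takes SIRF acquisition data; this version takes a raw count total, and the 2**24 counts are an assumed example, purely for illustration.)

```python
# Hypothetical stand-in for the calculate_subsets helper used in the
# snippet above (assumed behaviour: cap the number of subsets so that
# each subset keeps at least min_counts_per_subset counts).
def calculate_subsets(total_counts, min_counts_per_subset=2**20,
                      max_num_subsets=16):
    return max(1, min(max_num_subsets, total_counts // min_counts_per_subset))

num_subsets = calculate_subsets(2**24)   # 2**24 counts -> 16 subsets
update_interval = num_subsets

# Worked check of the decay formula: a 10% step-size decay spread over
# one update interval (= one pass over all subsets).
decay_perc = 0.1
decay = (1 / (1 - decay_perc) - 1) / update_interval
print(num_subsets, round(decay, 6))
```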

@KrisThielemans
Member

AcquisitionData.set_storage_scheme('memory') is currently required for the subsets. I'd have hoped it would generate a warning rather than a crash.

Can you confirm you had crashes with "memory" on?

@KrisThielemans
Member

@samdporter can you please confirm here that

  • you did not see cudaMalloc errors
  • you saw this when running main_ISTA, both on your edge-gpu docker image and when you submitted it (if there's an explicit job/tag you could refer to, that'd be great)
  • you never saw this on your "native build"

@samdporter
Author

  • No cudaMalloc errors
  • I saw this error with main_ISTA on my edge-gpu Docker image, but I never attempted to submit main_ISTA itself. I did see the issue with my own algorithms, both on my edge-gpu Docker image and when submitting; here is the job tag for the most recent submission. (I have just resubmitted the job in a container running on my machine and it's working fine, which is a bit confusing.)
  • I can confirm that I never saw this on my native build.
