Exception inside OpenMP loops cause terminate #44

Open
1 task done
hmenke opened this issue Oct 25, 2023 · 0 comments
Labels
bug Something isn't working

Prerequisites

Description

Sometimes we make mistakes, which leads TRIQS to throw exceptions. For example, trying to Fourier transform a Green's function defined on a mesh with only a single frequency produces the following:

Example Python script
import numpy as np
from triqs.gf import MeshImFreq
from triqs_tprf.tight_binding import TBLattice

t = 1.0
H = TBLattice(
    units = [(1, 0, 0), (0, 1, 0)],
    hopping = {
        # nearest neighbour hopping -t
        ( 0,+1): -t * np.eye(2),
        ( 0,-1): -t * np.eye(2),
        (+1, 0): -t * np.eye(2),
        (-1, 0): -t * np.eye(2),
        },
    orbital_positions = [(0,0,0)]*2,
    orbital_names = ['up', 'do'],
    )

kmesh = H.get_kmesh(n_k=(32, 32, 1))
e_k = H.fourier(kmesh)

from triqs_tprf.lattice import lattice_dyson_g0_wk, fourier_wk_to_wr, fourier_wr_to_tr
iw_mesh = MeshImFreq(beta=10.0, S='Fermion', n_max=1)
g0_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=iw_mesh)
g0_wr = fourier_wk_to_wr(g0_wk)
g0_tr = fourier_wr_to_tr(g0_wr)

Output
libc++abi: terminating due to uncaught exception of type triqs::runtime_error: Triqs runtime error
    at /usr/include/triqs/./mesh/./tail_fitter.hpp : 157

Insufficient data points for least square procedure
Exception was thrown on node 
Aborted (core dumped)

However, as you can see, this is not a Python exception. Instead, an unhandled C++ exception has caused the entire process to abort. This is quite annoying when prototyping in a Jupyter notebook, because every time it happens the entire Jupyter kernel dies.

After some digging I found that this happens because exceptions are not allowed to leave OpenMP parallel regions. From the OpenMP specification:

A throw executed inside a parallel region must cause execution to resume within the same parallel region, and the same thread that threw the exception must catch it.

Steps to Reproduce

An exception thrown inside a parallel region cannot be caught outside of it. The attempt below ends in std::terminate(), which calls abort().

#include <iostream>
#include <stdexcept>

void do_stuff(int i) {
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        #pragma omp parallel for
        for (int i = 0; i < 10; ++i) {
            do_stuff(i);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}

One possibility would be to equip every parallel region with a std::exception_ptr that stores the last caught exception and rethrows it outside the region. This does not cover the case of multiple (possibly different) exceptions being thrown on different threads, but I also don't see a straightforward way to convert a stack of C++ exceptions into a Python exception.

#include <exception>
#include <iostream>
#include <stdexcept>

void do_stuff(int i) {
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        std::exception_ptr eptr;
        #pragma omp parallel for
        for (int i = 0; i < 10; ++i) {
            try {
                do_stuff(i);
            } catch (...) {
                #pragma omp critical
                eptr = std::current_exception();
            }
        }
        if (eptr) {
            std::rethrow_exception(eptr);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}
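
Since every parallel region would need the same boilerplate, the pattern above could be factored into a small helper. The following is only a sketch: `omp_exception_guard` and `guarded_loop` are hypothetical names, not part of TRIQS or OpenMP, and unlike the example above the helper keeps the first exception it sees rather than the last.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of TRIQS or OpenMP): remembers the first
// exception caught on any thread so it can be rethrown after the region.
class omp_exception_guard {
    std::exception_ptr eptr_;

public:
    // Run f() and swallow any exception, storing only the first one seen.
    template <typename F>
    void run(F &&f) noexcept {
        try {
            f();
        } catch (...) {
            // Named critical section so it does not contend with unrelated ones.
            #pragma omp critical(omp_exception_guard_store)
            if (!eptr_) eptr_ = std::current_exception();
        }
    }

    // Call after the parallel region, on the thread that handles the error.
    void rethrow() const {
        if (eptr_) std::rethrow_exception(eptr_);
    }
};

void do_stuff(int i) {
    if (i == 5) throw std::out_of_range("oops");
}

// Returns the message of the propagated exception, or "" if none was thrown.
std::string guarded_loop(int n) {
    omp_exception_guard guard;  // declared outside the region, hence shared
    try {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            guard.run([&] { do_stuff(i); });
        }
        guard.rethrow();
    } catch (std::exception const &e) {
        return e.what();
    }
    return "";
}
```

With this, `guarded_loop(10)` reports the `std::out_of_range` message instead of terminating the process, and the loop body only needs to be wrapped in `guard.run(...)`.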

When individual loop iterations take a long time, performance in the exceptional case can be improved further by using OpenMP cancellation points, so that the remaining iterations are abandoned once an exception has been caught. However, this requires that the user exports OMP_CANCELLATION=1.

#include <chrono>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

void do_stuff(int i) {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(i*10ms);
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        std::exception_ptr eptr;
        #pragma omp parallel for
        for (int i = 0; i < 100; ++i) {
            try {
                do_stuff(i);
            } catch (...) {
                #pragma omp critical
                eptr = std::current_exception();
                #pragma omp cancel for
            }
            #pragma omp cancellation point for
        }
        if (eptr) {
            std::rethrow_exception(eptr);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}
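
If requiring OMP_CANCELLATION=1 is not acceptable, a shared flag can approximate the same effect: once one iteration has failed, the remaining iterations return immediately without doing any work. This is a sketch of a different technique, not OpenMP cancellation itself. The name `loop_with_flag` is hypothetical, and the code assumes that the OpenMP implementation interoperates with `std::atomic`, which is commonly relied on in practice.

```cpp
#include <atomic>
#include <exception>
#include <stdexcept>
#include <string>

void do_stuff(int i) {
    if (i == 5) throw std::out_of_range("oops");
}

// Hypothetical helper: runs up to n iterations and returns how many of them
// actually called do_stuff. `message` receives the text of the propagated
// exception, if there was one.
int loop_with_flag(int n, std::string &message) {
    std::exception_ptr eptr;
    std::atomic<bool> failed{false};  // set once any iteration has thrown
    std::atomic<int> executed{0};     // only here to make the effect observable

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // The iteration is still scheduled, but it does no work after a failure.
        if (failed.load()) continue;
        try {
            ++executed;
            do_stuff(i);
        } catch (...) {
            #pragma omp critical
            if (!eptr) eptr = std::current_exception();
            failed.store(true);
        }
    }

    try {
        if (eptr) std::rethrow_exception(eptr);
    } catch (std::exception const &e) {
        message = e.what();
    }
    return executed.load();
}
```

Iterations that are already running when the flag is set still finish, just as with cancellation points, which are also only checked at specific places.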

Expected behavior: Get a Python exception

Actual behavior: Unhandled C++ exception causes abort()

Versions

$ python3 -c "from triqs_tprf.version import *; show_version(); show_git_hash();"

You are using triqs_tprf version 3.2.0


You are using triqs_tprf git hash ce36521536d8b7acdcb693fe2d0d15135ecb16fd based on triqs git hash e1fa5dd2c8984e334574163f6323e956a49ffbd5

$ grep VERSION= /etc/os-release 
VERSION="20.04.4 LTS (Focal Fossa)"


hmenke added the bug label Oct 25, 2023