Interoperability with fork() #223

Open
nspark opened this issue Jun 11, 2018 · 3 comments · May be fixed by #474

@nspark
Contributor

nspark commented Jun 11, 2018

Recently, I've worked with some other users who have needed to spawn child subprocesses from their OpenSHMEM applications. This process took us through three different implementation paths:

  • Typically, one would call fork() followed by exec(). Unfortunately, this does not seem to work portably [*], as some SHMEM libraries segfault at the fork() call.
  • In the course of experimentation, users found that calling vfork() followed by exec() worked portably [*], but vfork() was marked obsolete in POSIX-2001, removed from POSIX-2008, and is not without its critics (ref: 1, 2).
  • Using posix_spawn() seems to work portably [*]. (Somewhat ironically, it uses vfork() under the hood on Linux.)

[*] where "portably" means "across all the OpenSHMEM implementations to which I have access, all of which are running on Linux".

To be clear, the child processes -- either after fork() or after exec() -- will not make any calls into the OpenSHMEM library. The application is not calling fork() without an exec(). (Of secondary interest may be the case in which the application calls fork() but does not call exec(). I haven't seen applications that do this, since I would think threads would be preferred.)

Should the OpenSHMEM specification take a position on interoperability with fork()?

@manjugv
Collaborator

manjugv commented Jun 25, 2018

We should make a statement, IMO. There are also other scenarios where clarification would be helpful. For example, what if a process launched by the job launcher does not call shmem_init, but a process forked from the launched process calls shmem_init? Should that be valid?

Nick, would you like to discuss this in the next threads meeting?

@nspark
Contributor Author

nspark commented Jun 25, 2018

> For example, what if a process launched by the job launcher does not call shmem_init, but a process forked from the launched process calls shmem_init? Should that be valid?

Yes, I generally think this should be a valid scenario. I have used/relied on this many times to write wrapper applications that create and monitor (e.g., hardware counters) the SHMEM application as a child process. I know of other applications for which others have used a similar pattern. That said, I'm not sure the OpenSHMEM Specification can or should address this. The situation you describe spans the parallel process launcher, the parent (or primary) process, and child process(es).
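The wrapper pattern described above might look roughly like the following. This is only a sketch: run_and_monitor() is a hypothetical helper name, it uses posix_spawnp() as one possible way to start the child, and it passes the wrapper's environment through unchanged. The wrapper itself makes no OpenSHMEM calls; the child would be the one to call shmem_init().

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper (not part of any SHMEM library): start argv[0] as a
 * child process and wait for it. Returns the child's exit status,
 * 128 + signal number if it was terminated by a signal, or -1 on error. */
int run_and_monitor(char *const argv[])
{
  assert(argv != NULL && argv[0] != NULL);

  pid_t pid;
  /* Pass the current environment through unchanged, so the child sees
   * whatever the job launcher exported to this wrapper. */
  int rc = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
  if (rc != 0)
  {
    fprintf(stderr, "ERROR: posix_spawnp returned non-zero (%d: %s)\n",
            rc, strerror(rc));
    return -1;
  }

  /* A real wrapper would do its monitoring here (e.g., sampling
   * hardware counters) while the child runs. */

  int status;
  while (waitpid(pid, &status, 0) == -1)
  {
    if (errno != EINTR)
      return -1; /* waitpid failed for a reason other than a signal */
  }

  if (WIFEXITED(status))
    return WEXITSTATUS(status);
  if (WIFSIGNALED(status))
    return 128 + WTERMSIG(status);
  return -1;
}
```

A wrapper's main() could call run_and_monitor(&argv[1]) and return the result, so that something like `srun -n 2 ./wrapper ./shmem-app` starts one wrapper and one SHMEM child per task. Whether the child can then initialize correctly depends on the launcher and the SHMEM implementation.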

The systems that I use generally all run Slurm, so I'm relying on an assumption that Slurm informs shmem_init() (and thereby PMI_Init()) through environment variables. By clearing or modifying the environment of the child process, a child might not be able to properly initialize the SHMEM library. Since I think it means extending further into the nuances of job launchers and resource managers, I'm not comfortable saying that the OpenSHMEM Specification should make a claim that this should always be a valid scenario. At most, I think this should be "implementation defined" behavior.
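To make the environment concern concrete, a wrapper could check whether the environment it is about to hand to a child still carries any launcher-provided variables. The sketch below is a hypothetical diagnostic, not anything a SHMEM library provides. The "SLURM_" and "PMI_" prefixes in the usage note are assumptions about what a launcher exports; which variables a given shmem_init() actually reads is implementation specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical diagnostic: count the entries of a NULL-terminated
 * environment array whose text starts with the given prefix
 * (e.g., "SLURM_", which is an assumed launcher prefix). */
size_t count_env_with_prefix(char *const envp[], const char *prefix)
{
  assert(prefix != NULL);

  if (envp == NULL)
    return 0; /* an absent environment carries no variables */

  const size_t len = strlen(prefix);
  size_t count = 0;
  for (size_t i = 0; envp[i] != NULL; i++)
  {
    if (strncmp(envp[i], prefix, len) == 0)
      count++;
  }
  return count;
}
```

Before calling posix_spawn() or execve() with a custom environment array, a wrapper could compare count_env_with_prefix(environ, "SLURM_") against the count for the child's array and warn if the child's count drops to zero. That would only flag a likely problem; it cannot prove that the child's shmem_init() will succeed or fail.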

@nspark
Contributor Author

nspark commented Jun 26, 2018

Here is the test program that I promised...

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <spawn.h>
#include <shmem.h>

extern char **environ;

int main(int argc, char *argv[])
{
  if (argc == 1)
  {
    printf("Hello, from the child process\n");
    return 0;
  }

  shmem_init();

  const int spawn_mode = atoi(argv[1]);
  char * spawn_prog = argv[argc >= 3 ? 2 : 0];

  int rc, errnum;
  pid_t pid = 0;
  char *child_argv[2] = { spawn_prog, NULL };
  switch (spawn_mode)
  {
  case 1:
  case 2:
    errno = 0;
    pid = (spawn_mode == 1 ? fork() : vfork());
    errnum = errno;
    if (pid == -1)
    {
      fprintf(stderr, "ERROR: %s returned -1 (%d: %s)\n",
              (spawn_mode == 1 ? "fork()" : "vfork()"),
              errnum, strerror(errnum));
      exit(1);
    }
    else if (pid == 0)
    {
      errno = 0;
      rc = execvp(spawn_prog, child_argv);
      errnum = errno;
      if (rc == -1)
      {
        fprintf(stderr, "ERROR: execvp returned -1 (%d: %s)\n",
                errnum, strerror(errnum));
        _Exit(1); // _Exit is needed for vfork()
      }
    }
    break;
  case 3:
    rc = posix_spawn(&pid, spawn_prog, NULL, NULL, child_argv, environ);
    if (rc != 0)
    {
      fprintf(stderr, "ERROR: posix_spawn returned non-zero (%d: %s)\n",
              rc, strerror(rc));
      exit(1);
    }
    break;
  default:
    fprintf(stderr, "ERROR: invalid spawn mode\n");
    exit(1);
  }

  int status;
  waitpid(pid, &status, 0);
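  if (WIFEXITED(status))
  {
    rc = WEXITSTATUS(status);
    // Argument order matches the format: PE index, PE count, exit status
    printf("%d/%d: child exited with status %d\n",
           shmem_my_pe(), shmem_n_pes(), rc);
  }
  else if (WIFSIGNALED(status))
  {
    rc = WTERMSIG(status);
    // Argument order matches the format: PE index, PE count, signal number
    printf("%d/%d: child terminated by signal %d\n",
           shmem_my_pe(), shmem_n_pes(), rc);
  }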

  shmem_finalize();

  return 0;
}

Where running it (successfully) should show something like:

$ srun -n 2 ./fork-test 1 # 1 -> fork, 2 -> vfork, 3 -> posix_spawn
Hello, from the child process
Hello, from the child process
0/2: child exited with status 0
1/2: child exited with status 0
