Defining the number of cores to utilize. #8

Open
jlaura opened this issue Aug 8, 2013 · 4 comments


jlaura commented Aug 8, 2013

We need to be careful using (ncores - 1) as the number of processing cores. This does not work on a dual-core machine when we use slice notation, e.g.:

step = nShapes / (cores - 1)
start = range(0, nShapes, step)[0:-1]
end = start[1:]
end.append(nShapes)
slices = zip(start, end)

for c in range(cores-1):
    pids = range(slices[c][0], slices[c][1])  # Throws an IndexError here: with cores == 2, slices is empty.

Probably the best bet, across pysal, is something like:

import multiprocessing as mp

def my_func(arg1, arg2, kwarg=None, cores=None):
    if cores is None:
        cores = mp.cpu_count()
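
A possible sketch of that pattern filled out, splitting across all resolved cores instead of (cores - 1) so a dual-core machine still produces valid slices (nShapes is the placeholder from the snippet above):

import multiprocessing as mp

def make_slices(nShapes, cores=None):
    if cores is None:
        cores = mp.cpu_count()
    # Integer-divide across all cores; max(1, ...) avoids a zero step when
    # nShapes < cores.
    step = max(1, nShapes // cores)
    start = list(range(0, nShapes, step))
    end = start[1:] + [nShapes]
    return list(zip(start, end))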

pedrovma commented Aug 8, 2013

In spreg we've adopted this as the default:

def __init__(self, y, x, regimes, w, cores=None):
    pool = mp.Pool(cores) 

I don't think the if statement is needed, and I haven't used it so far. In mp.Pool(processes), processes is the number of worker processes to use; if processes is None, then the number returned by cpu_count() is used, so that check seems redundant.
http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool
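
For the plain map case, a minimal sketch along those lines (the worker function and data are made up for illustration):

import multiprocessing as mp

def square(x):
    # Hypothetical per-item task.
    return x * x

def run(data, cores=None):
    # mp.Pool(None) falls back to cpu_count() internally, so no explicit check is needed.
    pool = mp.Pool(cores)
    try:
        return pool.map(square, data)
    finally:
        pool.close()
        pool.join()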


jlaura commented Aug 8, 2013

That works for instances where map is being used without explicitly defining a chunk size, but it fails if we need to determine a chunk size or a step size for apply_async; i.e., we could not compute the first block of code in the initial post, since cores would be None.
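
A sketch of that apply_async case, where cores has to be resolved to a real number before the chunk boundaries can be computed (using the builtin sum as the per-chunk task purely for illustration):

import multiprocessing as mp

def run_chunks(data, cores=None):
    # pool.map can take cores=None, but the chunk arithmetic below cannot,
    # so resolve it here.
    if cores is None:
        cores = mp.cpu_count()
    step = max(1, len(data) // cores)
    pool = mp.Pool(cores)
    jobs = [pool.apply_async(sum, (data[i:i + step],))
            for i in range(0, len(data), step)]
    pool.close()
    pool.join()
    return [job.get() for job in jobs]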


dfolch commented Aug 8, 2013

  1. Is this thread concerning parallel pysal (pPysal) or regular pysal? It seems that regular pysal should default to one core, and pPysal to all cores on the machine.

  2. A middle ground might be designating a pysal and pPysal global variable called CORES. At instantiation this is automatically set to mp.cpu_count().

    2a. For pPysal all relevant functions are defaulted to cores=CORES.
    2b. For pysal all relevant functions are defaulted to cores=1. But the user could always pass pysal.CORES to cores to get the maximum.

  3. Do we want to throw a warning if the user asks for 53 cores on a 4-core machine? Multiprocessing seems to be robust to this mistake and does not crash (or throw a warning, for that matter). For Jay's case this would mess up the chunking, so some kind of check would still need to be done whether the user is warned or not (a rough sketch of points 2 and 3 follows this list).
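
A rough sketch of points 2 and 3 combined, assuming a module-level constant set at import time and a small guard function (both names are made up for illustration):

import multiprocessing as mp
import warnings

# Point 2: set once when the package is imported.
CORES = mp.cpu_count()

def _resolve_cores(requested):
    # Point 3: warn and fall back if the user asks for more cores than exist.
    if requested > CORES:
        warnings.warn("Requested %d cores but only %d are available; using %d."
                      % (requested, CORES, CORES))
        return CORES
    return requested

A pPysal function could then default to cores=CORES, while the serial pysal version defaults to cores=1 and a user can still pass pysal.CORES explicitly.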


jlaura commented Aug 8, 2013

  1. We have a mix currently, with some spreg stuff already using multiprocessing (released in 1.6) and all of the pPysal stuff using multiprocessing. Defaulting to 1 will kill some of the code as written. I wonder if other projects have gone down this road and whether a community standard is starting to emerge (a PEP maybe?)
  2. Can we call a function in the list of default arguments? Something like the snippet below? (There is a note on evaluation time at the end of this comment.)
def my_func(arg1, arg2, cores=mp.cpu_count()):
    pass

I am hesitant to just default to 1. In cases where we want to support IPython integration, defaulting to 1 will crash on a dual-core machine, and in cases where we slice by index (original post) it will crash as well.

  3. I think that this is a good idea. Someone passing more cores than they have will (likely) see a speed decrease, since the chunks will get really small and the overhead from spawning the children will start to increase. So a warning (or a silent check) might be in order.

This hits what I see as the root questions: Is PySAL going to support multiprocessing in trunk? (It looks like yes.) If so, are we going to make it a black box that just works, or leave the interfacing to the user? The former requires that we perform these checks, etc.; the latter assumes that the developer using the library is fluent enough with the multiprocessing library not to break something.
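
On point 2 above: a default like cores=mp.cpu_count() is legal, but the call runs once when the def statement is executed (i.e., at import), not on every call. A quick illustration of the two spellings:

import multiprocessing as mp

# Evaluated a single time, at definition/import time.
def my_func(arg1, arg2, cores=mp.cpu_count()):
    return cores

# Deferred to call time via the None sentinel.
def my_func_lazy(arg1, arg2, cores=None):
    return mp.cpu_count() if cores is None else cores

For cpu_count() the result does not change at runtime, so the difference is mostly a style question, but the None sentinel keeps the behaviour consistent with mp.Pool itself.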
