KeyError job_dict["state"] + strange DB Entry #176

Open
oleweidner opened this issue Feb 18, 2014 · 3 comments

@oleweidner (Contributor)
This error was reported in bigjob-users by Scott Michael:

KeyError in agent:

  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/bigjob/bigjob_agent.py", line 720, in start_new_job_in_thread
    if(job_dict["state"]==str(bigjob.state.Unknown)):
KeyError: 'state'
Traceback (most recent call last):
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/bigjob/bigjob_agent.py", line 720, in start_new_job_in_thread
    if(job_dict["state"]==str(bigjob.state.Unknown)):
KeyError: 'state'
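
The failing check is at bigjob_agent.py line 720, where the CU hash read back from Redis is missing its 'state' field. A minimal defensive sketch of that check (not the project's actual fix; UNKNOWN stands in for str(bigjob.state.Unknown) in the agent code, and whether a missing field should count as Unknown or be logged as corruption is a separate question):

# Sketch: tolerate a missing "state" field instead of raising KeyError.
# UNKNOWN stands in for str(bigjob.state.Unknown) in the agent code.
UNKNOWN = "Unknown"

def is_unknown_state(job_dict):
    # dict.get returns a default instead of raising KeyError,
    # so a corrupted CU entry no longer crashes the worker thread.
    return job_dict.get("state", UNKNOWN) == UNKNOWN

print(is_unknown_state({"state": "Unknown"}))  # True
print(is_unknown_state({}))                    # True: missing field treated as Unknown
print(is_unknown_state({"state": "Running"}))  # False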

In addition, the CU description stored in Redis is garbled. Instead of the expected entry:

{'Executable': '/N/dc2/projects/BDBS/cijohnson//dorunner.sh',
 'WorkingDirectory': '/N/dc2/projects/BDBS/cijohnson//./lp4.0058bm6.8292',
 'NumberOfProcesses': '1',
 'start_time': '1390836474.31',
 'Environment': "['TASK_NO=4687']",
 'state': 'Unknown',
 'Arguments': "['/N/dc2/projects/BDBS/cijohnson/./lp4.0058bm6.8292 tu1783717_58.in\\n']",
 'Error': 'tu1783717_58.err',
 'Output': 'tu1783717_58.out',
 'job-id': 'sj-976b4976-8767-11e3-adde-001fc6d94bec',
 'SPMDVariation': 'single'}

the following was stored:

{'a': 'd', 'c': '-', 'b': 'e', 'd': 'e', 'f': 'c', '-': '0', '3': '-',
 '1': 'e', '0': '1', 's': 'j', '7': 'f', '6': 'd', '9': '4', '8': '7'}
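
Note that every key and value in the corrupted hash is a single character that occurs in the CU's job id. That is the pattern you get when a flat string is consumed pairwise where field/value items were expected; a hypothetical illustration (an assumption about the failure mode, not a confirmed code path):

# Hypothetical: pair up consecutive characters of the job id, as a hash
# builder would if handed a flat string instead of field/value items.
job_id = "sj-976b4976-8767-11e3-adde-001fc6d94bec"
garbled = dict(zip(job_id[::2], job_id[1::2]))
print(garbled)
# Repeated characters overwrite earlier pairs, leaving a small hash of
# one-character keys that closely resembles the entry dumped above.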
@oleweidner (Contributor, Author)

The complete job script:

import os
import commands
import sys
import pilot
import traceback

""" DESCRIPTION: Tutorial 1: A Simple Workload 
Note: User must edit USER VARIABLES section
This example will not run if these values are not set.
"""

# ---------------- BEGIN REQUIRED PILOT SETUP -----------------

# Distributed Coordination Service - Redis server and password
REDIS_PWD   = "ILikeBigJob_wITH-REdIS"  # Fill in the password to your redis server
REDIS_URL   = "redis://%[email protected]:6379" % REDIS_PWD

# Resource Information
HOSTNAME     = "localhost"  # Remote Resource URL
USER_NAME    = "scamicha"  # Username on the remote resource
SAGA_ADAPTOR = "pbs"  # Name of the SAGA adaptor, e.g. fork, sge, pbs, slurm, etc.
# NOTE: See complete list of BigJob supported SAGA adaptors at:
# http://saga-project.github.io/BigJob/sphinxdoc/tutorial/table.html

# Fill in queue and allocation for the given resource
# Note: Set fields to "None" if not applicable
QUEUE        = "batch"  # Add queue you want to use
PROJECT      = "None"  # Add project / allocation / account to charge

WALLTIME     = 1440  # Maximum Runtime (minutes) for the Pilot Job

WORKDIR      = os.getenv("HOME") + "/agent"  # Path of Resource Working Directory
# This is the directory where BigJob will store its output and error files

SPMD_VARIATION = "None"  # Specify the WAYNESS of SGE clusters ONLY, valid input '12way' for example

PROCESSES_PER_NODE = 8  # Valid on PBS clusters ONLY - this is the number of processors per node.
# One processor core is treated as one processor on PBS; e.g. a node with 8 cores has a maximum ppn=8

PILOT_SIZE = 128  # Number of cores required for the Pilot-Job

# Job Information

datadir = "/N/dc2/projects/BDBS/cijohnson/"
files = []
os.chdir(datadir)
input = open('files.todo.01','r')
for line in input:
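    # NOTE: each line keeps its trailing newline here, which is why the CU
    # Arguments in the Redis dump above end in '\n'; line.rstrip('\n')
    # would remove it (hypothetical fix, not in the original script).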
    files.append(line)
NUMBER_JOBS=len(files)

# Continue to USER DEFINED TASK DESCRIPTION to add 
# the required information about the individual tasks.

# ---------------- END REQUIRED PILOT SETUP -----------------
#

def main():
    try:
        # this describes the parameters and requirements for our pilot job
        pilot_description = pilot.PilotComputeDescription()
        pilot_description.service_url = "%s://%s@%s" % (SAGA_ADAPTOR, USER_NAME, HOSTNAME)
        pilot_description.queue = QUEUE
        pilot_description.number_of_processes = PILOT_SIZE
        pilot_description.working_directory = WORKDIR
        pilot_description.walltime = WALLTIME
        pilot_description.processes_per_node = PROCESSES_PER_NODE
        pilot_description.spmd_variation = SPMD_VARIATION

        # create a new pilot job
        pilot_compute_service = pilot.PilotComputeService(REDIS_URL)
        pilotjob = pilot_compute_service.create_pilot(pilot_description)

        # submit tasks to pilot job
        tasks = list()
        for i in range(0, NUMBER_JOBS - 1):
            directory = files[i].rsplit('/', 1)[0]
            file = files[i].rsplit('/', 1)[1]
            # -------- BEGIN USER DEFINED TASK DESCRIPTION --------- #
            task_desc = pilot.ComputeUnitDescription()
            task_desc.executable = datadir + '/dorunner.sh'
            task_desc.arguments = [datadir + directory + ' ' + file]
            task_desc.environment = {'TASK_NO': i}
            task_desc.number_of_processes = 1
            task_desc.spmd_variation = 'single'  # Valid values are single or mpi
            task_desc.working_directory = datadir + '/' + directory
            task_desc.output = file.rsplit('.', 1)[0] + ".out"
            task_desc.error = file.rsplit('.', 1)[0] + ".err"
            # -------- END USER DEFINED TASK DESCRIPTION --------- #

            task = pilotjob.submit_compute_unit(task_desc)
            print "* Submitted task '%s' with id '%s' to %s" % (i, task.get_id(), HOSTNAME)
            tasks.append(task)

        print "Waiting for tasks to finish..."
        pilotjob.wait()

        return(0)

    except Exception, ex:
        print "AN ERROR OCCURRED: %s" % ((str(ex)))
        # print a stack trace in case of an exception -
        # this can be helpful for debugging the problem
        traceback.print_exc()
        return(-1)

    finally:
        # always try to shut down pilots, otherwise jobs might end up
        # lingering in the queue
        print("Terminating BigJob...")
        pilotjob.cancel()
        pilot_compute_service.cancel()


if __name__ == "__main__":
    sys.exit(main())
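
One incidental observation on the posted script (unrelated to the KeyError): the submission loop uses range(0, NUMBER_JOBS-1), which stops at NUMBER_JOBS-2, so the last entry in files.todo.01 is never submitted. A quick demonstration:

files = ["a.in", "b.in", "c.in"]
NUMBER_JOBS = len(files)
# range(0, NUMBER_JOBS - 1) skips the last file ...
print([files[i] for i in range(0, NUMBER_JOBS - 1)])  # ['a.in', 'b.in']
# ... while range(0, NUMBER_JOBS) covers all of them.
print([files[i] for i in range(0, NUMBER_JOBS)])      # ['a.in', 'b.in', 'c.in']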

@drelu (Member) commented Feb 18, 2014

Sorry, I cannot replicate this based on this script alone; I have neither the executable nor the input files. This needs to be narrowed down.

@scamicha

Hi there,

I'm the user who originally wrote to the mailing list with this issue. I don't think you'll be able to replicate this problem exactly: I was attempting to run ~120K subjobs, and the input data set is a little over 4 TB. I'd be happy to get it to you, but that is probably technically infeasible. I was able to run the pilot job with the debug level set to 5; the log file is located at https://iu.box.com/s/3611nik4aoop686vbrn9
