Task '/bin/sleep 60' has failed #76

Open
gabrielecastellano opened this issue Nov 16, 2020 · 1 comment

Hello everyone,
I managed to run Firmament using the provided Docker image.
When I run the container, it prints the following error (I don't know whether it is related to my issue):

$ docker run -p 9999:9999 -w /firmament camsas/firmament:dev /firmament/build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8081 --http_ui_port 9999 --task_lib_dir=/firmament/build/src
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/H2GR6RBYUIPHBXDMSGKPBAYWNE:/var/lib/docker/overlay2/l/NKEZN6MLXD4DGK5HNNI2K4SN7K:/var/lib/docker/overlay2/l/5H5GK4TBC5MY7NFNYEW2P7MPRP:/var/lib/docker/overlay2/l/2DVGBZKGQHVXVENMWHAW3HNEGB:/var/lib/docker/overlay2/l/DA5VWJ6IOM3MFNW3T6VLSZ4ZDR:/var/lib/docker/overlay2/l/NFSSHKRHC7XPWN7BXCLFDMXHF6:/var/lib/docker/overlay2/l/C4RYQ3MDIDZ376KHATSEPRHOOC:/var/lib/docker/overlay2/l/23CTT2D5BDVQOVVUTHAGP4SPKX:/var/lib/docker/overlay2/l/UTO3PZRTFU4CU'

Despite this, the server seems to be running correctly, and I am able to access the GUI at http://:9999/

However, when I tried to submit a job with
python scripts/job/job_submit.py 172.17.0.2 9999 /bin/sleep 60
I got the following error:
E1116 17:19:04.534961 6 task_health_checker.cc:51] Task 18085502784089753274 has failed!

Here is /tmp/coordinator.INFO:

I1116 17:16:14.514029     1 coordinator_main.cc:36] Firmament coordinator starting ...
I1116 17:16:14.531463     1 coordinator.cc:120] Using Quincy-style min cost flow-based scheduler.
I1116 17:16:14.531641     1 coordinator.cc:133] Coordinator starting on host tcp:0.0.0.0:8081, UUID 42f151f8-deef-46b8-b8a6-88ab53e5e6a7
I1116 17:16:14.531744     1 coordinator.cc:221] Detecting resource topology:
I1116 17:16:14.531754     1 topology_manager.cc:212] *** LEVEL: 0
I1116 17:16:14.531767     1 topology_manager.cc:217] Index: 0: Machine#0(7470MB)
I1116 17:16:14.531774     1 topology_manager.cc:212] *** LEVEL: 1
I1116 17:16:14.531781     1 topology_manager.cc:217] Index: 0: Socket#0
I1116 17:16:14.531786     1 topology_manager.cc:212] *** LEVEL: 2
I1116 17:16:14.531793     1 topology_manager.cc:217] Index: 0: L3(6144KB)
I1116 17:16:14.531800     1 topology_manager.cc:212] *** LEVEL: 3
I1116 17:16:14.531805     1 topology_manager.cc:217] Index: 0: L2(256KB)
I1116 17:16:14.531812     1 topology_manager.cc:217] Index: 1: L2(256KB)
I1116 17:16:14.531819     1 topology_manager.cc:217] Index: 2: L2(256KB)
I1116 17:16:14.531826     1 topology_manager.cc:217] Index: 3: L2(256KB)
I1116 17:16:14.531831     1 topology_manager.cc:212] *** LEVEL: 4
I1116 17:16:14.531838     1 topology_manager.cc:217] Index: 0: L1d(32KB)
I1116 17:16:14.531846     1 topology_manager.cc:217] Index: 1: L1d(32KB)
I1116 17:16:14.531852     1 topology_manager.cc:217] Index: 2: L1d(32KB)
I1116 17:16:14.531859     1 topology_manager.cc:217] Index: 3: L1d(32KB)
I1116 17:16:14.531864     1 topology_manager.cc:212] *** LEVEL: 5
I1116 17:16:14.531870     1 topology_manager.cc:217] Index: 0: Core#0
I1116 17:16:14.531877     1 topology_manager.cc:217] Index: 1: Core#1
I1116 17:16:14.531883     1 topology_manager.cc:217] Index: 2: Core#2
I1116 17:16:14.531889     1 topology_manager.cc:217] Index: 3: Core#3
I1116 17:16:14.531894     1 topology_manager.cc:212] *** LEVEL: 6
I1116 17:16:14.531900     1 topology_manager.cc:217] Index: 0: PU#0
I1116 17:16:14.531908     1 topology_manager.cc:217] Index: 1: PU#1
I1116 17:16:14.531913     1 topology_manager.cc:217] Index: 2: PU#2
I1116 17:16:14.531920     1 topology_manager.cc:217] Index: 3: PU#3
I1116 17:16:14.531926     1 coordinator.cc:176] Found 4 local PUs.
I1116 17:16:14.531932     1 coordinator.cc:177] Resource URI is tcp:0.0.0.0:8081
I1116 17:16:14.534741     1 coordinator_http_ui.cc:1321] Coordinator HTTP interface up!
I1116 17:16:22.949242    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:16:23.151162     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:16:23.151223     9 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:17:25.160835     9 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:17:25.308948    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:17:25.308990    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:28.951195    14 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /task/
I1116 17:18:29.114184    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /stats/
W1116 17:18:29.114243    16 coordinator_http_ui.cc:834] Invalid stats request!
I1116 17:18:57.184258    16 coordinator_http_ui.cc:1226] [HTTPREQ] Serving /job/submit/
I1116 17:18:57.198359    16 coordinator.cc:865] NEW JOB: 1468db75-43d3-417e-9e26-f9843eba8c8e
I1116 17:18:57.198387    16 flow_scheduler.cc:405] START SCHEDULING (via 1468db75-43d3-417e-9e26-f9843eba8c8e)
W1116 17:18:57.198391    16 flow_scheduler.cc:406] This way of scheduling a job is slow in the flow scheduler! Consider using ScheduleAllJobs() instead.
I1116 17:18:57.198488    16 utils.cc:341] External execution of command: build/third_party/cs2/src/cs2/cs2.exe
I1116 17:18:57.475673    20 local_executor.cc:393] COMMAND LINE for task 18085502784089753274: perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60
I1116 17:18:57.476095    16 coordinator.cc:911] Attempted to schedule job 1468db75-43d3-417e-9e26-f9843eba8c8e, successfully scheduled 1 tasks.
E1116 17:19:04.534961     6 task_health_checker.cc:51] Task 18085502784089753274 has failed!
I1116 17:19:04.535176     6 event_driven_scheduler.cc:144] Task 18085502784089753274 has not reported heartbeats for 60s and its handler thread has exited. Declaring it FAILED!
I1116 17:19:04.535195     6 local_executor.cc:145] kill(2) for task 18085502784089753274 returned -1
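A log like the one above is easier to scan once the informational lines are filtered out. The following is a minimal sketch, assuming the standard glog line prefix (severity letter, MMDD date, timestamp, thread ID, `file:line]`); the names `GLOG_LINE` and `problems` are made up for illustration and are not part of Firmament.

```python
import re

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line] message
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def problems(lines):
    """Return (time, source, message) for every warning, error, or fatal line."""
    found = []
    for line in lines:
        m = GLOG_LINE.match(line)
        if m and m.group("sev") in "WEF":
            found.append((m.group("time"), m.group("src"), m.group("msg")))
    return found


# Three lines copied from the log above.
sample = [
    "I1116 17:18:57.476095    16 coordinator.cc:911] Attempted to schedule job 1468db75-43d3-417e-9e26-f9843eba8c8e, successfully scheduled 1 tasks.",
    "E1116 17:19:04.534961     6 task_health_checker.cc:51] Task 18085502784089753274 has failed!",
    "W1116 17:16:23.151223     9 coordinator_http_ui.cc:834] Invalid stats request!",
]
for when, src, msg in problems(sample):
    print(when, src, msg)
```

To run it against the full file, pass `open("/tmp/coordinator.INFO")` to `problems` instead of `sample`.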

And here is what I get from the GUI:
[screenshot: firmament]

When I click the stderr link, I get:

E1116 17:18:57.757828 21 local_executor.cc:443] execvp failed for task command 'perf stat -o /tmp/firmament-perf/aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf -e cpu-clock,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,cache-misses,cache-references,stalled-cycles-frontend,stalled-cycles-backend,node-loads,node-load-misses -- /bin/sleep 60 ': No such file or directory [2]
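For reference, the `[2]` at the end of that message is errno 2 (`ENOENT`), which exec-style calls return when the program to launch cannot be found. The logged command starts with `perf`, not `/bin/sleep`, so one possible reading is that `perf` is missing from the image. That is an assumption, not something confirmed in this thread, and how Firmament builds the argument vector is not shown here. The sketch below reproduces the error code with a deliberately nonexistent program name and then checks whether the two programs from the command line resolve; `try_exec` is a hypothetical helper.

```python
import errno
import shutil
import subprocess


def try_exec(argv):
    """Launch argv without a shell; return None on success or the errno on failure."""
    try:
        subprocess.run(
            argv,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return None
    except OSError as exc:
        return exc.errno


# A made-up program name that should not exist anywhere on PATH.
code = try_exec(["no-such-program-firmament-demo", "--version"])
print("launch failed with ENOENT:", code == errno.ENOENT)

# Check whether the programs named in the logged command resolve here.
# shutil.which returns the resolved path, or None if nothing executable is found.
for name in ("perf", "/bin/sleep"):
    print(name, "->", shutil.which(name))
```

Run inside the container (for example via `docker exec`), a `None` next to `perf` would support the missing-binary reading; a real path would point elsewhere.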

What am I missing?

Thanks!
Gabriele

5symx commented Dec 13, 2023

I fixed it by adding the file aa1d8806-8de1-4c73-b634-214341eed606-18085502784089753274.perf to /tmp/firmament-perf/ in the Docker container. It seems to be working well.
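The comment does not say how the file was added. As a minimal sketch of one way to do it, the helper below creates the directory if it is missing and then an empty file inside it. The function name `ensure_perf_file` and the file name `example-task.perf` are hypothetical. Because the file name in the log contains the task ID, it presumably differs for each task (an assumption), so the directory may be the part that matters.

```python
import os
import tempfile


def ensure_perf_file(perf_dir, file_name):
    """Create perf_dir if needed and an empty file_name inside it; return the path."""
    os.makedirs(perf_dir, exist_ok=True)
    path = os.path.join(perf_dir, file_name)
    # Append mode creates the file if absent and leaves existing content untouched.
    with open(path, "a"):
        pass
    return path


# Demonstrated in a scratch directory; inside the container the directory
# taken from the log would be /tmp/firmament-perf.
scratch = os.path.join(tempfile.mkdtemp(), "firmament-perf")
created = ensure_perf_file(scratch, "example-task.perf")  # hypothetical file name
print(os.path.isfile(created))
```

Calling it a second time with the same arguments is harmless, since both the directory and file creation tolerate existing paths.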
