Failed to run the simulator #66
Comments
Hi @marceloamaral, This is definitely not expected behavior. @ICGog and I are currently quite busy with a major deadline, but we can dig into this after May 3rd. I saw that you posted a bunch of other comments debugging this (which strangely GitHub doesn't show for me). One reason the |
Hi @ms705, thank you for replying!

Unfortunately, I was using an older version of the source code, the one from the branch https://github.com/Huawei-PaaS/firmament. Once I realized that, I updated to the master version from this repository. The good news is that this solved the previous problem with machine_tmpl->ParseFromFileDescriptor(fd). As you mentioned, it was most probably caused by wrong paths.

However, I am still running into problems. The error now is:

F0427 17:13:55.563627 7817 flow_graph_node.cc:43] Check failed: InsertIfNotPresent(&outgoing_arc_map_, arc->dst_, arc)

This error comes from the call CHECK(InsertIfNotPresent(&outgoing_arc_map_, arc->dst_, arc));

As a quick trial-and-error experiment, I disabled this CHECK. However, that just moved the problem to the flowlessly module, as shown in the log below. The log shows errors in the flowlessly module, yet the simulation itself keeps executing while the log stops showing new data. Maybe it reaches a deadlock; I'm not sure. In any case, I believe there is a strong constraint on updating the arcs in this situation.

Might it be related to the simulation parameters? With which parameters can you run it successfully?

Below are the parameters I am using and the execution logs:

--simulation=synthetic simulator.WARNING log/simulator.INFO [...]
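For readers following along, here is a minimal sketch of why a check of this shape fails. It assumes that InsertIfNotPresent has the usual map-utility semantics (insert and return true only when the key is absent, otherwise leave the map untouched and return false). The class and function names below are hypothetical Python stand-ins, not Firmament's code.

```python
def insert_if_not_present(mapping, key, value):
    """Insert key -> value only if key is absent; report whether it was inserted."""
    if key in mapping:
        return False
    mapping[key] = value
    return True


class ToyFlowGraphNode:
    """Hypothetical stand-in for a flow graph node (not Firmament's class)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.outgoing_arc_map = {}  # destination node id -> arc label

    def add_arc(self, dst_id, arc):
        # Mirrors the shape of CHECK(InsertIfNotPresent(...)):
        # abort instead of silently keeping two arcs to one destination.
        if not insert_if_not_present(self.outgoing_arc_map, dst_id, arc):
            raise RuntimeError(
                f"Check failed: arc {self.node_id} -> {dst_id} already exists"
            )


task = ToyFlowGraphNode(1)
task.add_arc(2, "task-to-machine")       # first arc to node 2: accepted
try:
    task.add_arc(2, "task-to-machine")   # second arc to node 2: rejected
except RuntimeError as err:
    print(err)
```

Under that assumption, the failure means an arc from the same source to the same destination was added twice, rather than the existing arc being updated or removed first.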
To debug a little more, I started printing the src, dst and type of every AddArc call, and it turns out that it fails when "adding" a scheduled task. Since updating a scheduled task is typically related to preemption, I disabled preemption in the parameters. As expected, it stopped failing!

Additionally, is it documented somewhere how to enable the statistics? After running the simulation, some of the generated traces are empty, such as "task_usage_stat". Also, what do the columns in the traces mean? For example, "scheduler_events.csv" has many columns.

results/simu-debug/trace-path/scheduler_events/scheduler_events.csv
results/simu-debug/trace-path/machines_to_racks/machines_to_racks.csv
results/simu-debug/trace-path/quincy_tasks/quincy_tasks.csv
results/simu-debug/trace-path/task_usage_stat/task_usage_stat.csv
results/simu-debug/trace-path/jobs_num_tasks/jobs_num_tasks.csv
results/simu-debug/trace-path/task_events/part-00000-of-00500.csv
results/simu-debug/trace-path/machine_events/part-00000-of-00001.csv
results/simu-debug/trace-path/tasks_to_blocks/tasks_to_blocks.csv
results/simu-debug/trace-path/task_runtime_events/task_runtime_events.csv
results/simu-debug/trace-path/dfs_events/dfs_events.csv
@marceloamaral The schema of the CSV files generated is generally the same as the schema of the Google cluster trace. We added some additional files, and for those, you'll find the schema in the code that generates the file (or you can ask us). To get data in the |
Hi, I am trying to run the simulation with a synthetic trace, but it is failing.
I have tried the cs2 and flowlessly solvers, and cost models 0 and 6.
The general configuration I used is shown below:
build-release/src/simulator \
--simulation=synthetic \
--synthetic_num_jobs=100 \
--synthetic_num_machines=10 \
--synthetic_machine_failure_duration=0 \
--synthetic_task_duration=2 \
--synthetic_tasks_per_job=2 \
--runtime=100000000000 \
--scheduler=flow \
--flow_scheduling_cost_model=6 \
--preemption \
--simulated_dfs_type=bounded \
--simulated_block_size=1073741824 \
--max_sample_queue_size=10 \
--solver=cs2 \
--log_solver_stderr \
--max_solver_runtime=100000000000 \
--machine_tmpl_file=../../tests/testdata/mach_16pus.pbin \
--generate_trace \
--generated_trace_path=firmament/results/simu-release/trace-path \
--generate_quincy_cost_model_trace \
--log_dir=firmament/results/simu-release/log \
--quincy_no_scheduling_delay \
--online_factor 1 -v 10
For cost model 0, it is failing with the following error:
F0425 18:59:30.177265 24372 trivial_cost_model.cc:139] Check failed: leaf_res_ids_->size() >= FLAGS_num_pref_arcs_task_to_res (0 vs. 1)
The traces:
results/simu-release/trace-path/task_events/part-00000-of-00500.csv
1000000,,1,1,,0,,,,,,,
results/simu-release/trace-path/machine_events/part-00000-of-00001.csv
0,1,0,,,
0,2,0,,,
0,3,0,,,
0,4,0,,,
0,5,0,,,
0,6,0,,,
0,7,0,,,
0,8,0,,,
0,9,0,,,
0,10,0,,,
For cost model 6, it is failing with the following error:
*** Error in `firmament/build-release/src/simulator': corrupted double-linked list: 0x000000000133bdb0 ***
The log says:
W0425 18:59:58.946012 24390 trace_generator.cc:264] 100% of tasks are unscheduled
results/simu-release/trace-path/scheduler_events/scheduler_events.csv
1000000,388,0,930,2,0,0,2,2,10,16,0,25,1,0,2,11,1,1,1,2,0,10,0,10,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1000388,358,0,934,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1000746,459,0,1085,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1001205,460,0,1067,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
etc...
results/simu-release/trace-path/task_events/part-00000-of-00500.csv
1000000,,1,1,,0,,,,,,,
1000000,,1,2,,0,,,,,,,
2000000,,2,1,,0,,,,,,,
2000000,,2,2,,0,,,,,,,
3000000,,3,1,,0,,,,,,,
3000000,,3,2,,0,,,,,,,
results/simu-release/trace-path/machine_events/part-00000-of-00001.csv
0,1,0,,,
0,2,0,,,
0,3,0,,,
0,4,0,,,
0,5,0,,,
0,6,0,,,
0,7,0,,,
0,8,0,,,
0,9,0,,,
0,10,0,,,
1012484,10,1,,,
1012484,10,0,,,
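A small sketch for reading the two trace excerpts above. It assumes task_events and machine_events follow the Google cluster trace (v2) column layout, which the maintainers say is generally the case in this thread: the task event type is the sixth field, and the machine event type is the third. The helper names and the inlined sample lines are only for illustration.

```python
import csv
from collections import Counter

# Event codes as defined by the Google cluster trace v2 format (assumed to apply here).
TASK_EVENT_TYPES = {
    0: "SUBMIT", 1: "SCHEDULE", 2: "EVICT", 3: "FAIL", 4: "FINISH",
    5: "KILL", 6: "LOST", 7: "UPDATE_PENDING", 8: "UPDATE_RUNNING",
}
MACHINE_EVENT_TYPES = {0: "ADD", 1: "REMOVE", 2: "UPDATE"}


def count_task_events(lines):
    """Count task events by type; the event type is column 5 (0-based)."""
    counts = Counter()
    for row in csv.reader(lines):
        if row:
            counts[TASK_EVENT_TYPES[int(row[5])]] += 1
    return counts


def parse_machine_events(lines):
    """Return (timestamp, machine_id, event_name) for each machine event."""
    return [
        (int(row[0]), int(row[1]), MACHINE_EVENT_TYPES[int(row[2])])
        for row in csv.reader(lines)
        if row
    ]


# Sample lines copied from the cost model 6 excerpts above.
TASK_LINES = [
    "1000000,,1,1,,0,,,,,,,",
    "1000000,,1,2,,0,,,,,,,",
    "2000000,,2,1,,0,,,,,,,",
    "2000000,,2,2,,0,,,,,,,",
    "3000000,,3,1,,0,,,,,,,",
    "3000000,,3,2,,0,,,,,,,",
]
MACHINE_LINES = ["0,10,0,,,", "1012484,10,1,,,", "1012484,10,0,,,"]

print(count_task_events(TASK_LINES))
print(parse_machine_events(MACHINE_LINES))
```

If the assumed layout holds, the task excerpt contains only SUBMIT events and no SCHEDULE events, which would be consistent with the "100% of tasks are unscheduled" warning, and machine 10 is removed and re-added at timestamp 1012484.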
With the COCO cost model, at least some traces are generated, but with the TRIVIAL cost model, almost no trace is generated. I am most probably missing some configuration. Could you please help me?
Thanks!