Error when using hardware path #40

Open
chen982 opened this issue Nov 22, 2023 · 9 comments

@chen982

chen982 commented Nov 22, 2023

When I run the command, it shows an error like this:
./ll_crc_example hardware_path
The example will be run on the hardware path.
Starting CRC job example.
Caclulating CRC for region of size 1KB.
An error (100) occured during job execution.

So I tried two steps:

  1. Check that the .so is OK:
    ldd /usr/bin/accel-config
    linux-vdso.so.1 (0x00007fffe05d5000)
    libaccel-config.so.1 => /usr/lib64/libaccel-config.so.1 (0x00007f14c4bdf000)
    libjson-c.so.4 => /lib/x86_64-linux-gnu/libjson-c.so.4 (0x00007f14c4bba000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f14c49c8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f14c4c0e000)
  2. Load the configuration:
    sudo python3 accel_conf.py --load=../configs/1n1d1e1w-s-n1.conf
    Filter:
    No active devices
    Loading configuration - done
    Additional configuration steps
    Force block on fault: False
    Enabling configured devices
    dsa0 - error
    wq0.0 - error

    failed in dsa0/wq0.0
    enabled 0 wq(s) out of 1

    Checking configuration
    No active devices

How can I fix this? And once step 2 succeeds, should ./ll_crc_example hardware_path run correctly?
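
The "No active devices" lines can be cross-checked against what the kernel driver itself reports. The following is a minimal sketch, assuming the idxd driver exposes devices and work queues under /sys/bus/dsa/devices with a `state` file in each entry. The function name `read_idxd_states` is hypothetical and is not part of accel-config or DML.

```python
from pathlib import Path


def read_idxd_states(sysfs_root="/sys/bus/dsa/devices"):
    """Return {entry_name: state} for every sysfs entry that has a `state` file.

    Assumption: the idxd driver lists devices (dsa0) and work queues (wq0.0)
    under `sysfs_root`. Entries without a `state` file are skipped.
    """
    root = Path(sysfs_root)
    states = {}
    if not root.is_dir():
        # Driver not loaded, or no DSA hardware visible to the kernel.
        return states
    for entry in sorted(root.iterdir()):
        state_file = entry / "state"
        if state_file.is_file():
            states[entry.name] = state_file.read_text().strip()
    return states


if __name__ == "__main__":
    found = read_idxd_states()
    if not found:
        print("no idxd entries found")
    for name, state in found.items():
        print(f"{name}: {state}")
```

An empty result would suggest the problem sits below accel-config (driver or hardware visibility), while a `disabled` state for dsa0 would be consistent with the failed enable step shown above.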

@chen982
Author

chen982 commented Nov 22, 2023

When I try to configure it manually, it shows something different:
sudo accel-config load-config -c ../configs/1n1d1e1w-s-n1.conf

sudo accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[ 0x14] dsa0: Sum of WQCFG size fields out of range

If I decrease the size, it shows a WQ group config error, so it seems the config does not fit this device?
I then updated the configuration manually and successfully enabled a shared WQ.
After that I reran ./ll_crc_example hardware_path. It still shows "An error (100) occured during job execution." The accel-config list output looks like this:
[
  {
    "dev":"dsa0",
    "read_buffer_limit":0,
    "max_groups":4,
    "max_work_queues":8,
    "max_engines":4,
    "work_queue_size":128,
    "numa_node":0,
    "gen_cap":"0x40915f0107",
    "version":"0x100",
    "state":"enabled",
    "max_read_buffers":96,
    "max_batch_size":1024,
    "ims_size":2048,
    "max_transfer_size":2147483648,
    "configurable":1,
    "pasid_enabled":1,
    "cdev_major":244,
    "clients":0,
    "groups":[
      {
        "dev":"group0.0",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0,
        "read_buffers_allowed":8,
        "traffic_class_a":0,
        "traffic_class_b":1,
        "grouped_workqueues":[
          {
            "dev":"wq0.0",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":1024,
            "max_transfer_size":2147483648,
            "cdev_minor":0,
            "type":"user",
            "name":"app1",
            "threshold":15,
            "ats_disable":0,
            "state":"enabled",
            "clients":0
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.0",
            "group_id":0
          },
          {
            "dev":"engine0.1",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group0.1",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0,
        "read_buffers_allowed":8,
        "traffic_class_a":0,
        "traffic_class_b":1,
        "grouped_engines":[
          {
            "dev":"engine0.2",
            "group_id":1
          },
          {
            "dev":"engine0.3",
            "group_id":1
          }
        ]
      },
      {
        "dev":"group0.2",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0,
        "read_buffers_allowed":8,
        "traffic_class_a":0,
        "traffic_class_b":1
      },
      {
        "dev":"group0.3",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0,
        "read_buffers_allowed":8,
        "traffic_class_a":0,
        "traffic_class_b":1
      }
    ]
  }
]
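
A listing like the one above can be checked mechanically. The following is a minimal sketch, assuming the JSON layout shown in this comment and assuming that the device-level `work_queue_size` is the total budget that the per-WQ `size` fields must fit into (an interpretation suggested by the "Sum of WQCFG size fields out of range" error, not confirmed in this thread). The function name `summarize_devices` is hypothetical.

```python
import json


def summarize_devices(list_output):
    """Summarize `accel-config list` JSON per device.

    Reports the summed size of all listed work queues next to the
    device-level `work_queue_size` (assumed to be the total budget),
    plus the enabled work queues whose group has at least one engine.
    """
    summary = []
    for dev in json.loads(list_output):
        total_size = 0
        usable = []
        for group in dev.get("groups", []):
            has_engine = bool(group.get("grouped_engines"))
            for wq in group.get("grouped_workqueues", []):
                total_size += wq.get("size", 0)
                if wq.get("state") == "enabled" and has_engine:
                    usable.append(wq["dev"])
        summary.append({
            "dev": dev["dev"],
            "state": dev.get("state"),
            "wq_size_used": total_size,
            "wq_size_limit": dev.get("work_queue_size"),
            "usable_wqs": usable,
        })
    return summary
```

Feeding it the output of `accel-config list` (for example via `subprocess.run(["accel-config", "list"], capture_output=True, text=True).stdout`) would show at a glance whether a device has an enabled work queue backed by an engine, and how much of the size budget the current configuration consumes.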

@mzhukova
Contributor

mzhukova commented Nov 22, 2023

Hi @chen982, are you using this config file?
Also, what kind of hardware do you have?

@chen982
Author

chen982 commented Nov 22, 2023

Hi @chen982, are you using this config file? Also, what kind of hardware do you have?

Yes, I am using this config, without success. I then configured the DSA as shown in the output I posted, but running the hardware path still returns error 100. What steps should I follow to use it?

@mzhukova
Contributor

Hi @chen982, are you using this config file? Also, what kind of hardware do you have?

Yes, I am using this config, without success. I then configured the DSA as shown in the output I posted, but running the hardware path still returns error 100. What steps should I follow to use it?

Could you please do accel-config --version?

@chen982
Author

chen982 commented Nov 22, 2023

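Hi @chen982, are you using this config file? Also, what kind of hardware do you have?

Yes, I am using this config, without success. I then configured the DSA as shown in the output I posted, but running the hardware path still returns error 100. What steps should I follow to use it?

Could you please do accel-config --version?

I installed accel-config v4.1.3 from the latest GitHub source.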

@mzhukova
Contributor

@chen982 I think there are a couple of issues going on.

First of all, there is a configuration issue. It is confusing to me that you're not able to configure the device (you're getting the "failed device" message), yet your accel-config list output is non-empty. Also, this output doesn't seem to match what we have in the config file.

I would recommend running the following commands to disable your current configuration:

sudo accel-config disable-wq dsa0/wq0.0
sudo accel-config disable-device dsa0

And then repeat sudo accel-config load-config -c ../configs/1n1d1e1w-s-n1.conf (with the DML config file), and so on.

If and only if this is resolved (meaning you're able to configure the device correctly with the DML config file) and you're still getting error code 100 from DML, you might want to check that LD_LIBRARY_PATH includes the location of the libaccel-config library.
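
The LD_LIBRARY_PATH check can be scripted. The following is a minimal sketch that only inspects the directories named in LD_LIBRARY_PATH for a file whose name starts with `libaccel-config.so`; it does not reproduce the dynamic loader's full search order. The function name `find_libaccel_config` is hypothetical.

```python
import os
from pathlib import Path


def find_libaccel_config(search_path=None):
    """Return the LD_LIBRARY_PATH directories that contain libaccel-config.so*.

    `search_path` defaults to the current LD_LIBRARY_PATH value; pass a
    colon-separated string explicitly to check a candidate setting.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty components such as a trailing ':'
        directory = Path(entry)
        if directory.is_dir() and any(directory.glob("libaccel-config.so*")):
            hits.append(str(directory))
    return hits


if __name__ == "__main__":
    found = find_libaccel_config()
    print(found if found else "libaccel-config not found via LD_LIBRARY_PATH")
```

In the ldd output earlier in this thread the library resolves to /usr/lib64/libaccel-config.so.1, so an empty result here would suggest adding /usr/lib64 to LD_LIBRARY_PATH before running ./ll_crc_example hardware_path.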

@chen982
Author

chen982 commented Nov 22, 2023

@chen982 I think there are a couple of issues going on.

First of all, there is a configuration issue. It is confusing to me that you're not able to configure the device (you're getting the "failed device" message), yet your accel-config list output is non-empty. Also, this output doesn't seem to match what we have in the config file.

I would recommend running the following commands to disable your current configuration:

sudo accel-config disable-wq dsa0/wq0.0
sudo accel-config disable-device dsa0

And then repeat sudo accel-config load-config -c ../configs/1n1d1e1w-s-n1.conf (with the DML config file), and so on.

If and only if this is resolved (meaning you're able to configure the device correctly with the DML config file) and you're still getting error code 100 from DML, you might want to check that LD_LIBRARY_PATH includes the location of the libaccel-config library.

Yes. When I use the config shipped with DML, it shows an error, so I used my own shared WQ config and enabled it successfully. That setup does not work for the DML hardware path, although it works for my other job. So which versions of the kernel and idxd-config should I use? I am currently using the idxd-driver stage2.5 kernel (Linux 5.12-rc8+) and accel-config 4.1.3. I think the kernel source is the cause of this problem. Could you give me a kernel and kernel config file that can run DML properly?
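
When comparing kernels across machines, it can help to check the running release against a chosen minimum. The following is a minimal sketch; the function names are hypothetical, and the minimum version is left to the caller because this thread does not state which kernel DML requires.

```python
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.12.0-rc8+'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(minimum, release=None):
    """Return True if the kernel release is >= the (major, minor) tuple `minimum`.

    `release` defaults to the running kernel as reported by platform.release().
    """
    if release is None:
        release = platform.release()
    return parse_kernel_release(release) >= tuple(minimum)


if __name__ == "__main__":
    print("running kernel:", platform.release())
```

For example, `kernel_at_least((5, 10), "5.12.0-rc8+")` evaluates to `True`; the `(5, 10)` value is purely illustrative and not a documented requirement.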

@greenhandzpx

I've encountered the same problem. Have you solved it?

abdelrahim-hentabli self-assigned this Oct 23, 2024
@abdelrahim-hentabli
Contributor

Yes. When I use the config shipped with DML, it shows an error, so I used my own shared WQ config and enabled it successfully. That setup does not work for the DML hardware path, although it works for my other job. So which versions of the kernel and idxd-config should I use? I am currently using the idxd-driver stage2.5 kernel (Linux 5.12-rc8+) and accel-config 4.1.3. I think the kernel source is the cause of this problem. Could you give me a kernel and kernel config file that can run DML properly?

Could you share which config file you are using that does work?
