Support for checking error field on traceroute.Reply #29

Open
jmeggitt opened this issue Nov 16, 2022 · 7 comments

Comments

@jmeggitt

Issue

RIPE Atlas probes will occasionally emit hop replies containing an "error" field. This behavior is not documented for any firmware version on https://atlas.ripe.net/docs/apis/result-format/. Upon some investigation, some of these errors can be attributed to the following probe measurement code.
https://github.com/RIPE-NCC/ripe-atlas-probe-measurements/blob/master/eperd/traceroute.c#L636-L643
According to git blame, this has been, and continues to be, part of the probe behavior for over 10 years.

The ability to detect this field is needed to verify whether past and current traceroute measurement data is affected.

Affected Measurement Examples

I dumped every measurement containing this field from one of the RIPE Atlas hourly traceroute dump files (traceroute-2022-10-14T0400.bz2). The data is stored as newline-delimited JSON and can be found at https://gist.github.com/jmeggitt/11fba9f7fa539e8a4fdae1e231ec8fa1. The field appears to be extremely rare: it occurred in only 1690 of the 8,912,306 traceroute measurements in that dump file (0.019%).
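
A scan like the one described could be sketched as follows. This is a minimal sketch, not the script actually used: it assumes one JSON measurement per line and that the field only appears inside the per-hop "result" list, as in the examples in this issue. The function names are hypothetical.

```python
import json


def has_reply_error(measurement):
    """Return True if any hop reply in a traceroute result has an "error" key."""
    for hop in measurement.get("result", []):
        for reply in hop.get("result", []):
            if "error" in reply:
                return True
    return False


def count_affected(lines):
    """Return (affected, total) for an iterable of newline-delimited JSON lines."""
    affected = total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        total += 1
        if has_reply_error(json.loads(line)):
            affected += 1
    return affected, total
```

For a real dump the lines could come from `bz2.open("traceroute-2022-10-14T0400.bz2", "rt")` in the standard library. With the figures quoted here, 1690 / 8,912,306 works out to roughly 0.019%.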

Values

As far as I have seen, this field is always a string when it appears. This appears to be consistent with the probe code above.

count   value
      9 "bind failed: Address already in use"
     11 "bind failed: Address not available"
     47 "bind failed: Cannot assign requested address"
    334 "bind failed: Invalid argument"
    804 "sendto failed: Network is unreachable"
    104 "sendto failed: Network unreachable"
    364 "sendto failed: Operation not permitted"
     17 "sendto failed: Permission denied"
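
The counts in this table sum to 1690, matching the number of affected measurements. A tally of this kind could be reproduced with something like the sketch below. It is only an illustration with a hypothetical function name, and it counts every reply-level occurrence, so its totals would differ from a per-measurement count if a single measurement ever contained more than one error reply.

```python
from collections import Counter


def tally_error_values(measurements):
    """Count each reply-level "error" string across parsed measurements."""
    counts = Counter()
    for measurement in measurements:
        for hop in measurement.get("result", []):
            for reply in hop.get("result", []):
                if "error" in reply:
                    counts[reply["error"]] += 1
    return counts
```

Calling `tally_error_values(...).most_common()` on the parsed measurements would list the values from most to least frequent.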

Examples

Here are a couple measurements I arbitrarily chose to show off what they look like in the context of the data.

{
  "af": 6,
  "dst_addr": "2001:500:2::c",
  "dst_name": "2001:500:2::c",
  "endtime": 1665721544,
  "from": "2001:bc8:62c:2545::1",
  "fw": 5020,
  "lts": 11,
  "msm_id": 6011,
  "msm_name": "Traceroute",
  "mver": "2.2.0",
  "paris_id": 2,
  "prb_id": 1000410,
  "proto": "UDP",
  "result": [
    {
      "hop": 1,
      "result": [
        {
          "error": "sendto failed: Operation not permitted"
        }
      ]
    }
  ],
  "size": 40,
  "src_addr": "2001:bc8:62c:2545::1",
  "timestamp": 1665721544,
  "type": "traceroute"
}
{
  "af": 4,
  "dst_addr": "46.101.130.201",
  "dst_name": "46.101.130.201",
  "endtime": 1665722912,
  "from": "170.39.226.151",
  "fw": 5040,
  "group_id": 29556742,
  "lts": 1,
  "msm_id": 29556742,
  "msm_name": "Traceroute",
  "mver": "2.4.1",
  "paris_id": 14,
  "prb_id": 6927,
  "proto": "ICMP",
  "result": [
    {
      "hop": 1,
      "result": [
        {
          "x": "*"
        },
        {
          "x": "*"
        },
        {
          "x": "*"
        }
      ]
    },
    {
      "hop": 2,
      "result": [
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
  ],
  "size": 48,
  "src_addr": "170.39.226.151",
  "timestamp": 1665722900,
  "type": "traceroute"
}
{
  "af": 6,
  "dst_addr": "2a00:74c0:0:2::20",
  "dst_name": "2a00:74c0:0:2::20",
  "endtime": 1665722173,
  "from": "2a05:f6c7:3853:0:eade:27ff:fe69:dd4e",
  "fw": 5070,
  "group_id": 25639804,
  "lts": 38,
  "msm_id": 25639804,
  "msm_name": "Traceroute",
  "mver": "2.6.1",
  "paris_id": 7,
  "prb_id": 22203,
  "proto": "ICMP",
  "result": [
    {"hop":1,"result":[{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.948,"size":96,"ttl":255},{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.828,"size":96,"ttl":255},{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.759,"size":96,"ttl":255}]},
    {"hop":2,"result":[{"from":"2a05:f6c0:1::18","rtt":3.658,"size":96,"ttl":63},{"from":"2a05:f6c0:1::18","rtt":4.791,"size":96,"ttl":63},{"from":"2a05:f6c0:1::18","rtt":3.415,"size":96,"ttl":63}]},
    {"hop":3,"result":[{"from":"2a05:f6c0:2:23::1","rtt":5.209,"size":96,"ttl":62},{"from":"2a05:f6c0:2:23::1","rtt":4.848,"size":96,"ttl":62},{"from":"2a05:f6c0:2:23::1","rtt":5.137,"size":96,"ttl":62}]},
    {"hop":4,"result":[{"from":"2001:6c8:81:100::1a1","rtt":4.867,"size":96,"ttl":252},{"from":"2001:6c8:81:100::1a1","rtt":18.54,"size":96,"ttl":252},{"from":"2001:6c8:81:100::1a1","rtt":4.106,"size":96,"ttl":252}]}, 
    {"hop":5,"result":[{"from":"2001:6c8:40::1e","rtt":4.761,"size":96,"ttl":250},{"from":"2001:6c8:40::1e","rtt":4.698,"size":96,"ttl":250},{"from":"2001:6c8:40::1e","rtt":5.275,"size":96,"ttl":250}]},
    {
      "hop": 6,
      "result": [
        {
          "x": "*"
        },
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
  ],
  "size": 48,
  "src_addr": "2a05:f6c7:3853:0:eade:27ff:fe69:dd4e",
  "timestamp": 1665722169,
  "type": "traceroute"
}
@jelu
Member

jelu commented Nov 16, 2022

This looks more like a bug in the Atlas code to me. Have you reported this to RIPE?

For example:

    {
      "hop": 2,
      "result": [
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }

Should really be:

    {
      "hop": 2,
      "error": "sendto failed: Network is unreachable"
    }

And that would be correct according to the documentation for v4400.

@jmeggitt
Author

I was generally thinking the same thing. However, one complication is how to handle an error that occurs in a later reply. If a hop already has one or two valid replies, then it may not make sense to mark the entire hop as having errored.

    {
      "hop": 6,
      "result": [
        {
          "x": "*"
        },
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
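
One way to keep that distinction is to summarize each hop rather than promote the error to the hop level. The sketch below is only an illustration: the function name and the three status labels are hypothetical, and it accepts both the hop-level "error" form from the documentation and the reply-level form seen in the data.

```python
def hop_status(hop):
    """Summarize a hop as "error", "partial_error", or "ok".

    "error": a hop-level error, or every reply in the hop is an error.
    "partial_error": errors mixed with timeouts or real replies.
    "ok": no error anywhere in the hop.
    """
    if "error" in hop:
        return "error"
    replies = hop.get("result", [])
    errored = [reply for reply in replies if "error" in reply]
    if not errored:
        return "ok"
    if len(errored) == len(replies):
        return "error"
    return "partial_error"
```

With this, the hop 6 example here (one timeout followed by one error) would be reported as "partial_error" instead of being treated as a fully failed hop.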

@jmeggitt
Author

jmeggitt commented Nov 16, 2022

I have not yet raised an issue related to this field with RIPE Atlas. As it stands, there is a decent lead time on a fix being created, implemented, and deployed to their probes. Even if they could finish deploying a patch by tomorrow, nearly all of the previous measurement data would still be affected, so it would be helpful to have a way of identifying it until a more permanent solution is implemented.

I am also somewhat unclear on whether this is actually a bug in the probe software or a gap in the API documentation. If we assume it is intentional, then the three cases of Timeout, Error, and Reply would closely match the ping measurement results structure.
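
To make the three cases concrete, here is a rough sketch of how a single reply object could be classified. The class and function names are hypothetical, and the field names ("x", "error", "from", "rtt", "ttl", "size") are taken only from the examples in this issue, so real data may contain fields or combinations not handled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timeout:
    """A reply of the form {"x": "*"}."""


@dataclass
class Error:
    """A reply of the form {"error": "..."}."""
    message: str


@dataclass
class Reply:
    """A normal reply; fields are optional because not every reply has all of them."""
    source: Optional[str] = None
    rtt: Optional[float] = None
    ttl: Optional[int] = None
    size: Optional[int] = None


def parse_reply(raw):
    """Classify one raw reply dict as Error, Timeout, or Reply."""
    if "error" in raw:
        return Error(raw["error"])
    if "x" in raw:
        return Timeout()
    return Reply(
        source=raw.get("from"),
        rtt=raw.get("rtt"),
        ttl=raw.get("ttl"),
        size=raw.get("size"),
    )
```

Checking for "error" before "x" is an arbitrary choice in this sketch; none of the examples here show both keys in the same reply.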

@jelu
Member

jelu commented Nov 16, 2022

I'd like to not deviate from the documentation, please try and get them to update that first.

@jelu
Member

jelu commented Jan 9, 2023

@jmeggitt Any progress on updating the documentation?

@jelu
Member

jelu commented Jan 16, 2023

@jmeggitt ping?

@jmeggitt
Author

jmeggitt commented Jan 16, 2023

Sorry about that. I saw your previous ping, but got distracted with other work. I have notified them about the issue, but given their limited resources I would not expect to see any updates to the documentation anytime soon. I have not been pushing the issue beyond bringing it to their attention, so I imagine it is still in their backlog.

This issue can be found on RIPE-NCC/ripe-atlas-probe-measurements#14. However, if you want to get in contact with them or ask about the status of the issue then you will likely have more success directly emailing them at [email protected] to create a ticket in their system.
