-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
record indexer errors #79
Conversation
service/src/query_processor.rs
Outdated
metrics::INDEXER_ERROR | ||
.with_label_values(&[&IndexerErrorCode::IE033.to_string()]) | ||
.inc(); | ||
|
||
e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool if indexer_error
would automatically do this.
Ideally, we'd replicate what https://github.com/graphprotocol/indexer/blob/main/packages/indexer-common/src/errors.ts#L181-L183 does:
- Any time there is an internal error, we call
indexer_error
(or anindexer_error!
macro), passing in the code and an optional cause). - This internally constructs an
IndexerError
and also increments a metric. - Whenever we call
indexer_error
, we should log the error as well, like it's done here: https://github.com/graphprotocol/indexer/blob/f20152a28bea60b0f2ed8113dfad2b45b9a500d4/packages/indexer-common/src/indexer-management/allocations.ts#L172-L174
Basically, an indexer error should allow the indexer to detect it via metrics or logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah cool idea!
I moved metrics recording and logs into indexerError::new
so they are automatically taken care of when an indexerError
is created.
I would also remove the QueryError
enum if it makes sense to you, I think the current variants can simply live as part of IndexerError
let dependencies = pkg | ||
.get("dependencies") | ||
.and_then(|d| d.as_table()) | ||
.expect("Parse package dependencies"); | ||
let indexer_native = dependencies.get("indexer-native").map(|d| { | ||
d.as_str() | ||
.expect("Parse indexer-service dependency version") | ||
.to_string() | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is only called once on startup, I would just make any errors here fatal, i.e. use expect and no IndexerError
. Only errors while running are worth tracking in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused 🧐 I didn't add IndexerError
here, the only changes were from unwrap
to expect
for a better error message when it fails fatally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the expect
messages changes and removed IndexerError
from helper fn read_manifest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused 🧐 I didn't add
IndexerError
here, the only changes were fromunwrap
toexpect
for a better error message when it fails fatally
Ah, sorry. I sometimes comment on things even if they were not introduced in the PR itself. Sometimes, a PR is a good opportunity to improve code that was there before.
56c6517
to
68c4880
Compare
Pull Request Test Coverage Report for Build 6704892743
💛 - Coveralls |
common/src/metrics/mod.rs
Outdated
let m = IntCounterVec::new( | ||
Opts::new("indexer_error", "Indexer errors observed over time") | ||
.namespace("indexer") | ||
.subsystem("service"), | ||
&["code"], | ||
) | ||
.expect("Failed to create indexer_error"); | ||
register(Box::new(m.clone())).expect("Failed to register indexer_error counter"); | ||
m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be shorter but is mostly equivalent. I think the namespace
and subsystem
(in other words the indexer_service_
prefix that they result in) is different from what the current indexer-service does, so I'd drop those either way.
register_int_counter_vec!(
"indexer_error",
"Indexer errors observed over time",
&["code"]
);
common/src/metrics/mod.rs
Outdated
#[allow(dead_code)] | ||
pub static REGISTRY: Lazy<Registry> = Lazy::new(Registry::new); | ||
|
||
#[allow(dead_code)] | ||
pub fn register_metrics(registry: &Registry, metrics: Vec<Box<dyn Collector>>) { | ||
for metric in metrics { | ||
registry.register(metric).expect("Cannot register metrics"); | ||
} | ||
} | ||
|
||
/// Register indexer error metrics in Prometheus registry | ||
pub fn register_indexer_error_metrics() { | ||
register_metrics(®ISTRY, vec![Box::new(INDEXER_ERROR.clone())]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to just use the default registry instead of creating our own. I don't recall why we're using a custom one in indexer-service (and also graph-node), but I know that the default registry works well with autometrics
which I believe is used in indexer-service-rs. That's why I think using register_int_counter_vec!
is probably a good idea, because that uses the default registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion! I updated the labeling for both indexer_error
and query
metrics, used the macros, and removed the custom registry fb870a7. The code looks a lot cleaner than before.
The current custom registry should work fine with autometrics
, though we are only using it for encode_global_metrics
and global_metrics_exporter
. I think it could be interesting to use autometrics
for some functions to measure performance, like for execute_paid_query
and execute_free_query
. I don't think we have those measurements in TS but do lmk if you think adding some of those within this PR is a good idea
service/src/metrics/mod.rs
Outdated
/// Start the basic metrics for indexer services | ||
#[allow(dead_code)] | ||
pub fn start_metrics() { | ||
pub fn register_query_metrics() { | ||
register_metrics( | ||
®ISTRY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I'd stick to the default registry.
Err(e) => { | ||
IndexerError::new( | ||
IndexerErrorCode::IE031, | ||
Some(IndexerErrorCause::new( | ||
"Failed to parse receipt for a paid query", | ||
)), | ||
); | ||
|
||
return Err(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This combination of QueryError
and IndexerError
makes sense — we're tracking the indexer error internally but return the query error to the client. 👍🏻
Some(IndexerErrorCause::new( | ||
"Failed to execute a paid subgraph query to graph node", | ||
)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, where we essentially drop the error returned by Graph Node and just log a static string makes me wonder if we should have an IndexerErrorCause
that supports taking other errors (e.g. the std::error::Error
trait) that are then included in the error message.
The TypeScript indexer-service/-common does this by having IndexerErrorCause
be an arbitraty type. Easy in TypeScript of course. Here, what we could do is something like
use std::fmt::{self};
use anyhow::anyhow; // 1.0.75
#[derive(Debug)]
enum IndexerErrorCode {
IE001,
}
struct IndexerErrorCause<T: ToString>(T);
impl<T: ToString> From<T> for IndexerErrorCause<T> {
fn from(t: T) -> Self {
Self(t)
}
}
impl<T: ToString> fmt::Display for IndexerErrorCause<T> {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.0.to_string())
}
}
struct IndexerError<T: ToString> {
code: IndexerErrorCode,
cause: IndexerErrorCause<T>
}
impl<T> IndexerError<T>
where T: ToString + Into<IndexerErrorCause<T>>
{
pub fn new(code: IndexerErrorCode, cause: T) -> Self {
Self { code, cause: cause.into() }
}
}
impl<T: ToString> fmt::Display for IndexerError<T> {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{:?}: {}", self.code, self.cause.to_string())
}
}
fn main() {
println!("{}", IndexerError::new(IndexerErrorCode::IE001, "str cause"));
println!("{}", IndexerError::new(IndexerErrorCode::IE001, anyhow!("anyhow cause")));
}
You can see this in the playground here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5ff50fb1ffb9afe9574f1087ebb581fa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Display
instead of ToString
might be better, I'm not sure which one is more common in situations like this. Perhaps @Theodus knows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ToString
is generally preferred. When you implement Display
, you get the impl of ToString
automatically (docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it seems like an alternative would be cause: anyhow::Error
. Since it looks like you're recreating a subset of what anyhow::Error
does, but with the additional overhead of the error type not getting erased. Unless you want separate error types in IndexerError
for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review here 🧠 !!
Looking at the TS implementation of subgraph query responses, I realized that we have been incorrectly only counting 200
response from graph node as Successful queries, whereas TS counts all responses when the request doesn't error out (It seems like to me).
When we get error for executing the query, we get QueryError
which can be passed into IndexerErrorCause
directly as IndexerErrorCause
is currently defined as pub struct IndexerErrorCause(Box<dyn Error + Send + Sync>);
.
Updating it to anyhow::Error
could be nice, but I think the current struct accomplishes what we want to achieve here.
For now, I updated in e56b400 so the match statement will count FAILED_QUERIES
only if execute_paid_query
fails and it would log out and respond with "Failed to execute a paid subgraph query to graph node: " + QueryError
back to the user
lmk what you think:)
let dependencies = pkg | ||
.get("dependencies") | ||
.and_then(|d| d.as_table()) | ||
.expect("Parse package dependencies"); | ||
let indexer_native = dependencies.get("indexer-native").map(|d| { | ||
d.as_str() | ||
.expect("Parse indexer-service dependency version") | ||
.to_string() | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused 🧐 I didn't add
IndexerError
here, the only changes were fromunwrap
toexpect
for a better error message when it fails fatally
Ah, sorry. I sometimes comment on things even if they were not introduced in the PR itself. Sometimes, a PR is a good opportunity to improve code that was there before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this adds better error reporting and still shaves of 20 lines of code 🔥
Briefly reviewed the current error docs. For further recoding of the metrics, I'm thinking to move
indexer_error
metric and metrics services intoindexer_common
and keep receipt and cost model metrics withinindexer_service
. For now I simply added recording of indexer_error at places without much refacotoring.New in indexer-service-rs
Recorded in indexer-service-rs
TODO: add to indexer-service-rs
Not needed in indexer-serivce
Not used anywhere in typescript impl -> Deprecate/remove/refactor
resolves #75, part of #4