JobCountByState query performance #106
Thanks for the report! Any chance you can run that SQL query with …
Thanks @ligustah. We're looking into this. If you're able to, run this:

```sql
REINDEX INDEX CONCURRENTLY river_job_prioritized_fetching_index;
```

It'll very likely fix the problem, although if a whole bunch of new jobs churn through the table, it may recur. We're looking into a longer-term solution.
This one's related to trying to find a solution for #106. After messing around with query plans a lot on the "count by state" query, I came to the conclusion that Postgres might actually be doing the right thing by falling back to a sequential scan, or at least only the minimally wrong thing. Even forcing the count query to run against a well-used index is a fairly slow operation when there are many jobs in the database. It's hard to provide specifics because caching affects the result so much (running the same query twice in a row can produce vastly different timings), but I've seen the index version take _longer_ than the seq scan in some cases.

So here, I'm proposing a radically different solution in which we add some infrastructure to the River UI API server that lets it run slow queries periodically in the background, then have API endpoints take advantage of those cached results instead of having to run each operation themselves, thereby making their responses ~instant. I've written it such that this caching only kicks in when we know we're working with a very large data set where it actually matters (currently defined as > 1M rows), with the idea being that for smaller databases we'll continue to run queries in-band so that results look as fresh and real-time as possible.

To support this, I've had to make some changes to the River UI API server/handler so that it has a `Start` function that can be invoked to start background utilities like the query cache. It's a considerable change, but I think it leaves us in a more sustainable place API-wise because we may want to add other background utilities later on, and returning an `http.Handler` isn't enough: even if you were to start goroutines from `NewHandler`, there'd be no way to stop those goroutines again, which is far from ideal (and problematic for anything that wants to check for leaks with goleak).

I'm also going to propose that we increase the default API endpoint timeout from 5 seconds to 10 seconds. When I load in 3 to 5 million job rows, I see count queries taking right around that 3 to 5 second range. Since the original number of 5 seconds was a little arbitrary anyway, it can't hurt to give those queries a little more leeway.

A problem that could still occur even with this proposal is that if a user starts River UI and then immediately hits the UI, there won't be a cached result yet, so the count query will go to the database directly, and that may still time out. I've only applied caching to the count query so far, but I've written the `QueryCacher` code such that it can cleanly support other queries if we care to add them.
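A minimal sketch of the background-cache idea described above, to make the shape concrete. This is not the actual `QueryCacher` from the PR; the generic type, the method names, and the 10-second refresh interval are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// queryCacher periodically runs one slow query in the background and caches
// its result so API handlers can serve it without touching the database.
type queryCacher[T any] struct {
	mu     sync.RWMutex
	result T
	ok     bool
	run    func(ctx context.Context) (T, error)
}

func newQueryCacher[T any](run func(ctx context.Context) (T, error)) *queryCacher[T] {
	return &queryCacher[T]{run: run}
}

// cachedRes returns the most recent cached result, if one exists yet.
func (c *queryCacher[T]) cachedRes() (T, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.result, c.ok
}

// start launches the refresh loop. Tying its lifetime to ctx means whoever
// starts it can also stop it again (goleak-friendly, per the PR description).
func (c *queryCacher[T]) start(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			if res, err := c.run(ctx); err == nil {
				c.mu.Lock()
				c.result, c.ok = res, true
				c.mu.Unlock()
			}
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Stand-in for the slow "count by state" query.
	counts := newQueryCacher(func(ctx context.Context) (map[string]int, error) {
		return map[string]int{"available": 12, "completed": 3400}, nil
	})
	counts.start(ctx, 10*time.Second)

	time.Sleep(50 * time.Millisecond) // let the first refresh land
	if res, ok := counts.cachedRes(); ok {
		fmt.Println(res) // an endpoint would serve this instead of querying
	}
}
```

The key design point mirrors the one above: the goroutine's lifetime is bound to a context passed in through a `Start`-style call, so there's always a way to shut it down again.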
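For the "> 1M rows" threshold, one cheap way to decide without paying for a full `COUNT(*)` is to read the planner's row estimate from `pg_class`. A sketch under the assumption that the job table is named `river_job` and that an estimate is acceptable for this decision; the actual PR may gate the cache differently.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/river_dev")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// reltuples is the planner's row estimate for the table. It's refreshed
	// by VACUUM/ANALYZE (and is -1 before the first one on Postgres 14+),
	// and reading it costs nothing compared to COUNT(*).
	var estimate float64
	err = conn.QueryRow(ctx,
		`SELECT reltuples FROM pg_class WHERE relname = 'river_job'`,
	).Scan(&estimate)
	if err != nil {
		log.Fatal(err)
	}

	const useCacheThreshold = 1_000_000
	fmt.Printf("~%d rows; use cached counts: %v\n",
		int64(estimate), estimate > useCacheThreshold)
}
```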
My test wasn't particularly scientific, but it seemed like it didn't make much of a difference.
Hey,
I'm really enjoying River so far, so thank you for that!
I was playing around with River UI, but on a table with just over 5 million jobs, the queries become so slow that the API requests time out and I just get a white screen (specifically the `JobCountByState` query). I haven't looked into it too much, so I'm not sure if it's just missing an index somewhere. This is happening both on my local M2 MacBook Pro and on a small AlloyDB instance.
EDIT: For some numbers, that query took 9s to complete on AlloyDB (which admittedly is under heavy use at the moment). The same query took just over 3s on my otherwise idle MacBook.