Breaking: Rename old InsertMany, add new version with return values #589

bgentry · 2024-09-10T20:40:32Z

Responding to some customer feedback, this PR reimplements InsertMany / InsertManyTx using a new query with RETURNING so the inserted rows are returned. The old COPY FROM methods are retained for extremely high-volume use cases which do not care about return values, but are renamed to InsertManyFast and InsertManyFastTx. While these new methods are likely to be slower, for most use cases it won't matter and the added usability-by-default of this method is worthwhile IMO.

While this was initially implemented using a multirow values syntax and associated query builder, the current approach uses unnest along with array parameters to enable using RETURNING with an insert query nearly the same as what we already had for the database/sql driver.

This is a breaking change. However, my expectation is that most users were not checking the count value and as such will not have to change anything when they upgrade.

Pinging @krhubert in case he wants to give some feedback on this

TODO:

changelog entry
doc updates

riverdriver/riverpgxv5/river_pgx_v5_driver.go

bgentry · 2024-09-10T21:39:40Z

client.go

+// Job uniqueness is not respected when using InsertMany due to unique inserts
+// using an internal transaction and advisory lock that might lead to
+// significant lock contention. Insert unique jobs using Insert instead.
+func (c *Client[TTx]) InsertMany(ctx context.Context, params []InsertManyParams) ([]*rivertype.JobInsertResult, error) {


Had a thought that maybe instead of JobInsertResult being a plain struct we could consider using an interface. That would allow for expandability, particularly if we override these on the Pro side to return additional stuff at some point.

Need to think more about concrete use cases for that though.

bgentry · 2024-09-10T21:40:22Z

client.go

+		}
+
+		if param.InsertOpts != nil {
+			// UniqueOpts aren't support for batch inserts because they use PG


I have some ideas on making this work for multi insert but it'd probably be better for a video call brainstorming session.

brandur · 2024-09-12T03:36:46Z

Just a couple points on this one:

It's not really in the description, but I'm guessing the main driver here is that the inserted job rows are needed here? OOC, what was the use case for that?
Did you see the JobInsertFastMany query that's already in there for database/sql (because that doesn't support copy/from). A RETURNING * could easily be added to it, and it accomplishes this same thing in a tiny fraction of the LOCs, without all the string gluing (which is unsightly, but is also a security liability).

river/riverdriver/riverpgxv5/internal/dbsqlc/river_job.sql

Lines 226 to 251 in 7d98386

    
           -- name: JobInsertFastMany :execrows 
        
           INSERT INTO river_job( 
        
               args, 
        
               kind, 
        
               max_attempts, 
        
               metadata, 
        
               priority, 
        
               queue, 
        
               scheduled_at, 
        
               state, 
        
               tags 
        
           ) SELECT 
        
               unnest(@args::jsonb[]), 
        
               unnest(@kind::text[]), 
        
               unnest(@max_attempts::smallint[]), 
        
               unnest(@metadata::jsonb[]), 
        
               unnest(@priority::smallint[]), 
        
               unnest(@queue::text[]), 
        
               unnest(@scheduled_at::timestamptz[]), 
        
               unnest(@state::river_job_state[]), 
        
               -- lib/pq really, REALLY does not play nicely with multi-dimensional arrays, 
        
               -- so instead we pack each set of tags into a string, send them through, 
        
               -- then unpack them here into an array to put in each row. This isn't 
        
               -- necessary in the Pgx driver where copyfrom is used instead. 
        
               string_to_array(unnest(@tags::text[]), ',');

Did you consider inverting the functions such that returning function gets a new name instead of the current InsertMany getting switched to a new one? If I was designing from first principles, I'm not sure which of the two should get the main InsertMany name, but I could make an argument that it should be the non-returning variant because that's faster, and a lot of people won't need return job values anyway. And when not thinking about it from first principles, it'd certainly be an advantage to not make a breaking change, so we should weigh that fairly heavily in favor of InsertMany staying the non-returning variant.

bgentry · 2024-09-12T16:28:27Z

It's not really in the description, but I'm guessing the main driver here is that the inserted job rows are needed here? OOC, what was the use case for that?

Yes, it's to make it so that jobs added with InsertMany (i.e. workflows) have the same UX affordances as individual inserts. Doing this in a single bulk insert is dramatically better than asking users to insert the rows one-at-a-time, because that approach involves a round trip per insert.

There are lots of use cases where people want to log or record the job ID after it's been inserted so that they can track it, monitor it, or otherwise interact with the job programmatically after insertion.

Did you see the JobInsertFastMany query that's already in there for database/sql (because that doesn't support copy/from). A RETURNING * could easily be added to it, and it accomplishes this same thing in a tiny fraction of the LOCs, without all the string gluing

I gave this unnest approach a try, and pretty quickly ran into the fact that when using an array of a custom enum type (river_job_state[]), pgx requires that the OID be looked up and registered as part of each conn upon creation. That would require it to be added to any pgxpools' Configs prior to them being created. That's trivial to do in an individual app, but asking everybody using a library like River to do that is quite a bit more complicated and will degrade UX. We'd have to make people set up an AfterConnect() on the pool config.

The workaround is to pass an array of strings, and then cast it to river_job_state in the query. Not ideal, but probably not worse in efficiency then building up and sending a large query string.

I also needed to keep the workaround for tags that you have for dbsql, which is to join them with commas, pass a single-level text array, and then split them in the query. The reason is that unnest fully flattens multi-dimensional arrays and I didn't find a clean way to single-level unnest that didn't rely on custom plpgsql.

(which is unsightly, but is also a security liability).

Maybe I'm missing something, but I'm not seeing the security liability aspect of this. This code is was in its previous form very deliberately not string interpolating arguments—it's just building up the positional parameters in the query string. This is the same thing any ORM (ActiveRecord, Ecto, etc) would do in order to do a multirow values syntax insert. Of course sqlc doesn't make that possible, but given this is a useful feature to have access to it's unfortunately another major drawback of sqlc as it exists today.

The code maintenance thing is a much bigger concern, so I'm more than willing to try the unnest approach for now to see if we run into problems. We can always introduce more complex code as an optimization later on if needed. ~Here is that branch if you want to compare` The PR has been updated to use this approach.

Did you consider inverting the functions such that returning function gets a new name instead of the current InsertMany getting switched to a new one? If I was designing from first principles, I'm not sure which of the two should get the main InsertMany name, but I could make an argument that it should be the non-returning variant because that's faster, and a lot of people won't need return job values anyway. And when not thinking about it from first principles, it'd certainly be an advantage to not make a breaking change, so we should weigh that fairly heavily in favor of InsertMany staying the non-returning variant.

I did consider this, and landed on this approach for these reasons:

I would like the UX between the default paths for Insert and InsertMany to be as aligned as possible. In particular I have some ideas for bringing unique job support to multi-insert, and if possible I want the API to allow me to say whether or not an insert was skipped due to uniqueness. I would also like to get to a point where the code paths for these two cases are the same, i.e. we don't need to maintain separate queries for the single-insert case and can just internally wrap everything as a multi insert that happens to only have one row.
The cases where somebody is going to care about the perf difference between COPY vs another approach are likely very minimal—probably only where you're going to insert (many?) thousands of rows at once. Just guessing but I'd say that's probably <.1% of usage. Given that, I believe the far-less-common path should be the one with the weird name.
The actual impact to customers from this breaking change is probably quite small, because there is no practical need to ever check that the integer count returned by InsertMany actually matches the size of the array you gave it. Maybe some people were using it instead of len(rows) but I'd wager it's super unlikely.
We are still pre v1 for a reason, to tease out mistakes like this and correct them while we can.

I'm still open to changing the approach, but felt that the benefits pretty strongly outweighed the downsides here.

This new query allows many rows to be inserted (up to ~7280 as of now) while also allowing the new records to be returned. While this is not quite as fast as the `COPY FROM` option when loading many rows, it provides a better UX for the vast majority of use cases. It _does_ require that we ditch sqlc for this one query, because sqlc does not support the multirow values insert syntax due to the dynamic nature of the param placeholders.

This adds new implementations of `InsertMany` / `InsertManyTx` that use the multirow `VALUES` syntax to allow the new rows to be returned upon insert. The alternative `COPY FROM ` implementation has been renamed to `InsertManyFast` / `InsertManyFastTx`. The expectation is that these forms will only be needed in cases where an extremely large number of records is being inserted simultaneously, whereas the new form is more user-friendly for the vast majority of other cases.

brandur

Nice. Thanks for those changes!

brandur · 2024-09-13T00:29:09Z

riverdriver/riverdatabasesql/river_database_sql_driver.go

+
+		defaultObject := "{}"
+
+		insertJobsParams.Args[i] = valutil.ValOrDefault(string(params.EncodedArgs), defaultObject)


Should maybe pull this trick into the other JobInsertMany in here too.

brandur · 2024-09-13T02:23:33Z

riverdriver/river_driver_interface.go

@@ -118,6 +118,7 @@ type Executor interface {
 	JobInsertFast(ctx context.Context, params *JobInsertFastParams) (*rivertype.JobRow, error)
 	JobInsertFastMany(ctx context.Context, params []*JobInsertFastParams) (int, error)
 	JobInsertFull(ctx context.Context, params *JobInsertFullParams) (*rivertype.JobRow, error)
+	JobInsertManyReturning(ctx context.Context, params []*JobInsertFastParams) ([]*rivertype.JobRow, error)


This one's still kind of a "job insert fast" variant, so that should probably still be in there.

I wonder too if we should try to match up which one gets the suffix between the client and driver a little bit more. In the client, the "fast" version gets a suffix, but in the driver here the "returning" variant gets the suffix (i.e. so they're flipped).

Maybe:

JobInsertFastMany (called from InsertManyFast) -> JobInsertFastManyNoReturning

JobInsertManyReturning (called from InsertMany) -> JobInsertFastMany

Good call, renamed them as suggested ✌️

…iverqueue#589) * add JobInsertManyReturning to drivers This new query allows many rows to be inserted (up to ~7280 as of now) while also allowing the new records to be returned. While this is not quite as fast as the `COPY FROM` option when loading many rows, it provides a better UX for the vast majority of use cases. It _does_ require that we ditch sqlc for this one query, because sqlc does not support the multirow values insert syntax due to the dynamic nature of the param placeholders. * rename InsertMany to InsertManyFast * new InsertMany/InsertManyTx with return values This adds new implementations of `InsertMany` / `InsertManyTx` that use the multirow `VALUES` syntax to allow the new rows to be returned upon insert. The alternative `COPY FROM ` implementation has been renamed to `InsertManyFast` / `InsertManyFastTx`. The expectation is that these forms will only be needed in cases where an extremely large number of records is being inserted simultaneously, whereas the new form is more user-friendly for the vast majority of other cases. * renames

bgentry requested a review from brandur September 10, 2024 20:40

bgentry commented Sep 10, 2024

View reviewed changes

bgentry force-pushed the bg-job-insert-many-returning branch from ba2bccd to 772eba5 Compare September 12, 2024 19:07

bgentry added 2 commits September 12, 2024 14:10

rename InsertMany to InsertManyFast

f93246e

bgentry force-pushed the bg-job-insert-many-returning branch 2 times, most recently from b79871e to b5764d0 Compare September 12, 2024 19:17

bgentry marked this pull request as ready for review September 12, 2024 19:18

bgentry force-pushed the bg-job-insert-many-returning branch from b5764d0 to 73264a9 Compare September 12, 2024 20:34

bgentry force-pushed the bg-job-insert-many-returning branch from 73264a9 to 6214038 Compare September 12, 2024 20:56

brandur approved these changes Sep 13, 2024

View reviewed changes

renames

c78055b

bgentry enabled auto-merge (squash) September 13, 2024 03:00

bgentry merged commit 67c8a8a into master Sep 13, 2024
14 checks passed

bgentry deleted the bg-job-insert-many-returning branch September 13, 2024 03:00

bgentry mentioned this pull request Sep 13, 2024

Add middleware system for jobs #584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Breaking: Rename old InsertMany, add new version with return values #589

Breaking: Rename old InsertMany, add new version with return values #589

bgentry commented Sep 10, 2024 •

edited

Loading

bgentry Sep 10, 2024

bgentry Sep 10, 2024

brandur commented Sep 12, 2024

bgentry commented Sep 12, 2024 •

edited

Loading

brandur left a comment

brandur Sep 13, 2024

brandur Sep 13, 2024

bgentry Sep 13, 2024


		defaultObject := "{}"

		insertJobsParams.Args[i] = valutil.ValOrDefault(string(params.EncodedArgs), defaultObject)

Breaking: Rename old InsertMany, add new version with return values #589

Breaking: Rename old InsertMany, add new version with return values #589

Conversation

bgentry commented Sep 10, 2024 • edited Loading

bgentry Sep 10, 2024

Choose a reason for hiding this comment

bgentry Sep 10, 2024

Choose a reason for hiding this comment

brandur commented Sep 12, 2024

bgentry commented Sep 12, 2024 • edited Loading

brandur left a comment

Choose a reason for hiding this comment

brandur Sep 13, 2024

Choose a reason for hiding this comment

brandur Sep 13, 2024

Choose a reason for hiding this comment

bgentry Sep 13, 2024

Choose a reason for hiding this comment

bgentry commented Sep 10, 2024 •

edited

Loading

bgentry commented Sep 12, 2024 •

edited

Loading