Longevity Migrations Ideas for Future Directions
The current implementation (release 0.26) of longevity migrations is fairly bare-bones and satisfies two basic needs. First, it can migrate the schema and database backing your domain. Second, it is a platform we can build on to produce more sophisticated migrations. The implementation can be improved along two major dimensions: expressivity and performance.
This document outlines some ideas I have for improving and extending the existing migrations codebase.
Right now, the old schema is torn down as part of the process of running the migration. Extract this as a separate step, so that users can continue to run read-only applications against the initial model, and not worry about their data being dropped underneath them as one of the final steps of completing the migration.
Ideally, this would be exposed as a flag to the sbt migration task, such as `sbt "migrateSchema V1_to_V2 --retainInitialSchema"`, together with a new task such as `sbt "dropSchema V1"` that can be run at the user's leisure.
Presently, there are exactly three kinds of migration steps: `Drop`, `Create`, and `Update`. The first two remove and add persistent types to the domain model, respectively. The `Update` step migrates the data from the initial domain model into the final domain model according to a Scala function that migrates the data on a per-object basis.
An `Unchanged` migration step would be a clear option to add to the existing steps. This would save the user the hassle of having to define what is essentially an identity function, but actually isn't, because the function has to convert from, say, `version1.User` to `version2.User`. Even if these have exactly the same shape, an identity function clearly won't do, because the two users have different types.
Aside from making life slightly easier for the user, this would expose a very nice optimization in some cases - the table could just be renamed instead of having to apply the conversion function in Scala memory. (Note that along with renaming the table, we may well have to drop and create some database keys and indexes.)
This optimization would not be available for Cassandra, because you cannot rename a table in Cassandra. Also note that if the structure of the primary key changes, the same table cannot be reused and will have to be rebuilt. However, the Cassandra back end would probably still be able to perform an `Unchanged` migration without loading the data into Scala memory. We'd probably have to use something like the cqlsh `COPY` command.
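To make the idea concrete, here is a minimal sketch of how an `Unchanged` step could sit alongside the existing steps. Everything in it is an assumption made for illustration: the `Step` ADT, the `plan` function, and the plan descriptions are hypothetical and are not longevity's actual classes.

```scala
object UnchangedStepSketch {
  // Two structurally identical but distinct types, standing in for
  // version1.User and version2.User.
  object version1 { final case class User(username: String, email: String) }
  object version2 { final case class User(username: String, email: String) }

  // Hypothetical step ADT, for illustration only.
  sealed trait Step
  final case class Drop(typeName: String) extends Step
  final case class Create(typeName: String) extends Step
  final case class Update[P1, P2](typeName: String, f: P1 => P2) extends Step
  final case class Unchanged(typeName: String) extends Step

  // What the user has to write today: an "identity" that still has to
  // rebuild the object, because the two User types differ.
  val copyUser: version1.User => version2.User =
    u => version2.User(u.username, u.email)

  val today: Step = Update("User", copyUser)
  val proposed: Step = Unchanged("User")

  // A back end could choose a cheaper plan when it sees Unchanged.
  def plan(step: Step, canRenameTables: Boolean): String = step match {
    case Unchanged(name) if canRenameTables => s"rename table for $name"
    case Unchanged(name) => s"copy rows for $name inside the database"
    case Update(name, _) => s"load each $name, apply f, write the result"
    case Create(name) => s"create table for $name"
    case Drop(name) => s"drop table for $name"
  }
}
```

The `canRenameTables` flag stands in for the back-end difference described above: it would be false for Cassandra, which falls back to a database-side copy.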
One kind of migration that is not currently possible is one that merges two persistent types into a single persistent type. Think merging two entity aggregates into a single entity aggregate. For instance, maybe you have `User` and `UserProfile` persistent types, and want to merge them into a single persistent type `User`.
This is certainly possible to do with longevity, but not within a single migration. One possibility here would be to provide an `InitialAwareUpdate` step, hopefully with a better name than that, whose function also takes a `Repo[M1]` as an argument. So where we have:
Migration.Builder.update[P1, P2](f: P1 => P2)
We could also have:
Migration.Builder.updateInitialAware[P1, P2](f: (P1, Repo[M1]) => P2)
One complication here: we probably want this function to return an `IO[P2]` (or a more generic `F[P2]`), so that the user can actually use the `Repo` without resorting to unsafe blocking calls.
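As a rough sketch of the `User` and `UserProfile` merge described above, the following uses `scala.concurrent.Future` as a stand-in for `IO` and a hypothetical `InitialRepo` trait in place of `Repo[M1]`. None of these names come from longevity; they are assumptions chosen to keep the example self-contained.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object InitialAwareSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  object v1 {
    final case class User(username: String, email: String)
    final case class UserProfile(username: String, bio: String)
  }
  object v2 {
    final case class User(username: String, email: String, bio: Option[String])
  }

  // Hypothetical read-only view of the initial model (stand-in for Repo[M1]).
  trait InitialRepo {
    def profileFor(username: String): Future[Option[v1.UserProfile]]
  }

  // In-memory implementation, for the demo only.
  class InMemoryInitialRepo(profiles: Map[String, v1.UserProfile]) extends InitialRepo {
    def profileFor(username: String): Future[Option[v1.UserProfile]] =
      Future.successful(profiles.get(username))
  }

  // The function a user would hand to an initial-aware update step:
  // it looks up the matching profile and folds it into the new User.
  val mergeUser: (v1.User, InitialRepo) => Future[v2.User] = (user, repo) =>
    repo.profileFor(user.username).map { profile =>
      v2.User(user.username, user.email, profile.map(_.bio))
    }

  // Demo harness. Blocking with Await happens only here, at the edge.
  def run(): v2.User = {
    val repo = new InMemoryInitialRepo(Map("ann" -> v1.UserProfile("ann", "likes graphs")))
    Await.result(mergeUser(v1.User("ann", "ann@example.com"), repo), 5.seconds)
  }
}
```

Because `mergeUser` returns its result inside the effect, the repository lookup composes with `map` and the user function itself never blocks.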
Another use case would be to introduce a new read view on our data. As an overly simplistic example, suppose we wanted to introduce `UserView`, that exposed only a subset of the data in a `User`. We can accomplish this in longevity, but not in a migration. To assist in situations like this, we can give the user access to the final repository, so they can read and update rows in the final model. Something like:
Migration.Builder.updateFinalAware[P1, P2](f: (P1, Repo[M2]) => IO[P2])
We need to make clear to the user that the data available via `Repo[M2]` is in an intermediate state, as we are in the process of producing that data.
We should probably also be able to combine these two ideas, so the user has access to both the initial and final states. We could also consider a case where the migration is not actually one-to-one, but where the user has free rein to write to the `M2` database based on each row it sees from `M1`. Something like:
Migration.Builder.updateFreely[P1, P2](f: (P1, Repo[M1], Repo[M2]) => IO[Unit])
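The following sketch illustrates the non-one-to-one case using the `UserView` example: each initial `User` row produces two rows in the final model. It is a simplification under stated assumptions: `Future` stands in for `IO`, the hypothetical `FinalRepo` class stands in for `Repo[M2]`, and the `Repo[M1]` argument is left out for brevity.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object UpdateFreelySketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  final case class UserV1(username: String, email: String, passwordHash: String)

  sealed trait FinalRow
  final case class UserV2(username: String, email: String, passwordHash: String) extends FinalRow
  final case class UserView(username: String) extends FinalRow

  // Hypothetical write access to the final model (stand-in for Repo[M2]).
  class FinalRepo {
    private val rows = mutable.ArrayBuffer.empty[FinalRow]
    def create(row: FinalRow): Future[Unit] =
      Future.successful(synchronized { rows += row; () })
    def snapshot: List[FinalRow] = synchronized(rows.toList)
  }

  // The user's function returns no P2. It writes whatever it
  // wants into the final model for each initial row it is shown.
  val migrateUser: (UserV1, FinalRepo) => Future[Unit] = (u, repo) =>
    for {
      _ <- repo.create(UserV2(u.username, u.email, u.passwordHash))
      _ <- repo.create(UserView(u.username))
    } yield ()

  // Demo harness; blocking only at the edge.
  def run(): List[FinalRow] = {
    val repo = new FinalRepo
    Await.result(migrateUser(UserV1("ann", "ann@example.com", "x1"), repo), 5.seconds)
    repo.snapshot
  }
}
```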
This probably wouldn't be too hard. It would be a good idea to provide implicits for `cats.Cartesian`, `cats.Applicative`, and things like that, for any `longevity.effect.Effect`.
We could create a DSL for describing the migration of a persistent according to property updates. Something along these lines has already been proposed here, but for the purpose of doing in-place updates, or updates that don't need to load persistent objects into memory first:
The DSL would be very similar for migrations, except that the left-hand sides would all involve properties of `P2`, and the right-hand sides would involve constants and properties of `P1`. We would need to make sure that the DSL description of the update step covers the entire left-hand side, i.e., all properties of `P2`. We probably also want to make sure that each property of `P2` is mentioned only once in the update. (We could probably carry over any missing `P2` properties from `P1` for the sake of brevity.)
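The coverage rules in the previous paragraph can be sketched as a small validation function. This is only an illustration: properties are modeled as plain strings, and `Rhs`, `Assign`, and `validate` are hypothetical names, not part of any existing or proposed longevity DSL.

```scala
object PropertyUpdateSketch {
  // Right-hand sides: a constant, or a property of the initial type P1.
  sealed trait Rhs
  final case class Const(value: String) extends Rhs
  final case class InitialProp(name: String) extends Rhs

  // One assignment: a property of P2 on the left, an Rhs on the right.
  final case class Assign(finalProp: String, rhs: Rhs)

  // Checks that every P2 property gets exactly one value, carrying over
  // same-named P1 properties when they are not assigned explicitly.
  def validate(
      p1Props: Set[String],
      p2Props: Set[String],
      assigns: List[Assign]
  ): Either[String, Map[String, Rhs]] = {
    val duplicated = assigns.groupBy(_.finalProp).filter(_._2.size > 1).keys.toList.sorted
    val unknownFinal = assigns.map(_.finalProp).filterNot(p2Props).distinct.sorted
    val unknownInitial =
      assigns.map(_.rhs).collect { case InitialProp(n) if !p1Props(n) => n }.distinct.sorted
    val explicit: Map[String, Rhs] = assigns.map(a => a.finalProp -> a.rhs).toMap
    val unassigned = p2Props -- explicit.keySet
    val carried: Map[String, Rhs] =
      unassigned.intersect(p1Props).map(p => p -> (InitialProp(p): Rhs)).toMap
    val missing = (unassigned -- carried.keySet).toList.sorted

    if (duplicated.nonEmpty) Left("assigned more than once: " + duplicated.mkString(", "))
    else if (unknownFinal.nonEmpty) Left("not a property of P2: " + unknownFinal.mkString(", "))
    else if (unknownInitial.nonEmpty) Left("not a property of P1: " + unknownInitial.mkString(", "))
    else if (missing.nonEmpty) Left("no value for P2 properties: " + missing.mkString(", "))
    else Right(explicit ++ carried)
  }
}
```

For example, with `P1` properties `username` and `email` and a `P2` that adds `status`, a single assignment of `status` to a constant is enough: the other two properties are carried over by name.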
One advantage of expressing updates in this way is that it allows for in-place migrations: there is no need to load the objects into memory, because we can do the migration work directly in the database. Presently, this would only be possible with the MongoDB back end, since the other back ends encode the persistent object as JSON in a regular text column in the database. But this would be a possibility with a property-based JDBC or Cassandra back end, as described here:
- https://github.com/longevityframework/longevity/issues/47
- https://github.com/longevityframework/longevity/issues/46
I'm particularly fond of the idea of a property-based back end for Cassandra right now, and it's actually within the realm of possibility that I might take that task on.
Note that we could also implement an in-memory version of a property-based update that would be functional for all back ends.
To me, this is probably the most exciting direction that we could take longevity migrations. With traditional SQL migration scripts, we need to stop the servers running against the initial version; apply the migration scripts from initial to final; and then restart the servers running against the final version. Sometimes the migration scripts run quickly, sometimes they don't. This "hard stop" approach is a bit of a challenge for systems that need to be always up, and a lot of engineering effort goes into making these things as transparent as possible.
With the functional migration approach we have here, nothing stops us from continuing to run servers against the initial version while the migration is taking place. We can simply replace the servers that run the plain initial version with equivalent servers that are aware that the migration is taking place. This would require just a slight modification: every update or delete would also need to perform the corresponding update or delete in the final schema. The whole process would look something like this:
- Vanilla version of initial servers running
- Create migration schema for initial version and create schema for final version
- Replace vanilla version of initial servers with migration-aware version of initial servers
  - This can actually be performed with overlap, so that there is no downtime
- Start the meat of the migration, updating all the rows
  - If any rows are updated by the servers in the meantime, these changes will be mirrored in the final schema
- Wait until all the data is migrated
- Shut down initial version servers
- Kick up final version servers
- Finish up the migration by dropping the initial version schema
It's a little involved, but we could have an SBT task to help walk the user through all the steps. And the gains are great. The only downtime is when we shut down the initial version servers and kick up the final version servers.
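The mirroring behavior of the migration-aware servers can be sketched with in-memory stores. This is a single-threaded illustration under stated assumptions: `Store`, `MigrationAwareRepo`, and `migrateAll` are hypothetical names, and a real implementation would also have to deal with writes that race against the bulk pass.

```scala
import scala.collection.mutable

object SoftStopSketch {
  final case class UserV1(username: String, email: String)
  final case class UserV2(username: String, emailAddress: String, active: Boolean)

  // The per-object migration function from the Update step.
  val migrate: UserV1 => UserV2 = u => UserV2(u.username, u.email, active = true)

  // Stand-in for a database table keyed by username.
  class Store[A] {
    private val rows = mutable.Map.empty[String, A]
    def put(key: String, a: A): Unit = rows.update(key, a)
    def remove(key: String): Unit = { rows.remove(key); () }
    def all: Map[String, A] = rows.toMap
  }

  // What a migration-aware initial server does differently: every
  // update or delete is mirrored into the final schema.
  class MigrationAwareRepo(initial: Store[UserV1], fin: Store[UserV2]) {
    def update(u: UserV1): Unit = {
      initial.put(u.username, u)
      fin.put(u.username, migrate(u))
    }
    def delete(username: String): Unit = {
      initial.remove(username)
      fin.remove(username)
    }
  }

  // The "meat" of the migration: one pass over every initial row.
  def migrateAll(initial: Store[UserV1], fin: Store[UserV2]): Unit =
    initial.all.foreach { case (key, u) => fin.put(key, migrate(u)) }

  def run(): Map[String, UserV2] = {
    val initial = new Store[UserV1]
    val fin = new Store[UserV2]
    initial.put("ann", UserV1("ann", "ann@old.example"))
    initial.put("bob", UserV1("bob", "bob@old.example"))

    val repo = new MigrationAwareRepo(initial, fin)
    repo.update(UserV1("ann", "ann@new.example")) // write before the bulk pass
    migrateAll(initial, fin)
    repo.delete("bob") // delete after the bulk pass
    fin.all
  }
}
```

After `run()`, the final store holds only the updated `ann` row: the update made before the bulk pass and the delete made after it are both reflected in the final schema.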
We can take the soft stop migrations one step further if the update steps have an inverse function too. Maybe something like:
Migration.Builder.updateWithInverse[P1, P2](f: P1 => P2, g: P2 => P1)
In this way, we can expand the steps of the soft-stop migration above to include a phase where initial version servers and final version servers are running simultaneously. There would then have to be two versions of the final version servers: one that is aware of the ongoing migration, and one that is not. In total, something like this:
- Vanilla version of initial servers running
- Create migration schema for initial version and create schema for final version
- Replace vanilla version of initial servers with migration-aware version of initial servers
- Start the meat of the migration, updating all the rows
- Wait until all the data is migrated
- Kick up migration-aware final version servers
- Shut down initial version servers
- Kick up vanilla final version servers
- Shut down migration-aware final version servers
- Finish up the migration by dropping the initial version schema
One caveat here: the `updateWithInverse` step above will only work if the migration from initial to final does not lose information. If information is being dropped, then we need to find a way to retain it when migrating an object backwards from the final to the initial version. The easiest way to do this would be to supply the original `P1` when it is available. Something like this:
Migration.Builder.updateWithInverse[P1, P2](f: P1 => P2, g: (P2, Option[P1]) => P1)
Here, we stipulate that the `Option[P1]` is `None` if the object was newly created by the final version, and `Some(p1)` if the object exists in the initial version as well.
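Here is a small sketch of a lossy forward function paired with an inverse that uses the optional original. The `UserV1` and `UserV2` types, the `legacyCode` field, and the empty-string default are all assumptions made up for this example.

```scala
object InverseSketch {
  // The final version drops legacyCode, so the forward migration loses information.
  final case class UserV1(username: String, email: String, legacyCode: String)
  final case class UserV2(username: String, email: String)

  val forward: UserV1 => UserV2 = u => UserV2(u.username, u.email)

  // The inverse recovers dropped fields from the original when there is one,
  // and falls back to a default for objects created by the final version.
  val backward: (UserV2, Option[UserV1]) => UserV1 = (u2, original) =>
    original match {
      case Some(p1) => p1.copy(username = u2.username, email = u2.email)
      case None => UserV1(u2.username, u2.email, legacyCode = "")
    }
}
```

With the original supplied, `backward(forward(p1), Some(p1))` returns `p1` unchanged, which is the round-trip property the simultaneous-servers phase relies on.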