Data Migrations

Data migrations are straightforward if we accept that the system can remain unavailable indefinitely during the migration. It is also straightforward enough to provide read operations throughout a migration. The main complexity with migrations is how to perform them without disruption of write operations in the system.

When planning a migration, we have to consider some questions:

How much data are we dealing with?
Do we need to keep all the data?
What components are depending on
Why are we migrating?
Is this an opportunity to transform, sanitize, or filter the data in some way?
What operations need to remain available to clients during the operation?
Does the answer change depending on the expected duration of the migration? In other words, is a short outage okay but a longer outage not acceptable?

Ultimately, a successful migration must do these things:

Ensure that the new datastore contains all desired data from the old datastore up to and including the moment that writes to the old datastore are stopped
Ensure that the new datastore is able to successfully handle the traffic that the old datastore was receiving
Gracefully handle a rollback scenario in case things go wrong

Migrating via a secondary background process

One approach for migrating would be to leave the existing system as-is while bringing the new datastores up to speed before switching over. That could look something like this:

If there is more than one component directly reading or especially writing to the datastore, consider first standing up an API in front of the datastore and directing all operations through that
Implement write operations toward the new datastore and run them alongside writes to the old (failed writes to the new datastore should be monitored but not yet appear as a failure to the client)
After observing that writes to the new datastore are reliable, modify the implementation to require successful writes to both datastores. From this point on, incoming data should exist in both or none of the two datastores. On the other hand, effective latency and failure rate will be the worst of the two datastores. You may want a mechanism for quickly disabling this requirement if a problem does emerge
Perform backfill of historic data into the new datastore as an offline process
Perform any validation of the new datastore
Switch read operations over to the new datastore, perhaps gradually. Because writes to the old datastore are still ongoing, we can gracefully rollback to read from the old datastore at any point
After observing that the system is stable when reading from the new datastore for a sufficient amount of time, writes to the old datastore can be disabled. The amount of wait time is a tradeoff between storing twice as much incoming data and the impact on latency or failure rate that the old datastore may be having.

Migrating as part of regular operations

Alternatively, data can be migrated if and when it is accessed. For example, if user accounts are being migrated or merged into a different system, users may be required to verify and update their information and credentials the next time they attempt to login. After some time, only the least regular users (or, more generally, least frequently accessed data) remains unmigrated and can be dealt with via a secondary process with a smaller scope that is less likely to impact the users or data that are most often needed.

Strategies for handling breaking changes

Upcasting / Transformation

Data remains stored with its original schema version
Upcasting logic appropriate for a given schema version runs at read-time to transform the data to a current, standard representation
Upcasting logic grows, can become complicated, and must be maintained indefinitely
Lowers operational risk and complexity while increasing runtime cost and maintenance complexity

Versioned Schemas

Similar to upcasting, data is stored in its original form with an associated type or version identifier
Unlike upcasting, the data is consumed throughout the system in its original form with version-specific logic maintained where needed
Stored data remains untouched, but decentralized, version-specific logic makes the system more complex

Backward-compatible Schema Evolution

The schema is modified using only backward-compatible changes (adding new optional fields with default values)
No transformations or version-specific logic needed
Simplest option, but can't handle truly breaking semantic changes

Copy-and-Transform

Perform an "offline" process to read, transform, and write records to a new schema
Runtime system logic only needs to consider the current schema, no transformation or legacy code
Greater operational risk, including potential downtime and irreversible data loss
Makes sense when necessary to simplify the codebase and when a controlled migration window is acceptable

Dual-Write / Shadowing

Records are written to new and old schemas by the live system in parallel
Consumers gradually switch to reading from the new schema
The old schema can be retired once all consumers are migrated
Can eliminate any disruption to operations at the cost of temporary overhead and complexity