Why don't you see ghosts in galeras?

The software development sometimes implies changes that affect the data model but such modifications have a series of problems such as the necessity of doing them online or the size of data to be reorganized.

In this post I want to talk about the existing tools for database schema change in MySQL ecosystem, what limitations do those tools have if you use Galera Cluster for replication and how I have adapted gh-ost, the GitHub schema change tool, to be able to use it with Galera.

Schema changes

With MySQL it is possible to make schema changes in two main ways:

Using the ALTER operation of MySQL; it is the simplest way but it does not allow to pause the process in case of necessity and except for certain operations, it prevents the concurrent use of the table to be modified during the process.
Through the use of third party tools that allow a greater control of the process although with certain restrictions. Some of these tools can be pt-osc, OnlineSchemaChange or gh-ost.

These two methods are also applicable if you use Galera although, because it follows a master-master approach, it has some peculiarities to apply database changes, both through the operations provided by MySQL and by the third party tools.

Galera

Galera is a plug-in for InnoDB that provides a multi-master virtually synchronous replication. This approach allows, unlike normal MySQL asynchronous replication, that all the nodes in the cluster have the same data and, therefore, do writes in any node of the cluster sacrificing some write performance.

Galera replication. Image from severalnines

Schema changes are special operations in Galera, since they modify the database and are non-transactional. Galera has two modes for applying database changes using the ALTER operation:

TOI¹ are schema changes made in all nodes of the cluster in the same total order sequence, preventing other transactions from being applied during the operation.
RSU² are schema changes that are executed locally, affecting only the node in which they are executed. The changes are not replicated to the rest of the cluster.

These two modes have a number of drawbacks: the TOI operations prevent other transactions to commit in the whole cluster during the schema change while the RSU operations can leave the cluster in an inconsistent state if the binary log in ROW format is not compatible between schemas.

Regarding the use of third party tools, Galera has a series of limitations among which is the lack of support for table locking, an operation necessary in OnlineSchemaChange and gh-ost tools to apply the last events due to its asynchronous nature, a problem that pt-osc does not have because the events are applied by the database due to the use of triggers.

Although pt-osc is compatible with Galera, the use of triggers presents a number of drawbacks because it is not possible to pause the operation completely if necessary and the use of triggers adds an extra load to the database. Therefore, to allow the use of gh-ost in Galera, it is necessary to investigate why it need the use of LOCK TABLES and look for an alternative approach.

gh-ost

gh-ost is GitHub's tool for applying online schema changes and has interesting advantages over other existing tools. One of its main features is that it doesn't use triggers to synchronize the changes of the original table with the ghost table ³ but the binlogs of MySQL and therefore it really allows to stop the migration completely if necessary.

The gh-ost logo

The reason that prevents using gh-ost with Galera is that, in order to do the cut-over phase, gh-ost makes use of the lock operation, that it is not supported in Galera.

Cut-over

In the documentation of gh-ost the cut-over phase is defined as the final major step of the migration and its goal is to replace the ghost table with the original one. For this purpose, it is necessary that both tables have the same records applied at the time of the swap, otherwise data would be lost in the migration process.

Unlike other online schema change tools such as pt-osc, gh-ost does not delegate in the database the application of the modified records during the migration and, therefore, it needs to momentarily block the writes in the original table until all pending changes in the binlog are applied, that is where the use of the lock operation in the original table is necessary.

LOCK TABLES

The first approach that may come to mind is that although the lock operation in Galera does not allow blocking access to a table in all nodes atomically, it does accept blocking access to a particular node, so it may be possible to apply LOCK TABLES manually in all nodes. After some research and a couple of tests this approach is discarded for several reasons:

The atomic cut-over algorithm of gh-ost needs the use of several independent connections to work and although it guarantees to leave the cluster in a consistent state in the case of errors in the connections, this premise is no longer true if a LOCK TABLES must be applied for each node. If you want to know how gh-ost strategy to apply the atomic cut-over works, I recommend you to read Issue #82.
As can be seen in MDEV-12647, it is quite possible to leave some of the nodes hanging when using the LOCK TABLES on each node.

Therefore, the use of LOCK TABLES is discarded and it is necessary to analyze other existing solutions to perform schema migrations without loss of service and that if they allow its use with the Galera, that is where pt-osc comes into play.

Man thinking. Image from Startup Stock Photos

pt-osc

pt-osc is a Percona tool that allows to perform schema migrations without loss of service compatible with Galera and it is based on delegating to the database the data modified during the migration using triggers, so the changes of the original table are synchronized with the ghost table. If you want to know how to ensure consistency I recommend you to read this answer from StackExchange.

The Percona toolkit logo

Solution proposed

Therefore, if it is possible to perform a consistent migration using triggers it may be possible to use them even if only for the cut-over phase. To do this, it is possible to raise a new cut-over strategy, the procedure for which is as follows:

At the time of starting the cut-over an event is injected in the binlog to notify gh-ost to stop writing the new records in the ghost table but to keep reading them from the binlog.
Once the gh-ost event is received it stops applying the events but stores the values according to the unique key used during the migration.
The necessary triggers are created to synchronize the new records between the original table and the ghost one.
A new event is injected into the binlog to notify that the triggers have been created.
Once this second event is received, the binlog is no longer read and therefore alto it stop storage the values according to the unique key.
The records between the two events may be inconsistent since some recors of the binlog have not been applied. To solve this problem, a clean-up phase is performed, which consists of removing them from the ghost table is exists and inserting them again from the original table.
Once the events have been sanitized, both tables are consistent, so it is possible to use the RENAME TABLE operation to swap the original table with the ghost one in an atomic way.
After that operation or in the case of failure, the triggers are removed.

Results and conclusion

Once I proposed a solution I implemented the necessary changes to the gh-ost code and after a couple of weeks of testing we started using it in production with very good results, so I created the pull request #780.

Regarding the conclusion, even if gh-ost is considered triggerless, the life of the necessary triggers will be equal to the time it takes to apply the pending changes to the binlog and clean up the entries that may be inconsistent, which will be just a few seconds in the worst case, so it is considered acceptable.

Total Order Isolation. ↩
Rolling Schema Upgrade. ↩
With ghost table I refer the table where the schema changes have been applied and that needs to be filled in with the data of the original one. ↩