The following lists the strategies and best practices you should cultivate when diagnosing and repairing replication problems. They are presented in no particular order because depending on the problem you are trying to solve one or more may be helpful.
Know Your Topology
If you are using MySQL replication on a small number of servers then it may not be that difficult to commit to memory how the topology is configured. It may be as simple as a single master and one or more slave or as complex as two servers in a multi-master topology. However, there is a point where committing the topology and all of its configuration parameters becomes impossible.
The more complex the topology and the configuration, the harder it becomes to determine where to start looking for problems or worse still where to to begin your repair operations. It would be very easy to forget a lone slave in a topology of hundreds of slave servers.
It always a good idea to have at your fingertips a map of your topology and the current configuration settings. You should keep a record of your replication setup in a notebook or file and place it where you and your colleagues or subordinates can find it. This information will be invaluable to someone who understands replication administration but may have never worked with your installation.
You should include a textual or graphical drawing of your topology and notate any filters (master and slave) as well as the role of each server in the topology. You should also consider including either the CHANGE MASTER command complete with options and the contents of the configuration files for all of your servers. A drawing of you topology need not be a sophisticated or artistic wonder. A simple line drawing will do nicely.
There are certain problems inherent in certain topologies. We have already discussed many of these in this and previous chapters. What follows is a list of the types of topologies including a short description and some common issues to be aware concerning known vulnerabilities.
- Star (also called single master) - This is the typical single master and many slaves topology. There are no special limitations or problems other than those that apply to replication as a whole. However, the process of promoting a slave to a master can be complicated in that you must ensure to use the slave that is most up-to-date with the master. This can be difficult to determine and may require examining the state of every slave.
- Chain - This configuration is where a server is a master to one slave where the slave is in turn a master to another slave and so on with the end point being a simple slave. Other than the server id issue mentioned above, there is also the problem of knowing which position an intermediate master/slave was at when it failed. Promoting a new master/slave node could require additional work to ensure consistency.
- Circular (also called ring) - This is the same as the chain topology except there is no end point. This topology requires careful setup so that events are terminated at the originating server.
- Multi-master - This is similar to a circular topology but in this case each master is a slave of every other master. This topology has all of the limitations and problems of circular replication as well as being one of the most complex topologies to manage.
- Hybrid - A hybrid topology is one that uses elements of all other topologies. You typically see this form in a large organization that has compartmentalized functions or divisions that require similar but often isolated elements of the infrastructure. Isolation is accomplished using filtering whereby data is divided among several slaves from an original master (sometimes called the prime or master master) that contains all of the data to slaves which become masters to their own star topologies. This is by far the most complex topology and requires careful notation and diligence to keep your documentation updated.
Check the Status of All of Your Servers
One the best preventative tasks you can perform is to regularly check the status of all of your servers in your topology. This need not be complicated. For instance, you could setup a scheduled task that launches the MySQL client and issue a SHOW MASTER STATUS or SHOW SLAVE STATUS and print out or email you the results.
Of course, you have to read these emails! We would suggest setting up the scheduled task to run at a point before you have time to read them. For example, you could setup the tasks to run right before lunch then read the output during or after lunch.
Keeping a close eye on your server status this way is a good way to be informed when there is a potential problem and it also gives you a chance to react more quickly when errors occur.
You should look for errors, monitor the slave lag, check your filters to see that they match your records, and ensure all of your slaves are running and not reporting warnings or connection errors.
Check Your Logs
Along with checking your server status regularly, you should also consider checking the logs of your servers periodically. You can accomplish this easily using the MySQL Administrator graphical user interface to connect to each server in turn and look at the most recent entries in the logs.
If you are diligent in this effort, it will pay dividends in the form of detecting problems early. It is sometimes the case that errors or warnings are written to the log which, at the time, may not manifest in anything detectable. Catching these telltale signs early can make your repair work much easier.
We recommend examining your logs at the same time you examine the status of your servers like described above.
Check Your Configuration
Like the status and log files, you should also routinely examine your configuration files to ensure your documentation is up to date. This isn't as critical as checking the logs but it is often overlooked. We recommend checking your configuration files and updating your documentation at least once a week if you have a lot of changes and at least once a month if you have few changes to the topology or servers. If you have an environment where there are more than one administrator, you may want to consider doing this more frequently.
Orderly Shutdown
Sometimes it is necessary to stop replication while you diagnose and repair a problem. If it is the case that replication has already stopped, you may not need to do anything, but if you have a complex topology and the error is related to data loss it may be safer to stop replication across the topology. But you should do so in a controlled and safe manner.
There are several strategies to doing a controlled shutdown of your replication topology. If data loss is an issue and slave lag is not, you may want to lock the tables on your master and flush the binary logs then wait for all of the (remaining) slaves to catch up then shut the slaves down. This will ensure all events are replicated and executed on the slaves that are operational.
On the other hand, if the problem is severe you may want to start with the slaves on the leaves of your topology and work your way up the topology leaving the master running. However, if you leave the master running (e.g., you don't lock all of the tables) and you have a heavy load of updates running and your diagnostic and repair takes a long time, your slaves will be lagging behind your master when you restart replication. It is better to stop updates on the master if you think your repair will take a long time.
If you are faced with a difficult problem or (worse) a problem that appears randomly or without warning you may want to consider shutting down replication completely. This is especially true if you have a situation with only one of your slaves. Shutting down replication will allow you to isolate the server while you diagnose the problem.
Starting Replication After a Failure
It is also important to restart replication in an orderly manner. It is often best to restart the replication topology under more controlled conditions such as only one master and one slave. Even if your topology is more complex, having the most basic of building blocks of replication started initially will allow you to test and ensure the problem is fixed before restarting all of your servers.
Isolation in this case is essential when dealing with problems that are confined to a single slave or with a set of events. It is also helpful to start with a single master and a single slave so that if you cannot fix the problem, you can get help from MySQL professionals easier because you have isolated the problem to the smallest set of parameters as possible. See the section below on reporting replication bugs for more details.
Manually Execute Failed Queries
One of the most overlooked strategies is examining the queries in the relay log or binary log for clues to what has gone wrong. It is easy to get stuck on researching the error on the slave and diagnosing all of the many things that can go wrong, but sometimes and especially when there is a query involved, you can gain far more information about the problem by isolating the slave server in question and attempting the query manually.
If you are using statement-based replication this is an easy task because the query is human readable in the binary or relay log, but if you use row-based replication execution of the query can still be done but you cannot read the query itself. In this case, a malformed query or a query that uses a missing, incorrect, or corrupt reference may not be obvious until you execute the query manually.
Remember, you should always make a backup of your data before attempting diagnostics that could result in changes to the data. Running a broken query is definitely one of those cases. We have seen where queries that cause replication errors can sometimes succeed when run manually. When this occurs, it is usually indicative of a slave configuration error or binary or relay log issue rather than a problem with the query itself.
Server bottlenecks and failures are a fact of life in any database deployment, but they don't have to bring everything to a halt. MySQL provides several features that can protect you from outages, whether you're running directly on the hardware, on virtual machines, or in the cloud. This book shows you how to use these features effectively, and helps you determine which combination of features will give you the most reliable system for a price you can afford.




Help





