The synchronization approach relies on CycleCloud’s internal database backup feature to pass snapshots of the data store from your currently active instance to a cold instance in a safe and reliable manner. It requires that you maintain a live machine with a powered-down (cold) CycleCloud instance that matches your active CycleCloud instance. It has the advantage of being very fast to bring up in a disaster recovery scenario thanks to the frequent synchronizations and the already-on hardware.

When your currently active instance fails, you:

  • Restore the cold instance from the latest sync
  • Turn on the cold CycleCloud instance
  • Switch your DNS records to point to this now-running instance.

Creating Your CycleCloud Data Store Backup Policy

Your active instance must be configured to back up the data store to local disk periodically. By default, CycleCloud takes backups at intervals that are friendly to the synchronization approach for disaster recovery.

If you would like to change the default policy, click the Admin menu and select “Browse Data”.

From the list of data store types on the left side of the page, select Application.BackupPlan. Select the single plan in the top half of the table view, then select its entry in the lower half of the table and click the Edit button.

A plan consists of the following attributes:

Attribute        Type     Description
---------        ----     -----------
Name             String   A unique identifier for this backup plan.
Schedule         String   How often to take a snapshot; see The Schedule.
BackupDirectory  String   Where on disk to store the data for this backup.
Description      String   An informative description for the backup plan.
Disabled         Boolean  If true, backups will not be taken for this plan.
                          Existing backups will not be removed. Optional;
                          defaults to false.

The Schedule

The Schedule attribute of a backup plan controls how often to take backups. It supports a rolling schedule of frequent, recent backups and less frequent, older backups. The schedule is a set of non-overlapping intervals of increasing length. For example, a simple “keep backups every hour for a day, and daily after that, for a week” would produce 24 backups in the first day, and 6 more spaced out through the remaining 6 days. The syntax for expressing this is a comma-separated list of durations, followed by the total duration. Whitespace is not significant. This example would be represented as

1h,1d/7d

In this case, as backups are taken, eventually there will be 24, all an hour apart and spanning one day. When the 25th is taken, the oldest backup will be preserved as the first daily backup. Subsequent hourly backups will be deleted until another day elapses, at which time the oldest backup becomes the second daily backup and the day-old backup becomes the first daily backup. This will continue for 4 more days, at which point we will have 24 + 6 = 30 backups. When the oldest backup gets too old (over a week), it is removed. Note that the backups are not labeled as “hourly” or “daily” on disk. They are simply tagged with the date and time they were taken.

The schedule can be a more complicated pattern than the above. The only requirement is that each successive interval be a multiple of the previous. For example:

5m,15m,1h,2h,4h,8h,1d/7d

This would keep 5-minute backups for 15 minutes, 15-minute backups for the rest of the hour, backups at the 1 hour, 2 hour, 4 hour, 8 hour, and 16 hour marks, and daily backups for 7 days, or 14 total backups, with a focus on recent data.
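The divisibility rule can be checked mechanically before saving a new schedule. The following shell sketch (not a CycleCloud utility) converts each duration to minutes and verifies that every interval is a multiple of the previous one:

```shell
#!/bin/sh
# Sketch: validate a backup schedule string such as "5m,15m,1h,2h,4h,8h,1d/7d".
# Illustrates the "each interval is a multiple of the previous" rule.

to_minutes() {
    # Convert a duration like 5m, 1h, or 7d to minutes.
    n=${1%?}                      # strip the trailing unit character
    case "$1" in
        *m) echo "$n" ;;
        *h) echo $((n * 60)) ;;
        *d) echo $((n * 1440)) ;;
        *)  echo "bad duration: $1" >&2; return 1 ;;
    esac
}

validate_schedule() {
    intervals=${1%/*}             # the comma-separated part before the slash
    prev=0
    old_ifs=$IFS; IFS=,
    for d in $intervals; do
        cur=$(to_minutes "$d") || return 1
        if [ "$prev" -ne 0 ] && [ $((cur % prev)) -ne 0 ]; then
            echo "invalid: $d is not a multiple of the previous interval"
            IFS=$old_ifs
            return 1
        fi
        prev=$cur
    done
    IFS=$old_ifs
    echo "valid"
}

validate_schedule "5m,7m/1d" || true              # prints an "invalid" message
validate_schedule "5m,15m,1h,2h,4h,8h,1d/7d"      # prints "valid"
```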

The schedule can be changed at any time, and CycleCloud will attempt to preserve as many existing backups as it can by reusing them for the new schedule. It will delete backups as necessary to match the desired number and distribution of samples over time. It only deletes backups if there are too many in an interval, not too few, so in no case will changing the schedule wipe all the backups and start over.

Creating Your Cold DR Instance

A mirror installation of your CycleCloud instance is required on your cold instance. The cold instance should be installed in the same local path and have the same file-level permissions and ownership. The cold instance should also be set not to start on boot in case the server on which it is staged restarts. If you’re using CycleCloud to manage HTCondor pools, the cold instance will also require the HTCondor binaries to be installed the same as on your currently active instance, or the appropriate Grid Engine binaries if you’re managing xGE pools.

On Linux you can use rsync to create the initial mirror copy of your currently active instance. For example, if the currently active instance is in /opt/cycle_server on MachineA and you want to use MachineB for disaster recovery, you can run the following on MachineA to perform the initial MachineA -> MachineB synchronization:

rsync -avz -e ssh /opt/cycle_server/ remoteuser@MachineB:/opt/cycle_server

Note

For more information on using rsync with ssh please see http://troy.jdmz.net/rsync/index.html.

You will also need to configure MachineB's /etc/init.d/cycle_server file and set CycleCloud not to start in any run level on reboot. You can copy the active instance's /etc/init.d/cycle_server file to your cold instance in much the same way you copied CycleCloud:

rsync -avz -e ssh /etc/init.d/cycle_server remoteuser@MachineB:/etc/init.d/
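With the init script in place on MachineB, CycleCloud must be removed from the boot sequence there. A sketch of the usual commands, assuming a SysV-style service named cycle_server (use the variant your distribution provides):

```shell
# Run as root on the cold instance (MachineB).
# RHEL/CentOS-style systems:
chkconfig cycle_server off

# Debian/Ubuntu-style systems:
update-rc.d cycle_server disable
```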

On Windows you should do a fresh install of CycleCloud using the same CycleCloud installation package that was used to deploy CycleCloud on your active instance. Once installed, stop CycleCloud with "C:\Program Files\CycleCloud\cycle_server.cmd" stop and set the two CycleCloud-related services to not run on boot.

CycleCloud and CycleCloudDB services displayed in the Services menu.

Licensing Your DR Instance

CycleCloud uses node-locked licenses, and your DR instance is unlikely to look the same as your active instance to CycleCloud’s licensing code. You will need a license specific to your DR instance in order to run CycleCloud. To obtain a DR instance license please contact sales@cyclecomputing.com.

Once you have the DR instance’s license text, create a license.dat file in the root CycleCloud installation directory and place the text into this file. Typically, on Linux, this file would be /opt/cycle_server/license.dat and on Windows it would be C:\Program Files\CycleCloud\license.dat. Make sure the file is readable by the account that CycleCloud will run under when it is started on the machine.

Configuring Failover Using a Shared Filesystem

If both CycleCloud hosts mount a shared drive, then configuring hot-to-cold failover is simple.

The simplest failover strategy is to install and run CycleCloud directly from the shared drive. With this option, when the currently active machine fails and CycleCloud stops, CycleCloud can simply be started on the cold backup and will immediately resume operation.

However, running CycleCloud from a shared drive may put a significant load on the filesystem and network. For heavily loaded CycleCloud installations, the recommended configuration is therefore to run CycleCloud from the local drive and store only the backups on the shared drive.

To configure CycleCloud to use a shared drive for backups, select a location on the shared drive to store the CycleCloud backups. Then, for both the active and cold CycleCloud instances, configure the BackupDirectory attribute of the backup plan to use this location. One way to do this is to install the active CycleCloud instance, apply the configuration change, and then copy it to the cold instance using rsync as described above.

Configure Periodic Active-to-Cold Instance Synchronization

An active-to-cold synchronization has two parts: the mainly-static parts and the dynamic data store. The layout of a CycleCloud installation on disk looks as follows:

cycle_server/
        components/
        config/
        cycle_server
        data/
                backups/
                cycle_server/
                derby.properties
        docs/
        lib/
        license.txt
        logs/
        plugins/
        README.txt
        system/
        util/
        work/

The data store, including backups, is in the data directory. The remaining directories and files are mostly static. The sync of the static directories should exclude the data directory, and the sync of the data directory should include only the backups subdirectory.

Using rsync

This approach is intended for use with a Linux-based setup. There are rsync ports available for Windows but they are generally harder to configure and can have issues with complex Windows ACL settings on files and directories.

Your rsync installation should be configured so that password-less rsyncs can occur between your active instance and your cold instance. We recommend using ssh to accomplish this. We also recommend running this process as root on both machines so file permissions and ownership are easily maintained across them. You will need to generate a pair of ssh keys for your active instance to use during the rsync process so that rsync doesn’t require a password to communicate with the cold instance.

To set up your key pair for this rsync, run the following as root on the active instance:

% ssh-keygen -t rsa -b 2048 -f /root/cron/dr-rsync-key
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): [press enter here]
Enter same passphrase again: [press enter here]
Your identification has been saved in ~/cron/dr-rsync-key.
Your public key has been saved in ~/cron/dr-rsync-key.pub.
The key fingerprint is:
2e:28:d9:ec:85:21:e7:ff:73:df:2e:07:78:f0:d0:a0 root@thishost

% chmod 600 ~/cron/dr-rsync-key

For this keypair to work we need to add the contents of ~/cron/dr-rsync-key.pub to the /root/.ssh/authorized_keys file on the cold instance.

First copy the file over to the cold instance:

scp ~/cron/dr-rsync-key.pub root@remotehost:/root/

Then ssh to the cold instance, obtain a root prompt, and import the key:

if [ ! -d .ssh ]; then mkdir .ssh ; chmod 700 .ssh ; fi
cd .ssh/
if [ ! -f authorized_keys ]; then touch authorized_keys ; chmod 600 authorized_keys ; fi
cat ~/dr-rsync-key.pub >> authorized_keys
rm ~/dr-rsync-key.pub
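Back on the active instance, you can confirm that the key was imported correctly by running a trivial remote command with the new key; it should complete without prompting for a password. Here remotehost stands in for your cold instance's hostname:

```shell
ssh -i /root/cron/dr-rsync-key root@remotehost true && echo "key-based login OK"
```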

You’ll also need to make certain that the ssh server configuration on the cold instance either allows ssh access as root:

PermitRootLogin yes

or at least allows the execution of remote commands via ssh as root with:

PermitRootLogin forced-commands-only

From your active instance, you can test that the rsync command works with:

rsync -avz --dry-run -e "ssh -i /root/cron/dr-rsync-key" /opt/cycle_server/ root@remotehost:/opt/cycle_server

That will perform a dry run of the rsync. If you see the rsync command connecting and comparing the files, you know things are set up properly.

With the connection configured and tested, you’re ready to deploy the periodic sync of CycleCloud from active to cold.

The first rsync call synchronizes the mostly static portions of CycleCloud. This includes any plugins that might have been installed and any custom configurations that may have been applied to your CycleCloud instance. A second rsync call is necessary to synchronize the dynamic portion of CycleCloud, the data store, using the periodic backups. We can create a simple script that takes care of this via a cron job:

rsync -avz --delete -e "ssh -i /root/cron/dr-rsync-key" \
        --exclude 'data/' \
        --exclude 'logs/' \
        --exclude 'license.dat' \
        /opt/cycle_server/ root@remotehost:/opt/cycle_server

With the static content synchronized, the backups of the dynamic data store content can be synchronized with:

rsync -avz --delete -e "ssh -i /root/cron/dr-rsync-key" \
        /opt/cycle_server/data/backups/ root@remotehost:/opt/cycle_server/data/backups

These two commands can be combined into a single shell script that performs the rsync, logs the output, and keeps the log file rotated. For example, the following script, /root/cron/dr_sync, combines everything:

#!/bin/sh
COLD_INSTANCE=somehost
DIR=/root/cron
KEY=${DIR}/dr-rsync-key
LOG=${DIR}/dr_sync.log
LOCK=${DIR}/dr_sync.lock
CS_HOME=/opt/cycle_server

exec 9>${LOCK}
if ! flock -n 9 ; then
        echo "Another instance of dr_sync is already running"
        exit 1
fi

echo "----------" >> ${LOG}
date >> ${LOG}

rsync -avz --delete -e "ssh -i ${KEY}" \
        --exclude 'data/' \
        --exclude 'logs/' \
        --exclude 'license.dat' \
        ${CS_HOME}/ root@${COLD_INSTANCE}:${CS_HOME} >> ${LOG} 2>&1

rsync -avz --delete -e "ssh -i ${KEY}" \
        ${CS_HOME}/data/backups/ \
        root@${COLD_INSTANCE}:${CS_HOME}/data/backups >> ${LOG} 2>&1

logrotate /root/cron/dr_sync.conf

The dr_sync.conf file for logrotate looks as follows:

/root/cron/dr_sync.log {
 compress
 missingok
 copytruncate
 nocreate
 notifempty
 rotate 4
 size=5M
 daily
}
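Before relying on the rotation, you can dry-run this configuration; logrotate's -d (debug) flag reports what would be done without modifying any files:

```shell
logrotate -d /root/cron/dr_sync.conf
```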

The suggested frequency for this sync is in the 30-60 minute range. An appropriate entry in root’s crontab for this script would look as follows:

0,30 * * * * /root/cron/dr_sync

Using robocopy

This approach is intended for use with a Windows-based setup. The robocopy command is a Microsoft-supplied advanced copy tool that can synchronize file system trees and shape WAN traffic for large transfers between Windows machines.

The first robocopy call synchronizes the mostly static portions of CycleCloud. This includes any plugins that might have been installed and any custom configurations that may have been applied to your CycleCloud instance. A second robocopy call is necessary to synchronize the dynamic portion of CycleCloud, the data store, using the periodic backups. We can create a simple script that takes care of this via a Scheduled Task.

This example assumes the cold CycleCloud’s C:\Program Files directory is mounted as Z:.

First, synchronize the static content:

robocopy "C:\Program Files\CycleCloud" Z:\CycleCloud /COPYALL /SL /PURGE /XD data logs /XF license.dat /E

With the static content synchronized, the backups of the dynamic data store content can be synchronized with:

robocopy "C:\Program Files\CycleCloud\data\backups" Z:\CycleCloud\data\backups /COPYALL /SL /PURGE /E

These two commands can be combined into a single batch script.

Failing Over to Your Cold DR Instance

If your primary CycleCloud instance fails, you will need to start your DR instance. The first step is to restore the database from the backups you have synced. CycleCloud provides a restore utility to make this process easier. The restore utility is /opt/cycle_server/util/restore.sh on Linux and C:\Program Files\CycleCloud\util\restore.bat on Windows.

The restore utility takes one argument: the target backup directory. In most cases, the most recent backup is the correct choice. If the primary CycleCloud instance became unavailable during the synchronization process, you may need to select the next-most recent.

Backups are stored in $CS_HOME/data/backups and are named backup-%Y-%m-%d_%H-%M-%S-%z (see the strftime(3) man page for more information). Once you have identified the most recent backup, you can run the restore utility:

% ./util/restore.sh data/backups/backup-2015-12-23_12-19-17-0500
Backup: data/backups/backup-2015-12-23_12-19-17-0500
Warning: All current data in the database will be replaced with the contents of this file.
Are you sure you want to restore the backup? [no] yes
Stopping...
Restoring database from data/backups/backup-2015-12-23_12-19-17-0500...
Restored database (0:00:05 sec).
Restarted.

The restore utility will restart CycleCloud; when it finishes booting, you will be able to begin using your DR instance.
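Because the backup directory names described above sort chronologically (within a single timezone), a script can pick the latest backup automatically. A hedged sketch, assuming the Linux layout used throughout this document:

```shell
# Sketch: select the most recent backup by name. The
# backup-%Y-%m-%d_%H-%M-%S-%z names sort chronologically,
# so a plain lexicographic sort suffices.
CS_HOME=/opt/cycle_server
latest=$(ls -1d "$CS_HOME"/data/backups/backup-* 2>/dev/null | sort | tail -n 1)
if [ -n "$latest" ]; then
        echo "Most recent backup: $latest"
else
        echo "No backups found in $CS_HOME/data/backups" >&2
fi
```

The resulting path can then be passed directly to the restore utility.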