Automatic and Manual Recovery
Recovery is the process of restoring a federated database to a consistent state after a transaction fails to commit. Depending on the nature of the failure, recovery is performed by the application that started the transaction or through one of several automatic recovery mechanisms that you enable.
This chapter describes:
General information about Objectivity/DB recovery mechanisms.
Enabling automatic recovery from application, client-host, or lock-server failures.
Performing manual recovery when automatic recovery is not possible.
About Recovery
Recovery is required in both of the following situations:
A read transaction obtains locks then terminates without releasing the locks.
An update transaction obtains locks, makes changes to data, and then terminates without committing the changes and releasing the locks.
Objectivity/DB recovers a transaction by:
Rolling back any uncommitted changes. This restores the federated database to the logical state it was in before the transaction started. Objectivity/DB uses the information recorded in one or more journal files to roll back changes.
Instructing the lock server to release all locks held by the transaction. (If the lock server has stopped, its locks are lost, so there are no locks to release.)
An executing Objectivity/DB application initiates recovery automatically whenever it aborts a transaction. An application may abort a transaction directly through its program logic, or indirectly through a signal or exception handler. Thus, if an application catches an interrupt during a transaction (for example, a user enters Control-c on Linux), the default Objectivity/DB signal handler aborts the transaction and rolls it back.
Sometimes a transaction terminates due to an application failure, client-host failure, or lock-server failure that does not result in an abort. For example, the failure may produce a signal or exception that the application cannot handle, or it may simply stop execution, preventing the abort from taking place. After such a failure, recovery must be performed by an Objectivity/DB process other than the one that started the transaction.
You can enable various Objectivity/DB processes to initiate recovery automatically after application, client-host, and lock-server failures, as described in the following sections. When automatic recovery is enabled, manual recovery is normally necessary only when an Objectivity/DB host becomes permanently inaccessible after a failure.
Automatic Recovery From Application Failures
Objectivity/DB applications can fail in a number of ways, leaving incomplete transactions on a federated database. Application failures can be caused by:
A Windows application issuing an ExitProcess or TerminateProcess.
Quitting a debugging session during a transaction.
A SIGKILL signal, for example, from a kill -9 command.
No special arrangement is required for the automatic recovery of incomplete read transactions; if a failed Objectivity/DB application leaves any incomplete read transactions, the locks from these transactions are released immediately when the application’s connection to the lock server is broken.
You can arrange for the automatic recovery of incomplete update transactions by:
Enabling the lock server’s recovery monitor; see Using the Lock Server’s Recovery Monitor.
Automatic Recovery From Client-Host Failures
Client hosts can fail in ways that prevent an application’s exit or termination signal from being caught—for example, when:
The client host’s operating system fails.
The client host loses power.
The network connection between the client host and the lock server fails.
When a client host fails in one of these ways, Objectivity/DB applications running on the host may leave incomplete transactions on one or more federated databases.
You can arrange for automatic recovery from client-host failures using one or more of the following techniques:
You can enable the lock server’s recovery monitor, which periodically tests for client-host failures, and then recovers any transactions that originated on a failed host; see Using the Lock Server’s Recovery Monitor.
You can configure each client host so that it runs a recovery tool automatically after restarting; see Configuring a Client Host to Perform Recovery.
Unrecovered transactions can prevent any active applications from accessing objects in the same federated database. If you choose not to enable the lock server’s recovery monitor and you are not able to restart a failed client host in a timely fashion, you should perform manual recovery; see Manual Recovery From Client-Host Failures.
Using the Lock Server’s Recovery Monitor
You can enable the lock server’s recovery monitor to initiate recovery of transactions. A recovery monitor is a thread within a lock server process that identifies and recovers incomplete transactions belonging either to:
Processes that ran on client hosts that are now no longer operational; see below.
Processes that have been disconnected from the lock server, although their client hosts are still operational; see also Automatic Recovery From Application Failures.
The recovery monitor checks whether client hosts are operational by attempting to make a network connection to each client host and then waiting for a response within a specified timeout period. If a client host does not respond for any reason (for example, due to machine or network problems), it is assumed to have failed, and all transactions started by its processes are automatically recovered.
Besides monitoring client-host states, the recovery monitor also listens for the error that occurs when an application fails and becomes disconnected from the lock server. Such an error triggers automatic recovery of all transactions started by the failed application process.
Enabling a recovery monitor is recommended whenever multiple applications access the same federated database from multiple client hosts, particularly if the network is unreliable. A recovery monitor can simplify system administration tasks by reducing the need for immediate manual intervention whenever a client host fails.
Enabling the Recovery Monitor
To enable the recovery monitor:
Enter the following command at a command prompt:
objy StartLockServer -monitor
Note: On Windows, you must start the command prompt as administrator.
By default, the recovery monitor tests each client host every 60 seconds and waits 30 seconds for a response. You can adjust these times by adding the -interval and -timeout options, respectively.
Example This command enables the lock server’s recovery monitor with an interval of 45 seconds and a timeout period of 35 seconds.
objy StartLockServer -monitor -timeout 35 -interval 45
 
Checking for an Enabled Recovery Monitor
You can run the CheckLs tool to find out whether a lock server on a particular host is running with the recovery monitor enabled.
Guidelines for Configuring the Recovery Monitor
Use the following guidelines when choosing a test interval and timeout period:
The recovery monitor’s timeout period should be at least as long as the network timeout period used by your applications; see Lock-Server Timeout and AMS Timeout.
The recovery monitor’s test interval and timeout period together should be shorter than the amount of time each application waits for lock requests to be granted (that is, the application’s lock-waiting timeout period).
For example, assume your applications are programmed to wait at least two minutes to obtain locks. By default, the recovery monitor detects failed hosts and performs recovery every minute and a half (60+30 seconds), so stale locks are released before lock requests from the applications time out.
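The two guidelines above amount to simple arithmetic, which the following sketch checks for a proposed configuration. The function name, its argument order, and the 25-second network timeout in the usage line are illustrative assumptions, not part of Objectivity/DB.

```shell
#!/bin/sh
# Sketch: check a proposed recovery-monitor configuration against the two
# guidelines. All values are in seconds. check_monitor_timing is a
# hypothetical helper, not an Objectivity/DB tool.
check_monitor_timing() {
    interval=$1      # value intended for -interval
    timeout=$2       # value intended for -timeout
    net_timeout=$3   # network timeout used by your applications
    lock_wait=$4     # lock-waiting timeout used by your applications
    status=0

    # Guideline 1: the monitor timeout is at least the network timeout.
    if [ "$timeout" -lt "$net_timeout" ]; then
        echo "timeout $timeout is shorter than network timeout $net_timeout"
        status=1
    fi

    # Guideline 2: interval plus timeout is shorter than the lock wait.
    worst_case=$((interval + timeout))
    if [ "$worst_case" -ge "$lock_wait" ]; then
        echo "interval+timeout ($worst_case) is not shorter than lock wait $lock_wait"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "ok: worst-case detection time is $worst_case seconds"
    return "$status"
}
```

With the default monitor settings and applications that wait two minutes for locks, `check_monitor_timing 60 30 25 120` prints `ok: worst-case detection time is 90 seconds`.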
Recovery of a Transaction While the Application is Still Running
A recovery monitor cannot guarantee that a client host has actually failed—only that the host has failed to respond within the specified time limit. Consequently, it is possible for the monitor to occasionally recover a transaction that belongs to an application that is still running. When this happens, the lock server disconnects from the application to prevent any further activity in the transaction; an error is signaled in the application, which aborts the transaction. The application automatically reconnects to the lock server by starting a new transaction.
When Recovery is Not Possible
Sometimes a recovery monitor cannot recover a transaction—for example, because a failed data-server host prevents access to the data files in which changes are to be rolled back. The recovery monitor makes no further attempt to recover such transactions, even while recovering other transactions left by subsequent client-host failures. Any unrecovered transactions must be cleaned up manually when the necessary files are accessible again; see Performing Manual Recovery.
The recovery monitor records recovery errors in the lock server’s message log; see Message Logging.
Configuring a Client Host to Perform Recovery
You can configure each client host to perform automatic recovery by running CleanupFd after being restarted. The CleanupFd tool causes Objectivity/DB to roll back any incomplete transactions that exist on the specified federated databases. If the lock server is still running, the locks held by these transactions are released. If the lock server is not running, the CleanupFd command fails, because CleanupFd requires the -standalone option to run without a lock server; in this case, you can run CleanupFd with the -standalone option from a command prompt.
The following subsections describe details for specific platforms.
Windows
You can configure a Windows client host to run CleanupFd whenever a user logs in:
For each federated database that is accessed by applications running on the client host, add the following command to the Startup program folder:
objy CleanupFd -local -bootFile bootFilePath
If the client host is also the lock-server host, then rebooting the host restarts the lock server automatically as a Windows service. You should configure the lock server to initiate recovery for every federated database accessed by an application that runs on the host; see Performing Recovery at Lock-Server Startup. Adding CleanupFd to the Startup program folder is redundant but not harmful.
Linux
You can configure a Linux client host to run CleanupFd whenever the system reboots:
For each federated database that is accessed by applications running on the client host, add the following command to the startup script (usually /etc/rc.local):
objy CleanupFd -local -bootFile bootFilePath
If the client host is also the lock-server host, you should check whether the host’s startup script runs the lock server:
If the startup script runs the lock server, you should configure the lock server to initiate recovery for every federated database accessed by an application that runs on the host; see Performing Recovery at Lock-Server Startup. You do not need to add the CleanupFd command to the startup script.
If the startup script does not run the lock server, you should either add the lock server to the script or add the CleanupFd command with the -local and -standalone options.
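When a client host accesses several federated databases, the startup script needs one CleanupFd command per boot file. A minimal sketch of such a fragment follows. The OBJY variable, the function name, and the boot-file paths in the comment are hypothetical; only the CleanupFd options shown earlier in this section are used.

```shell
#!/bin/sh
# Sketch of a startup-script fragment. OBJY defaults to the objy launcher;
# the variable and the function name are illustrative assumptions.
OBJY=${OBJY:-objy}

# Run CleanupFd -local once for each boot file passed as an argument.
# Returns non-zero if any invocation failed.
cleanup_local_fds() {
    failures=0
    for boot_file in "$@"; do
        if ! "$OBJY" CleanupFd -local -bootFile "$boot_file"; then
            echo "CleanupFd failed for $boot_file" >&2
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}

# Example call (hypothetical boot-file paths):
# cleanup_local_fds /data/mfg/mfgFD /data/sales/salesFD
```

Continuing after a failure, rather than stopping at the first one, lets the remaining federated databases be recovered even if one boot file is temporarily unreachable.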
Automatic Recovery From Lock-Server Failures
If a lock server or its host fails while transactions are in progress, the failure leaves incomplete transactions on one or more federated databases. You initiate automatic recovery from lock-server failures by restarting the lock server. The way you start the lock server determines when recovery is performed, as described in the subsections below.
If a lock-server host becomes permanently inaccessible, incomplete transactions cannot be recovered automatically. To recover from this situation, see Manual Recovery From Lock-Server Host Failures.
When a lock server stops, the locks it manages are lost, so recovery just rolls back uncommitted changes. If uncommitted changes cannot be rolled back (for example, because a data-server host is inaccessible), the lock server reestablishes the appropriate locks.
Performing Recovery at Lock-Server Startup
You can cause automatic recovery to be performed immediately after the lock server restarts.
To cause the lock server to perform automatic recovery at startup for two federated databases (bootFilePath1 and bootFilePath2):
Enter the following command at a command prompt:
objy StartLockServer -recoverBootFile bootFilePath1 -recoverBootFile bootFilePath2
Note: On Windows, you must start the command prompt as administrator.
You can specify boot files for any or all of the federated databases that are serviced by the lock server.
When the lock server starts, Objectivity/DB rolls back all incomplete transactions on the specified federated databases. New transactions may experience a delay until recovery is complete.
If the lock server services multiple federated databases and you specify only some of them explicitly, the specified federated databases are recovered when the lock server is restarted. Each of the remaining, unspecified federated databases is recovered the next time an application accesses it, as described in “Performing Recovery When Locks are Requested” below.
Performing recovery at lock-server startup is recommended in a distributed environment because you can ensure that the lock server gets boot-file pathnames it can resolve; if the lock server cannot resolve a pathname, you get an error message when the lock server starts.
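If the lock server services many federated databases, the list of -recoverBootFile options can be generated from a list of boot-file paths. The following is a sketch only; the OBJY variable and the function name are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: start the lock server with one -recoverBootFile option per
# boot file. OBJY and the function name are illustrative assumptions.
OBJY=${OBJY:-objy}

start_lock_server_with_recovery() {
    if [ "$#" -eq 0 ]; then
        echo "usage: start_lock_server_with_recovery bootFilePath..." >&2
        return 2
    fi
    # Rewrite the argument list in place so that each path is
    # preceded by -recoverBootFile.
    n=$#
    while [ "$n" -gt 0 ]; do
        boot_file=$1
        shift
        set -- "$@" -recoverBootFile "$boot_file"
        n=$((n - 1))
    done
    "$OBJY" StartLockServer "$@"
}

# Example call (placeholder paths from the text above):
# start_lock_server_with_recovery bootFilePath1 bootFilePath2
```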
Performing Recovery When Locks are Requested
You can delay the automatic recovery of each serviced federated database until data is requested from it. To do this:
Start the lock server without specifying any boot-file paths:
objy StartLockServer 
Note: On Windows, you must start the command prompt as administrator.
Recovery is initiated on a particular federated database when its boot-file name is passed by the application to the lock server. Thus, the first time a federated database is accessed by an application after the lock server restarts, Objectivity/DB rolls back any incomplete transactions on that federated database. Recovery from the lock-server failure proceeds independently of whether recovery is enabled or disabled in the application.
Automatic recovery is performed only once per federated database during the lifetime of a particular lock-server process. If a federated database is recovered when the lock server starts, the lock server will not recover the same federated database again when an application later accesses it. (Subsequent recovery may be performed by a recovery-enabled application, however.)
Guidelines for Lock-Server Setup
The following subsections provide guidelines for setting up the lock server so that it can perform automatic recovery, either with or without an enabled recovery monitor.
Access Required by the Lock Server
The lock server must be able to access all the Objectivity/DB files for every federated database being serviced. Therefore, you must ensure that:
The lock-server host can access all file systems containing the relevant files.
The lock server has sufficient permissions to:
Read the boot file for each federated database being serviced.
Read and write all data files and journal files, and the directories containing them.
Setting Up Recovery in Mixed Environments
The lock server must be able to resolve every boot-file pathname that is passed to it by a tool or application. (A boot-file pathname passed by an application is taken from the application’s connection to the federated database, and expanded, if necessary, to the form host::full_path.)
The lock server must be able to resolve the names of every data file and journal directory that is listed in a specified boot file or in the federated database’s catalogs.
In general, the lock server consults the local file system to resolve the names of local files (or remote Windows Network files referenced by UNC share names). The lock server contacts either AMS or NFS on remote hosts to find all other remote files.
When Objectivity/DB is distributed among network nodes of different architectures, you should consider the following guidelines to enable the lock server to access all of the files it needs:
Always start a lock server with one or more explicitly specified -recoverBootFile options. This way you can ensure that the lock server gets a pathname it can resolve. Furthermore, if the lock server cannot resolve a pathname, you get an error message when the lock server starts.
Set up all data servers to run either AMS or NFS.
If data servers run AMS (recommended), you can use host-specific pathnames in the location properties for Objectivity/DB files.
If data servers run NFS, you can use NFS pathnames in the location properties for Objectivity/DB files.
If you set up Windows data servers to use Windows Network without AMS:
Run the lock server on a Windows node that recognizes the UNC share names used by your Windows applications. Do not use a Linux node as the lock-server host because it will not be able to resolve the UNC share names.
Be sure the lock server logs on to an account that can use the UNC share names.
Performing Manual Recovery
In rare circumstances, Objectivity/DB may not be able to recover a federated database automatically. In these situations, you can recover a federated database manually using the procedures described in this section. Possible scenarios that require manual recovery after a failure are summarized in the following table.
 
Scenario: An application fails, and automatic recovery is not enabled in this application.
Manual recovery needed (see note 1): On the client host, run CleanupFd with the -local option.

Scenario: A client host with no Objectivity/DB files becomes permanently inaccessible or cannot be restored in a timely fashion.
Manual recovery needed (see note 1): On another host, run CleanupFd with the -deadHost option.

Scenario: The lock-server host becomes permanently inaccessible.
Manual recovery needed (see note 1): Run CleanupFd with the -standalone option, or start a new lock server on a host that you renamed to the hostname of the failed lock-server host.

Scenario: A data-server host fails temporarily, preventing access to Objectivity/DB files needed for automatic recovery.
Manual recovery needed (see note 1): Run CleanupFd with the appropriate options after the data-server host is restored.

Scenario: A data-server host becomes permanently inaccessible.
Manual recovery needed (see note 1): Restore the federated database from a backup archive.

Note 1: The CleanupFd option listed for each scenario is the minimum required in addition to the bootFilePath.
About Manual Recovery With CleanupFd
Most of the manual recovery scenarios require that you run CleanupFd with the -bootFile option and an additional recommended option. If you encounter a problem that combines two or more of these scenarios, you combine the recommended options into a single command. For example, if the same computer is both a client host and the lock-server host, and that computer becomes unavailable, you should run CleanupFd with the -deadHost option, the -standalone option, and the -bootFile option.
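The way options combine can be sketched as a small wrapper that adds -deadHost and -standalone only when the corresponding scenario applies. The wrapper, its argument convention, the OBJY variable, and the host name in the comment are hypothetical.

```shell
#!/bin/sh
# Sketch: build one CleanupFd command from the scenarios that apply.
# OBJY and manual_cleanup are illustrative assumptions.
OBJY=${OBJY:-objy}

#   $1  boot-file path (required)
#   $2  name of a failed client host, or "" if none
#   $3  "yes" if no lock server is available, otherwise "no"
manual_cleanup() {
    boot_file=$1
    dead_host=$2
    no_lock_server=$3
    set --    # start with an empty option list
    if [ -n "$dead_host" ]; then
        set -- "$@" -deadHost "$dead_host"
    fi
    if [ "$no_lock_server" = "yes" ]; then
        set -- "$@" -standalone
    fi
    "$OBJY" CleanupFd "$@" -bootFile "$boot_file"
}

# Example (hypothetical host name): the failed computer was both a
# client host and the lock-server host.
# manual_cleanup mfgFD hostA yes
```

The example call runs CleanupFd with -deadHost hostA, -standalone, and -bootFile mfgFD in a single command.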
The CleanupFd tool uses journal files to roll back the uncommitted changes made by incomplete transactions. If the lock server is still running, CleanupFd also releases the locks held by the incomplete transactions.
You must run CleanupFd under a user account that can:
Read the boot file.
Read and write all journal files and the directories that contain them.
Read and write all data files.
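Before running CleanupFd, you may want to confirm that the current account has this access. The sketch below assumes that the boot file, journal directory, and data files are visible as ordinary local paths; it cannot check files that are reached only through AMS or a host::path name, and it checks the journal directory itself rather than each journal file. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that the current account has the access CleanupFd needs.
# check_cleanup_access is a hypothetical helper.
#   $1      boot file
#   $2      journal directory
#   $3...   data files
check_cleanup_access() {
    if [ "$#" -lt 2 ]; then
        echo "usage: check_cleanup_access bootFile journalDir [dataFile...]" >&2
        return 2
    fi
    boot_file=$1
    journal_dir=$2
    shift 2
    status=0

    # The boot file only needs to be readable.
    if [ ! -r "$boot_file" ]; then
        echo "cannot read boot file: $boot_file"
        status=1
    fi

    # The journal directory must exist and be readable and writable.
    if [ ! -d "$journal_dir" ] || [ ! -r "$journal_dir" ] || [ ! -w "$journal_dir" ]; then
        echo "journal directory not accessible: $journal_dir"
        status=1
    fi

    # Each data file must be readable and writable.
    for data_file in "$@"; do
        if [ ! -r "$data_file" ] || [ ! -w "$data_file" ]; then
            echo "cannot read and write data file: $data_file"
            status=1
        fi
    done
    return "$status"
}
```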
Whenever possible, you should run CleanupFd on the same machine as the process that started the transaction to be recovered. This allows Objectivity/DB to verify that the process is no longer active, so it is safe to recover the transaction. If the state of a process cannot be verified (for example, because its client host or network connection has failed), Objectivity/DB will not recover any transaction started by that process without confirmation.
Warning: CleanupFd may not successfully roll back a transaction if the transaction terminated while databases were being deleted. Under these circumstances, the federated database may remain in an inconsistent state. Currently, the only way to recover from such a situation is to restore the federated database from backup archives.
Manual Recovery From Application Failures
If an application leaves incomplete transactions after terminating abnormally, you can initiate automatic recovery by starting a recovery-enabled application that accesses the same federated database. If, however, automatic recovery is not enabled in any applications, or you cannot restart a recovery-enabled application in a timely fashion, you must manually recover the incomplete transactions.
Note: If the application fails due to client-host or network failures, you must recover its transactions as described in “Manual Recovery From Client-Host Failures” below.
To manually recover all incomplete transactions started by applications that are known to no longer be active, use CleanupFd with the -local option and the -bootFile option.
Note: You should check whether all incomplete transactions were recovered by CleanupFd. If some were not recovered, they might belong to active applications. To manually recover those incomplete transactions, run CleanupFd with the -process option, the -local option, and the -bootFile option.
To manually recover a particular incomplete transaction, use CleanupFd with the -transaction option and the -bootFile option.
To obtain a list of incomplete transactions, run CleanupFd with just the -bootFile option.
Example This command recovers the transaction 357254 for the federated database named mfgFD.
objy CleanupFd -transaction 357254 -bootFile mfgFD
The CleanupFd tool checks to make sure the process that owns the transaction is terminated before performing the recovery.
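When several transactions remain after you list them, the -transaction form can be repeated for each one. The following sketch loops over transaction identifiers that you supply; the OBJY variable, the function name, and the second identifier in the comment are hypothetical.

```shell
#!/bin/sh
# Sketch: recover several listed transactions one at a time.
# OBJY and recover_transactions are illustrative assumptions.
OBJY=${OBJY:-objy}

#   $1      boot-file path
#   $2...   transaction identifiers taken from the CleanupFd listing
recover_transactions() {
    if [ "$#" -lt 2 ]; then
        echo "usage: recover_transactions bootFilePath transactionId..." >&2
        return 2
    fi
    boot_file=$1
    shift
    failures=0
    for txn in "$@"; do
        if ! "$OBJY" CleanupFd -transaction "$txn" -bootFile "$boot_file"; then
            echo "could not recover transaction $txn" >&2
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}

# Example call (the second identifier is hypothetical):
# recover_transactions mfgFD 357254 357255
```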
 
Manual Recovery From Client-Host Failures
Failure of a client host can cause Objectivity/DB applications to leave incomplete transactions. You normally arrange for automatic recovery from client-host failures by enabling the lock server’s recovery monitor. Alternatively, you can set up client hosts or Objectivity/DB applications to perform automatic recovery, so that when a client host fails, you can initiate recovery by rebooting the host and restarting its applications.
If the client host is permanently inaccessible (for example, due to a disk failure or other hardware problem), or if the client host cannot be rebooted in a timely fashion, you can manually recover the incomplete transactions.
To manually recover all the incomplete transactions that originated on a failed host, run CleanupFd on another host, passing the name of the failed host to the -deadHost option.
To manually recover a particular incomplete transaction, you can run CleanupFd on another host, specifying the -transaction option.
Note: If the client host becomes permanently inaccessible, and this host also contains Objectivity/DB files, recovery cannot roll back any updates to those files. You must restore the federated database from a backup archive; see Restoring From a Backup.
Manual Recovery From Lock-Server Host Failures
If a lock-server host fails, but not permanently, you do not need to perform manual recovery. Instead, you should restart the lock server as described in Performing Recovery at Lock-Server Startup.
If a lock-server host becomes permanently inaccessible, you must recover incomplete transactions manually, as follows:
1. Use CleanupFd with the -standalone option, in addition to any other required options. This option allows CleanupFd to run without a lock server.
Note: When a lock server stops, the locks it manages are lost, so recovery just rolls back uncommitted changes.
2. Change the lock-server host for the federated database; see Changing Lock-Server Hosts.
3. If necessary, start the lock server on the new lock-server host; see Starting a Lock Server.
Alternatively, you can start a new lock server on a host that you have renamed to the hostname of the failed lock-server host.
Manual Recovery From CleanupFd Failures
Objectivity/DB allows only one recovery activity to occur against any given federated database at a time. To accomplish this, Objectivity/DB places a recovery lock on a federated database—each time a recovery operation is run, a file named oorecvr.LCK is placed in the journal directory and is deleted when recovery is complete. While this file exists, all other attempts to run recovery against the same federated database will fail.
If a CleanupFd process fails, the oorecvr.LCK file may be left behind. To recover from this failure, you delete this file by running CleanupFd with the -resetLock option.
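A cautious way to apply this is to reset the lock only when the lock file is actually present, and only after you have confirmed that no other recovery operation is still running. The sketch below assumes that the journal directory is visible as a local path and that -resetLock is combined with -bootFile like the other CleanupFd options; the OBJY variable and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch: clear a stale recovery lock only if the lock file exists.
# OBJY and reset_recovery_lock are illustrative assumptions.
OBJY=${OBJY:-objy}

#   $1  boot-file path
#   $2  journal directory (a locally visible path)
reset_recovery_lock() {
    boot_file=$1
    journal_dir=$2
    lock_file="$journal_dir/oorecvr.LCK"

    if [ ! -e "$lock_file" ]; then
        echo "no recovery lock in $journal_dir"
        return 0
    fi
    # Assumption: the caller has verified that no CleanupFd process
    # is still working on this federated database.
    echo "found $lock_file; resetting"
    "$OBJY" CleanupFd -resetLock -bootFile "$boot_file"
}
```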