Setting Up Storage on a Cluster
When you run a Spark driver application that partitions its jobs across multiple worker nodes in a cluster, it is important to set up the federated database’s file storage in a way that minimizes network traffic across nodes.
Shortcuts for Automating Storage Setup
Objectivity/DB provides the following shortcuts for setting up storage locations that support Spark jobs executing in a cluster:
Running the Cluster Storage Setup Script
Using Predefined Location Preferences
These shortcuts automate tasks that you can perform manually.
Note: These shortcuts apply only when you are setting up a new, empty federated database. When a Spark driver application accesses a pre-existing federated database, it simply continues to use the federation’s existing data files and storage locations.
Understanding Storage for Spark Jobs in a Cluster
When you create a new federated database, you should set up the storage locations for its files so that each Spark worker process executing on a particular node can write its new data files to a storage location on the same node. (See Understanding File Storage for Persistent Data for general information about setting up storage locations.)
Example: Assume that a Spark driver application partitions its jobs among four Spark worker processes, each running on a different node in the cluster. For best performance, any database files created by a particular worker process should be stored in the storage location closest to (on the same host as) that process.
The storage setup therefore includes:
A federated database with a registered storage location on each host (LocationA, LocationB, LocationC, and LocationD).
Location preferences, such as those in the table below, that prioritize the closest registered storage location for each worker process.
 
Location Preferences for Worker1
Rank 1: The storage location on Worker1’s host (LocationA)
Unranked: All other locations in the main storage group (MSG); OK to use if the ranked location is unavailable

Location Preferences for Worker2
Rank 1: The storage location on Worker2’s host (LocationB)
Unranked: All other locations in the MSG; OK to use if the ranked location is unavailable

Location Preferences for Worker3
Rank 1: The storage location on Worker3’s host (LocationC)
Unranked: All other locations in the MSG; OK to use if the ranked location is unavailable

Location Preferences for Worker4
Rank 1: The storage location on Worker4’s host (LocationD)
Unranked: All other locations in the MSG; OK to use if the ranked location is unavailable
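As a concrete illustration, the Worker1 row above could be expressed as location-preference XML in the configuration file on Worker1’s node, along the following lines. This is a sketch only: the <Host> criterion element used to name a specific host is an assumption, and the allowNonPreferredLocations="true" attribute captures the Unranked row (any other MSG location may be used if the ranked one is unavailable). The predefined shortcut described in Using Predefined Location Preferences avoids writing a different version of this on every node.

  <LocationPreferences allowNonPreferredLocations="true">
    <LocationPreferenceRank>         <!-- rank 1: Worker1's host (LocationA) -->
      <Host value="worker1-host"/>   <!-- assumed criterion element; check your release -->
    </LocationPreferenceRank>
  </LocationPreferences>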
Running the Cluster Storage Setup Script
The create_thingspan_fd.sh script automatically creates a federated database on the master node of a cluster, and then sets up a default storage location for each worker node in the cluster.
For information about performing the operations in this script manually, see Creating a Federated Database and Registering Storage Locations.
Before you run the script
Note: This procedure assumes that you have already installed ThingSpan and set the relevant environment variables on each node in the cluster.
1. Verify that every node in the cluster has a directory called installDir/data, where installDir is the ThingSpan installation directory.
2. Verify that a lock server is running on the master node of the cluster. If necessary, start the lock server; see Starting a Lock Server.
3. Verify that AMS is running on every node of the cluster. If necessary, start AMS; see Starting AMS.
4. Verify that your license file is set up in the ThingSpan installation directory of the master node of the cluster. If necessary, set up the license file; see Setting Up a License File.
5. Choose the federated database’s system name, fdSysName. The system name becomes the simple name of the boot file (fdSysName.boot).
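A quick way to spot-check steps 1 through 4 from the master node is sketched below. The oocheckls and oocheckams utilities are the standard Objectivity/DB tools for querying a lock server and AMS; the node names and installDir path are placeholders for your own cluster, and the loop assumes ssh access to each node.

  # Placeholders for your cluster (assumptions for this sketch).
  installDir=/opt/thingspan
  nodes="master worker1 worker2 worker3 worker4"

  # Step 1: confirm the data directory exists on every node.
  for node in $nodes; do
      ssh "$node" "test -d $installDir/data" \
          || echo "Missing $installDir/data on $node"
  done

  # Step 2: confirm a lock server is answering on the master node.
  oocheckls master

  # Step 3: confirm AMS is answering on every node.
  for node in $nodes; do
      oocheckams "$node"
  done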
To set up cluster storage
To set up cluster storage for a federated database called fdSysName:
1. Open a shell prompt on the master node of the cluster.
2. In the shell prompt, execute:
installDir/bin/create_thingspan_fd.sh fdSysName
where installDir is the ThingSpan installation directory.
This script:
Creates the new federation’s system-database file (fdSysName.fdb) and boot file (fdSysName.boot) in the installDir/data directory on the master node.
Sets the master node as the lock-server host for the new federated database.
Registers the installDir/data directory on each worker node as a storage location for the new federated database.
Sets the installDir/data directory on the master node as the journal directory.
Creates a new, empty data file in each default storage location to “seed” the storage selection process when new data is written.
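For orientation, a manual equivalent of these script actions is sketched below. The objy subcommand and option names shown are assumptions modeled on the objy administration tool; treat this as a sketch rather than authoritative syntax, and rely on Creating a Federated Database and Registering Storage Locations for the exact commands in your release.

  # Sketch of the manual equivalent; subcommand and option names
  # are assumptions; verify against your release's reference pages.

  # Create the federation's system-database file and boot file in
  # installDir/data on the master node, with the master node as the
  # lock-server host and journal directory.
  objy CreateFd -fdName fdSysName -fdDirPath /opt/thingspan/data \
      -lockServerHost master -jnlDirPath /opt/thingspan/data

  # Register installDir/data on each worker node as a storage location.
  objy AddStorageLocation -name LocationA \
      -storageLocation worker1::/opt/thingspan/data \
      -bootFile /opt/thingspan/data/fdSysName.boot
  # ...repeat for LocationB, LocationC, and LocationD on the other workers.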
Using Predefined Location Preferences
Predefined location preferences rank local storage locations over remote storage locations. These location preferences are specified in the default Objectivity configuration file machine.config; see Default File for Machine-Wide Settings.
Installing ThingSpan on a worker node automatically makes machine.config available to any worker process running on that node. The example above assumes that ThingSpan has been installed on the master node and all four worker nodes, so five copies of machine.config exist (one for the driver application and one for each of its four worker processes).
The predefined location preferences appear in machine.config as follows:
 
<?xml version="1.0" encoding="UTF-8"?>
<Objectivity>
...
  <LocationPreferences allowNonPreferredLocations="true">
    <LocationPreferenceRank>       <!-- only rank -->
      <LocalHost value="true"/>    <!-- prioritizes local storage -->
    </LocationPreferenceRank>
  </LocationPreferences> 
...
</Objectivity>
 
The single rank with the <LocalHost> element prioritizes all storage locations residing on the same host as the process that loads the preferences. As always, storage locations are considered only if they are registered in the federated database’s main storage group (MSG). The <LocalHost> element serves as a shortcut that enables identical copies of machine.config to be used on each node. (Otherwise, the copy on each node would have to rank a different location explicitly.)
You do not normally need to modify the predefined location preferences for a Spark driver application, although you can optionally specify additional or alternative ranks as you would for any other Objectivity/DB application; see Specifying Application-Specific Location Preferences.
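For example, an application-specific preference that keeps the predefined local-host rank and adds a named fallback host as a second rank might look as follows. This is a sketch: the <Host> criterion element is an assumption, so see Specifying Application-Specific Location Preferences for the criterion elements your release actually supports.

  <LocationPreferences allowNonPreferredLocations="true">
    <LocationPreferenceRank>         <!-- rank 1: local storage first -->
      <LocalHost value="true"/>
    </LocationPreferenceRank>
    <LocationPreferenceRank>         <!-- rank 2: preferred fallback host -->
      <Host value="fallback-host"/>  <!-- assumed criterion element -->
    </LocationPreferenceRank>
  </LocationPreferences>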