Objectivity/DB Spark Adapter : Advanced Topics : Setting Up Storage on HDFS
Setting Up Storage on HDFS
When a Spark driver application needs to write and read very large volumes of persistent data, you can arrange for the application to access storage on a Hadoop cluster. A Hadoop cluster consists of networked host computers running the Hadoop Distributed File System (HDFS). HDFS is designed for efficiently handling petabytes of data in very large data files.
Understanding Storage on HDFS
You arrange to store persistent data in a Hadoop cluster by creating an Objectivity/DB federated database on that cluster. As usual, storage consists of a system-database file, a boot file, one or more database files, and one or more journal files.
As shown in Figure 5‑1, all of these files can be stored on the Hadoop Distributed File System (HDFS) except for the journal files, which must reside directly on a file system other than HDFS. The journal directory (storage location for the journal files) is typically on the same host as the lock server and the client application, but could be on some other host running AMS.
Figure 5‑1 Typical Storage Setup on a Hadoop Cluster
When a federated database is created on a Hadoop cluster, HDFS manages all of the storage locations for the database files. Accordingly, you do not need to set up storage locations as you would for a federated database on a non-HDFS file system.
Note: A Hadoop cluster may, but need not, consist of the same set of networked machines as the Spark cluster on which your Spark driver application will run.
Deciding Whether to Use HDFS
HDFS is optimized for applications that “write once, read many times”—that is, for applications that primarily ingest data, persist it, and then read it back into memory, typically by performing queries. If your Spark driver application will need to make frequent random-access write operations to its persistent data, you should consider storing data files directly on the file systems of the computers running the cluster’s worker nodes, unless the files are too large to be managed efficiently on these file systems.
Another consideration is that using HDFS means you cannot coordinate the storage locations of a federated database’s data files with the Spark worker processes that access them—that is, you cannot set up the federated database so that each worker process writes its data files locally on the same host, as described in Setting Up Storage on a Cluster.
Instead, when HDFS manages the storage locations of a federated database’s data files, worker processes are likely to access data files stored on remote hosts, which can adversely affect performance. However, any performance deficit is usually offset if the data files become too large to be managed efficiently without HDFS.
Note: You must create new storage directly on HDFS; an existing federated database cannot be moved to HDFS from a non-HDFS file system.
Specifying Objectivity/DB Files on HDFS
To access a file belonging to an Objectivity/DB federated database stored on HDFS, you typically use a URI of the following format:

hdfs://nameNode:port/path

hdfs://
Prefix indicating the use of HDFS.
nameNode
Symbolic name or IP address of the host running the NameNode service in the Hadoop cluster.
port
Port number associated with the NameNode service running on nameNode.
path
Path to a file stored in a directory created on HDFS. If the path has multiple components, separate them with a slash ( / ).
Example The following URI identifies the boot file myFd.boot in the directory /projects/bigData, which is managed by an HDFS name node hostA:

hdfs://hostA:9000/projects/bigData/myFd.boot
Note:Some Objectivity/DB tools provide pairs of options for specifying the host components (hdfs://nameNode:port) separately from the path.
Alternative Naming Format
You can specify the name of a boot file or a storage location in an alternative host::path format, which separates the host components from the path using a double colon:

hdfs://nameNode:port::path

Example The following pathname in host::path format identifies the same boot file as the previous example:

hdfs://hostA:9000::/projects/bigData/myFd.boot
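The two naming formats differ only in how the host components are joined to the path. The following Python sketch is illustrative only (hostA, port 9000, and the boot-file path are taken from the examples in this chapter; the :: placement follows the description above):

```python
def hdfs_uri(name_node, port, path):
    """Compose a standard HDFS URI: hdfs://nameNode:port/path."""
    return "hdfs://{}:{}/{}".format(name_node, port, path.lstrip("/"))

def hdfs_host_path(name_node, port, path):
    """Compose the alternative host::path form, which separates the
    host components from the path with a double colon."""
    return "hdfs://{}:{}::/{}".format(name_node, port, path.lstrip("/"))

# Both names identify the same boot file:
print(hdfs_uri("hostA", 9000, "/projects/bigData/myFd.boot"))
# hdfs://hostA:9000/projects/bigData/myFd.boot
print(hdfs_host_path("hostA", 9000, "/projects/bigData/myFd.boot"))
# hdfs://hostA:9000::/projects/bigData/myFd.boot
```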
Creating a Federated Database on HDFS
Note:This procedure assumes that you have already installed and set up ThingSpan on the master node and on each worker node of your Spark cluster.
To create a federated database on HDFS:
1. Open a command prompt on the master node (or only node) of the Spark cluster. Use this prompt for running tools in the following steps.
2. Run objy CheckLs to verify that the lock server is running on the master node. If necessary, start the lock server; see Starting a Lock Server in Objectivity/DB Administration.
3. Run objy CheckAms to verify that AMS is running on the master node. If necessary, start AMS; see Starting AMS.
Note: AMS does not need to be running on any HDFS nodes.
4. Verify that your Objectivity license file is set up in the ThingSpan installation directory on the master node. If necessary, set up the license file; see Setting Up a License File.
5. Choose the federated database’s system name fdSysName. The system name will be the simple name of the boot file.
6. Run objy CreateFd with the following options:
-fdName fdSysName
System name for the new federated database.
-fdDirHost hdfs://nameNode:port
Host for the system-database file and boot file. Specify an HDFS name node and port number; see Specifying Objectivity/DB Files on HDFS.
-fdDirPath hdfsPath
Path to an HDFS directory in which to create the system-database file and boot file.
-jnlDirHost host
Host for the journal directory. Specify a host running a non-HDFS file system. The host must also be running AMS.
-jnlDirPath path
Path to a non-HDFS directory in which to create the journal directory.
-lockServerHost anyHost
Host of the lock server that is to service the new federated database. May, but need not, be a host in the Hadoop cluster.
Example The following command creates a new federated database on a Hadoop cluster. In this example, hostA is the symbolic name of the host running the NameNode service, and hostB is some other host outside the Hadoop cluster.
objy CreateFd -fdName myFD -fdDirHost hdfs://hostA:9000 -fdDirPath /projects/bigData -jnlDirHost hostB -jnlDirPath /usr/smith/data -lockServerHost hostB
Note that hostA runs its own file system in addition to HDFS, and that file system can be used for the journal directory. In the following command, the journal directory host is set to hostA without the hdfs:// prefix:
objy CreateFd -fdName myFD -fdDirHost hdfs://hostA:9000 -fdDirPath /projects/bigData -jnlDirHost hostA -jnlDirPath /usr/smith/data -lockServerHost hostB
Specifying a Federated Database on HDFS
You specify a federated database to administrative tools by specifying its boot file. When the federated database is stored on HDFS, you simply specify an HDFS URI in the format shown in Specifying Objectivity/DB Files on HDFS.
Example The following command deletes the federated database from the HDFS file system:
objy DeleteFd -bootFile hdfs://hostA:9000/projects/bigData/myFd.boot
Connecting to a Federated Database on HDFS
You specify a federated database to a Spark driver application by specifying its boot file as an option to the Spark reader or writer. When the federated database is stored on HDFS, you simply specify an HDFS URI in the format shown in Specifying Objectivity/DB Files on HDFS.
Example The following Scala code specifies a federated database that was created on HDFS. (The data-source format name com.objy.spark.sql is assumed here; substitute the format name documented for your ThingSpan release.)

spark.read.format("com.objy.spark.sql").
  option("objy.bootFilePath", "hdfs://hostA:9000/projects/bigData/myFd.boot").
  option("objy.dataClassName", "com.thingspan.spark.demo.Customer").
  load
Configuring Authentication for Access from Outside HDFS
If your Spark driver application is running on a Spark master node that is outside the Hadoop cluster, you must provide an HDFS configuration file that contains authentication information for accessing the HDFS server. The authentication information consists of the user name and token that were established when you set up HDFS.
Note: You do not need to set up authentication if you are running your Spark driver application on a node in the Hadoop cluster.
To set up authentication information for HDFS:
1. Create a text file called hdfs.config in your home directory.
2. Edit the file so that it contains the following XML tags with your user name and token:
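The exact tag names in hdfs.config depend on your ThingSpan release and are not reproduced here; the following is only a hypothetical sketch of the shape such a file might take, using the Value-tag style that appears in the plugin specification files and placeholder credentials:

```xml
<!-- Hypothetical sketch; consult your ThingSpan documentation for the
     actual tag names. yourUserName and yourToken are placeholders for
     the user name and token established when HDFS was set up. -->
<Value name="userName" value="yourUserName"/>
<Value name="token" value="yourToken"/>
```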
3. Set appropriate access permissions on the hdfs.config file.
Using a Nondefault Name for the Configuration File
The default behavior is to look for an HDFS configuration file named hdfs.config in your home directory.
You can use a different filename for the HDFS configuration file. To do so, you perform the following steps on each node that runs a Spark driver application or worker process:
1. Find the plugin specification file HdfsDatastore.plugin, which extends the Spark adapter to use HDFS, and open it for editing. The file resides under installDir, where installDir is the ThingSpan installation directory.
2. Change the value associated with configFileName to specify the filename for your .config file.
<Value name="configFileName" value="fileName.config"/>
Note: If you do not want to put the HDFS configuration file in your home directory, you must put a copy of it in the current working directory for the Spark driver application and for each worker process.
Tuning Storage on HDFS
As part of its basic architecture, HDFS splits the files it stores into blocks, which are distributed among the data nodes of a Hadoop cluster. The name node within the cluster maps the blocks onto the data nodes. This architecture is optimized for append-only write operations.
This basic architecture is augmented to support database services such as concurrent access by multiple applications and random-access write operations that perform updates to existing persistent data. Each Objectivity/DB database file is divided into segments, where each segment maps to an HDFS file that is composed of blocks, shown in Figure 5‑2.
Figure 5‑2 Objectivity/DB Database Segments and HDFS File Blocks
Segments provide enough context to perform random-access writes without having to load an entire database file into memory. That is, the segment(s) that are being modified can be identified, read from disk, updated, and written back to disk. The relevant segments are held in a segment cache during these operations. For efficiency, cached segments are written back to disk only when necessary.
Each segment of a database file contains a portion of a single Objectivity/DB container, which is a logical subdivision of a database. A container contains persistent objects that are normally accessed together and therefore must be locked together. An entire container can fit in a single segment, or may be split into multiple segments. When a persistent object in a container is accessed, all of the segments belonging to the container are locked; the lock is communicated to all of the relevant HDFS files.
Parameters You Can Tune
You can tune various aspects of the way the Spark adapter interacts with HDFS. You do this by setting named values in HdfsDatastore.plugin, the plugin specification file that extends the Spark adapter to use HDFS. The file resides under installDir, where installDir is the ThingSpan installation directory.
The following table lists the named values in HdfsDatastore.plugin:
Value Name
Value Description
Size of the segment cache, expressed as the number of segments it can hold.
When the segment cache is full and more space is needed, the blocks in the cached segments are written to HDFS. The optimal segment cache size is generally correlated with the number of segments involved in write transactions. A segment cache that is smaller than the write-transaction size results in unnecessary synchronization of the segment cache to the HDFS disk.
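The effect described above can be illustrated with a toy model (a sketch, not ThingSpan's actual cache implementation): segments flow through an LRU cache, each eviction forces a write-back to HDFS, and any segments still cached at commit are flushed. When the cache is smaller than the write set, every access misses and the write-backs multiply:

```python
from collections import OrderedDict

def count_writebacks(segments_touched, cache_size):
    """Toy LRU model of a segment cache: count how many times a
    segment is written back to HDFS for a sequence of updates."""
    cache = OrderedDict()
    writebacks = 0
    for seg in segments_touched:
        if seg in cache:
            cache.move_to_end(seg)          # reuse the cached segment
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict LRU segment: one write-back
                writebacks += 1
            cache[seg] = True
    return writebacks + len(cache)          # commit flushes what remains

# A transaction that updates 8 segments twice:
print(count_writebacks(list(range(8)) * 2, 16))  # cache big enough: 8
print(count_writebacks(list(range(8)) * 2, 4))   # cache too small: 16
```

With a cache of 16 the second pass hits entirely, so only the 8 dirty segments are flushed at commit; with a cache of 4 every access misses and the same transaction costs twice as many write-backs.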
Size of the blocks in an HDFS file, expressed as a number of bytes.
HDFS constraints on block size:
Must be greater than or equal to the minimum block size required by HDFS (dfs.namenode.fs-limits.min-block-size)
Must be a multiple of io.bytes.per.checksum; generally 512 bytes.
Objectivity/DB constraint on block size:
Must be a multiple of the page size. Pages are minimal units of IO within an Objectivity/DB file. Page size is set by the placement model of the federated database. The default page size is 8192 bytes.
Number of HDFS blocks to assign to each database segment.
Time in milliseconds that a Spark driver application waits on a request to an HDFS server, before throwing an exception.
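The block-size constraints listed in the table can be checked with a short sketch. The checksum and page sizes are the values quoted above; the 1 MB minimum block size is an assumed common default for dfs.namenode.fs-limits.min-block-size, so confirm it against your own HDFS configuration:

```python
def valid_block_size(block_size,
                     min_block_size=1048576,   # dfs.namenode.fs-limits.min-block-size (site-specific; 1 MB assumed)
                     bytes_per_checksum=512,   # io.bytes.per.checksum
                     page_size=8192):          # Objectivity/DB default page size
    """Check a candidate HDFS block size against the constraints above."""
    return (block_size >= min_block_size
            and block_size % bytes_per_checksum == 0
            and block_size % page_size == 0)

print(valid_block_size(128 * 1024 * 1024))  # True: 128 MB satisfies all three constraints
print(valid_block_size(4096))               # False: below the minimum and not a page multiple
```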
Some aspects of performance depend on your HDFS setup. For example, write performance varies inversely with the number of data replications: a write operation completes faster when HDFS maintains fewer replicas.