Consult your system administrator as needed for the following prerequisites.
A Hadoop cluster running on Linux machines.
Loom 2.0+ has been tested on the following Hadoop distributions. Loom supports MRv2 (YARN) as well as MRv1.
Distributor | Version
Cloudera | CDH 5.1
Hortonworks | HDP 2.1
Teradata | TDH 2.1
Operating Systems: Linux. Loom has been run on Ubuntu, CentOS, RHEL, and SLES.
Browsers: Chrome and Firefox.
Choosing an installation location for Loom
On the cluster
It is recommended that you install Loom on the NameNode, for simplicity in managing permissions. However, Loom can be run on any node in the cluster.
Off the cluster
Loom can also be run outside the cluster, on a machine that can communicate with the Hadoop APIs but is not itself running any Hadoop services (commonly known as an “edge” node).
It is not necessary for users on the machine to be able to access HDFS from the command line, but this machine will need to have a copy of the same Hadoop distribution files as the cluster – in particular, the libraries for Hadoop, Hive, and HCatalog.
Local Username/Permissions
On both the machine where you will be running Loom and on all nodes in the cluster, create a dedicated Linux username for Loom. The username, numeric user ID (UID), and group ID (GID) for the user must be the same across machines.
This user will be referred to as loomuser throughout this document, but it can have any name.
Depending on Loom security settings (see Advanced Configuration > Security), this will be the username interacting directly with Hadoop services.
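One way to confirm that the account is defined consistently is to compare its numeric IDs on each machine. The following is a minimal sketch that assumes the standard `id` utility; the function name check_loom_ids and the example IDs are hypothetical and not part of Loom.

```shell
# check_loom_ids USER UID GID
# Succeeds only if USER exists on this machine with exactly the given
# numeric UID and primary GID. (Hypothetical helper, not shipped with Loom.)
check_loom_ids() {
    user=$1 want_uid=$2 want_gid=$3
    have_uid=$(id -u "$user" 2>/dev/null) || {
        echo "no such user: $user" >&2
        return 1
    }
    have_gid=$(id -g "$user")
    if [ "$have_uid" != "$want_uid" ] || [ "$have_gid" != "$want_gid" ]; then
        echo "$user: uid=$have_uid gid=$have_gid, expected uid=$want_uid gid=$want_gid" >&2
        return 1
    fi
    echo "$user: uid=$have_uid gid=$have_gid OK"
}

# Example (the IDs 1500/1500 are placeholders):
#   check_loom_ids loomuser 1500 1500
```

Running the same call on the Loom machine and on every cluster node should report OK everywhere before you continue.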
Grant loomuser sudo privileges.
This is not absolutely necessary, but if you choose not to do so, you will need access to another username with sudo privileges in order to change ownership of the directory where Loom is downloaded.
On the machine where Loom will be running, grant loomuser ownership of the following local directory:
file:/tmp/loomuser
The default location for local temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.local.scratchdir” property of hive-site.xml.
Set HIVE_HOME, HADOOP_HOME, and HCAT_HOME environment variables for loomuser. These variables should be set permanently for loomuser, or specified in loom-server.sh, but should not just be set for the current shell session.
These variables should be set to the directories that contain the Hive, Hadoop, and HCatalog “lib” directories, respectively, and should NOT have a trailing slash.
The exact values will vary depending on your Hadoop distribution. Examples are below, but you should confirm that the Hive and Hadoop “lib” files are actually located at the paths below.
For TDH, the following additional environment variable is needed
PATH=$PATH:/opt/teradata/jvm64/jdk7/bin
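The requirements for HIVE_HOME, HADOOP_HOME, and HCAT_HOME (set, no trailing slash, containing a “lib” directory) can be checked mechanically. The sketch below is illustrative only; the function name check_home_var is hypothetical, and it verifies only that a “lib” directory exists, not that it holds the right libraries.

```shell
# check_home_var NAME VALUE
# Succeeds if VALUE is non-empty, has no trailing slash, and contains
# a "lib" directory. (Hypothetical helper, not shipped with Loom.)
check_home_var() {
    name=$1 value=$2
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    case $value in
        */) echo "$name must not end with a slash: $value" >&2
            return 1 ;;
    esac
    if [ ! -d "$value/lib" ]; then
        echo "$name has no lib directory: $value/lib" >&2
        return 1
    fi
    echo "$name=$value OK"
}

# Usage, as loomuser in a fresh login shell:
#   check_home_var HIVE_HOME   "$HIVE_HOME"
#   check_home_var HADOOP_HOME "$HADOOP_HOME"
#   check_home_var HCAT_HOME   "$HCAT_HOME"
```

Running the checks in a new login shell, rather than the shell where the variables were first exported, helps confirm that they are set permanently.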
Hadoop Username/Permissions
Grant loomuser read and write access to the following HDFS directory:
hdfs:/user/hive/warehouse
The default location of the Hive warehouse. This may be overridden in the “hive.metastore.warehouse.dir” property of hive-site.xml file.
Create and grant loomuser ownership of the following HDFS directories:
hdfs:/tmp/hive-loomuser
The default location for temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.scratchdir” property of hive-site.xml.
hdfs:/user/loomuser
The home directory for loomuser on HDFS.
Grant loomuser read and write access to any HDFS directories where the user will want to browse, query, or output new data.
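The two directories that loomuser must own can be created with the standard Hadoop file system shell (`hadoop fs -mkdir -p` and `hadoop fs -chown`). The sketch below only prints the commands so that an HDFS superuser can review and run them; the function name print_hdfs_setup is hypothetical, and the paths assume the default locations named above.

```shell
# print_hdfs_setup USER
# Prints (does not run) the HDFS commands that create USER's Hive scratch
# directory and HDFS home directory and give USER ownership of both.
# Assumes the default locations; adjust if hive-site.xml overrides them.
print_hdfs_setup() {
    user=$1
    for dir in "/tmp/hive-$user" "/user/$user"; do
        echo "hadoop fs -mkdir -p $dir"
        echo "hadoop fs -chown $user $dir"
    done
}

# Usage:
#   print_hdfs_setup loomuser
```

Changing ownership in HDFS normally requires the HDFS superuser account, so the printed commands are typically run under that account rather than as loomuser.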
Hive
Install Hive with a multi-user metastore, such as MySQL or PostgreSQL.
If Hive was installed as a demo, it is probably using the default Apache Derby metastore, which is single-user. Your Hadoop distributor should have instructions on switching Hive to use a non-Derby metastore.
Networking
Ports: The port on which Loom will run (8080 by default, but you can specify any port at runtime) must be exposed such that intended users of Loom will be able to access that port through their web browser.
Web Browser
The latest versions of Firefox and Chrome are compatible with Loom. Internet Explorer is not supported.
First-time Installation
That is, on a cluster where Loom has never been installed:
Download and Install Loom
Open an SSH session on the machine where you are going to install Loom.
Create a loom directory wherever you want Loom installed (e.g. /usr/local), transfer ownership to loomuser, and cd into it.
For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/check-setup.sh, in order to include certain native dependencies.
Checking default loom port ........ port '8080' on host 'localhost' ... OK.
Checking availability of datomic transactor port ........ port '4334' on host 'localhost' ... OK.
Checking default Hadoop FileSystem ........ configured to use hdfs://localhost:8020 ... OK.
Checking default Hadoop JobTracker ........ configured to use JobTracker 'localhost' port '50030' ... OK.
Loom is ready to run.
If “default loom port” check fails:
The default port for Loom Server is 8080, but Loom can easily be run on a different port. Instructions are included in the documentation below, starting with the phrase “To run this server on a different port...”
If “availability of datomic transactor” check fails:
This means another application is running on port 4334, 4335, or 4336. If you cannot remove the application, it is possible to configure Loom to start the transactor on a different set of three contiguous ports. Open loom-x.y.z-distribution/lib/datomic/transactor.properties, and set ‘port’ to the first port in the sequence you want to use:
#free mode will use 3 ports starting with this one:
port=
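Because the transactor uses three contiguous ports, it can help to list all three before choosing a new starting port. This is a small illustrative sketch; the function name transactor_ports is hypothetical.

```shell
# transactor_ports FIRST_PORT
# Prints the three contiguous ports that are used when 'port' in
# transactor.properties is set to FIRST_PORT.
transactor_ports() {
    first=$1
    case $first in
        ''|0*|*[!0-9]*)
            echo "not a port number: $first" >&2
            return 1 ;;
    esac
    # The last of the three ports must not exceed 65535.
    if [ "$first" -gt 65533 ]; then
        echo "first port must be 65533 or lower: $first" >&2
        return 1
    fi
    echo "$first $(($first + 1)) $(($first + 2))"
}

# Usage:
#   transactor_ports 4334
```

With the default starting port, the call prints 4334 4335 4336, the same range named in the check-failure description above.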
You may also be seeing this error if you have started Loom on this machine before; as mentioned above, it is only necessary to run checkup.sh before the first time you start Loom. Once you start Loom, the transactor runs as a background process on ports 4334-4336, and will keep running on these ports in between restarts of the Loom server.
If “default Hadoop FileSystem” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Username/Permissions) or HDFS is not running.
If “default Hadoop JobTracker” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Username/Permissions) or JobTracker is not running.
Set Loom’s DistributedCache directory
In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It will default to hdfs:/user/${user.name}/loom-dist-cache unless otherwise changed, where ${user.name} is the name of the user who starts the loom server.
# Sets the location in HDFS where Loom manages the distributed cache that it
# uses to configure MapReduce jobs that it submits. The Loom server process
# must have permission to write in this location.
loom.dist.cache=
IMPORTANT: the value must be BOTH an absolute path (not a URI) for an HDFS folder AND end with a "/". For example:
/user/loom/ ACCEPTABLE
/user/loom NOT ACCEPTABLE
loom/ NOT ACCEPTABLE
hdfs://master:9000/user/loom/ NOT ACCEPTABLE
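The rule for loom.dist.cache values can be expressed as a short check. The sketch below is illustrative and encodes only the two stated conditions (absolute path without a URI scheme, trailing slash); the function name is_valid_dist_cache is hypothetical.

```shell
# is_valid_dist_cache VALUE
# Succeeds only for an absolute path (no URI scheme) that ends with "/".
is_valid_dist_cache() {
    case $1 in
        *://*) return 1 ;;   # a URI such as hdfs://master:9000/user/loom/
        /*/)   return 0 ;;   # absolute and slash-terminated
        *)     return 1 ;;   # relative, or missing the trailing slash
    esac
}

# Usage:
#   is_valid_dist_cache "/user/loom/" && echo ACCEPTABLE
```

Applied to the four examples above, only /user/loom/ passes.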
At this point, if you want to take advantage of Loom’s advanced configuration options, see the “Advanced Configuration” section and complete the relevant steps before proceeding to the next step below.
Start Loom
For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/loom-server.sh, in order to include certain native dependencies.
IMPORTANT: always run the loom-server.sh script from the current distribution directory, e.g. /usr/local/loom/loom-x.y.z-distribution. Loom has certain dependencies that require it to be started from the distribution directory.
These examples use ‘nohup’ plus ‘&’ to run Loom in the background. You can also run Loom from a ‘screen’ window, if you have the ‘screen’ package installed.
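The combination of ‘nohup’, ‘&’, and starting from the distribution directory can be wrapped in a small helper. This is a sketch under assumptions: the function name start_loom and the log file name loom-server.out are hypothetical, and loom-server.sh is invoked here without arguments, so any options you normally pass (such as a non-default port) would need to be added.

```shell
# start_loom DIST_DIR
# Changes into DIST_DIR, starts bin/loom-server.sh in the background with
# nohup, and prints the background process ID. The 'cd' happens inside a
# subshell, so the caller's working directory is left unchanged.
start_loom() {
    dist_dir=$1
    (
        cd "$dist_dir" || exit 1
        if [ ! -x bin/loom-server.sh ]; then
            echo "bin/loom-server.sh not found or not executable in $dist_dir" >&2
            exit 1
        fi
        # Output goes to loom-server.out (a hypothetical name) in DIST_DIR.
        nohup bin/loom-server.sh > loom-server.out 2>&1 &
        echo $!
    )
}

# Usage:
#   start_loom /usr/local/loom/loom-x.y.z-distribution
```

Because the working directory is set inside the function, the server is started from the distribution directory regardless of where the helper is called.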
In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It will default to hdfs:/user/${user.name}/loom-dist-cache unless otherwise changed, where ${user.name} is the name of the user who starts the loom server.
# Sets the location in HDFS where Loom manages the distributed cache that it
# uses to configure MapReduce jobs that it submits. The Loom server process
# must have permission to write in this location.
loom.dist.cache=
IMPORTANT: the value must be BOTH an absolute path (not a URI) for an HDFS folder AND end with a "/". For example:
/user/loom/ ACCEPTABLE
/user/loom NOT ACCEPTABLE
loom/ NOT ACCEPTABLE
hdfs://master:9000/user/loom/ NOT ACCEPTABLE
See the “Advanced Configuration” section in this document for instructions on additional configuration options.
If you are upgrading Loom, you must stop the transactor processes. You can skip this step if you are simply restarting the Loom server, i.e. using the same distribution.
Start the new Loom server. IMPORTANT: always invoke the loom-server.sh script from the distribution directory, e.g. /usr/local/loom/loom-x.y.z-distribution. Loom has certain dependencies that require it to be started from the distribution directory.
Do not log into the Lab Bench or attempt to view or register data before finishing the next section.
Restore Registry
From the new distribution directory, restore the registry, using the backup.json file you created with the previous distribution. By default, the host is localhost and the port is 8080.
See loom-x.y.z-distribution/docs/Loom_Security.txt for details. You will need to restart Loom after making any Loom configuration changes, and restart Hadoop services after making any Hadoop configuration changes.
ActiveScan: Potential Sources
One of Loom’s features is the ability to detect “Potential Sources;” that is, regularly and recursively scan a specified HDFS directory to detect new files, which Loom displays in the ‘Sources’ Home page of the Loom Lab Bench (browser UI), as well as on the ‘Loom’ home page in the ‘Recent Sources’ column.
To turn on ActiveScan: Potential Sources, edit loom-x.y.z-distribution/config/loom.properties:
loom-x.y.z-distribution/config/loom.properties
# Enable active scanning of potential datasets in HDFS.
activeScan.dataset.enabled=true
# Set the top-level directory under which to scan for potential datasets
# in HDFS. May be specified as an absolute hdfs:// URL or a relative
# path that will be resolved against the Loom working directory.
activeScan.dataset.baseDir=loomInput ACCEPTABLE, if loomuser has a configured working directory
By default, Loom is set to scan the specified directory every 60 minutes, but you can change this:
loom-x.y.z-distribution/config/loom.properties
activeScan.dataset.scanIntervalMinutes=
You can also determine the size of the sample Loom will scan from each file, in terms of either number of rows (activeScan.hdfs.parseLines) or number of bytes (activeScan.hdfs.maxBufferSize). Loom will stop scanning as soon as it reaches one of those limits.
# The number of records to parse from a file in HDFS to determine whether it's a potential source.
activeScan.hdfs.parseLines=50
# The maximum amount of data to read into memory from an HDFS file to determine whether it's a potential source.
activeScan.hdfs.maxBufferSize=8388608
Once configuration changes have been made, start or restart the Loom server. Changes will not take effect otherwise.
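Before restarting, it can be useful to confirm what loom.properties actually contains. The sketch below reads a key from a simple key=value file; the function name get_prop is hypothetical, and it assumes no whitespace around the equals sign, as in the excerpts above.

```shell
# get_prop FILE KEY
# Prints the value of KEY from a key=value properties file. Comment lines
# (starting with '#') never match. If KEY appears more than once, the last
# occurrence wins. Fails if KEY is not present.
get_prop() {
    file=$1 key=$2 value= found=1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "$key="*) value=${line#"$key="}; found=0 ;;
        esac
    done < "$file"
    [ "$found" -eq 0 ] && printf '%s\n' "$value"
}

# Usage, from the distribution directory:
#   get_prop config/loom.properties activeScan.dataset.enabled
#   get_prop config/loom.properties activeScan.hdfs.maxBufferSize
```

For reference, the maxBufferSize value shown above, 8388608 bytes, is 8 MiB (8 × 1024 × 1024).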
Custom Metadata Properties
IMPORTANT: Read if you are restarting Loom and using custom metadata properties. If you meet both of the following conditions: 1) You used the Custom Metadata feature of Loom, i.e. removed, edited, or added properties to the CSV(s) in loom-x.y.z-distribution/schema and 2) you are planning to restore the registry which you previously backed up, then you must copy the contents of loom-x.y.z-distribution/schema from the old Loom distribution directory into the new directory. Otherwise, Loom will not be able to restore your registry due to a mismatch in registry structure.
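Copying the schema directory between distributions can be done with a plain recursive copy. A minimal sketch, assuming both distribution directories are on the same machine; the function name copy_schema is hypothetical.

```shell
# copy_schema OLD_DIST NEW_DIST
# Copies the contents of OLD_DIST/schema into NEW_DIST/schema, preserving
# file attributes. Files with the same name in NEW_DIST/schema are
# overwritten.
copy_schema() {
    old=$1 new=$2
    if [ ! -d "$old/schema" ]; then
        echo "missing directory: $old/schema" >&2
        return 1
    fi
    mkdir -p "$new/schema" && cp -Rp "$old/schema/." "$new/schema/"
}

# Usage (both paths are placeholders for your old and new distributions):
#   copy_schema /usr/local/loom/OLD-DISTRIBUTION /usr/local/loom/NEW-DISTRIBUTION
```

Do the copy before restoring the registry, so that the restored data matches the property definitions Loom reads at startup.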
Upon startup, Loom looks for CSVs in the directory loom-x.y.z-distribution/schema, and reads the properties defined therein. In order to remove, edit, or create properties for a given class of entities, you will need to edit the CSVs in loom-x.y.z-distribution/schema directory.
All CSVs must follow the naming format: meta-*.csv. For example: meta-user-extension.csv, meta-customproperties.csv.
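The naming rule can be checked before starting Loom. The following sketch lists any CSV in a schema directory whose name does not match meta-*.csv; the function name list_misnamed_csvs is hypothetical.

```shell
# list_misnamed_csvs SCHEMA_DIR
# Prints every *.csv file in SCHEMA_DIR whose name does not match
# meta-*.csv. Succeeds only if no such file is found.
list_misnamed_csvs() {
    dir=$1 bad=0
    for path in "$dir"/*.csv; do
        [ -e "$path" ] || continue   # no CSV files: the glob stayed literal
        case ${path##*/} in
            meta-*.csv) ;;
            *) echo "$path"; bad=1 ;;
        esac
    done
    [ "$bad" -eq 0 ]
}

# Usage, from the distribution directory:
#   list_misnamed_csvs schema
```

The check covers file names only; it does not validate the columns inside each CSV.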
Each CSV must use the following schema:
Column Name | Description | Examples
type | The entity type that the property is associated with. |
 | Indicates whether the property refers to a single value or a list. | Must be 'one' or 'many'.
meta.attribute.ref/type | Only use this property if meta.attribute/valueType is set to ‘uuid’; otherwise leave as null. meta.attribute.ref/type indicates the type of entity to which meta.attribute/valueType refers. | dataset/Dataset
meta.attribute/unique | ‘value’ or ‘identity’ indicates that this property uniquely identifies the entity; that is, no two entities can share the same value for this property. | Must be null, 'value', or 'identity'.
meta.attribute/index | Indicates that this property should be indexed for fast lookups. | Must be 'TRUE' or 'FALSE'.
meta.attribute/fulltext | Only use this property if meta.attribute/valueType is set to ‘string’; otherwise leave null. This property indicates whether meta.attribute/valueType should be indexed for text searches. [Note: Support for this feature is not included in Loom 1.1.3.] | Must be 'TRUE' or 'FALSE'.
meta.attribute/doc | The label that will be displayed in the Lab Bench; a text string describing the property. | Owned By
An example of correctly formatted custom properties for a Source, Dataset, Process, and Job: