[Mahout] Deploying custom drivers to Mahout

Developing custom drivers on Mahout is fairly straightforward. You can inherit from MahoutDriver for Java drivers and MahoutSparkDriver for Spark drivers.

The Javadoc for MahoutDriver (if you can find it) provides a good summary of how to implement it:

General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run main methods of other classes, but first loads up default properties from a properties file.

To run locally:

$MAHOUT_HOME/bin/mahout run shortJobName [over-ride ops]

Works like this: by default, the file “driver.classes.props” is loaded from the classpath, which defines a mapping between short names like “vectordump” and fully qualified class names. The format of driver.classes.props is like so:

fully.qualified.class.name = shortJobName : descriptive string
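To make the format concrete, here is a minimal sketch of how such a file could be read into a short-name lookup. This is not Mahout's own code: the class name DriverClassesSketch and the method shortNameToClass are made up for illustration, and it only assumes that the file is a standard Java properties file whose value is split on the first colon.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of Mahout.
final class DriverClassesSketch {

    /** Maps each short job name to its fully qualified class name. */
    static Map<String, String> shortNameToClass(String propsText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propsText));

        Map<String, String> result = new LinkedHashMap<>();
        for (String className : props.stringPropertyNames()) {
            // The value looks like "shortJobName : descriptive string";
            // split on the first colon only, the description may contain more.
            String[] parts = props.getProperty(className).split(":", 2);
            result.put(parts[0].trim(), className);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String text =
            "org.apache.mahout.utils.vectors.VectorDumper = vecDump : dump vectors from a sequence file\n";
        // Prints the short name mapped to the class that would be run.
        System.out.println(shortNameToClass(text));
    }
}
```

The key of each entry is the class name and the short name is buried in the value, which is why the sketch inverts the mapping before looking anything up.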

The default properties to be applied to the program run are pulled, by default, from “<shortJobName>.props” (also off of the classpath).

The format of the default properties files is as follows:

  i|input = /path/to/my/input
  o|output = /path/to/my/output
  m|jarFile = /path/to/jarFile
  # etc - each line is shortArg|longArg = value

The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the driver.classes.props file).

Then the class which will be run will have its main called with:

main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });

After all the “default” properties are loaded from the file, any further command-line arguments are taken in, and over-ride the defaults.

So if your driver.classes.props looks like so:

org.apache.mahout.utils.vectors.VectorDumper = vecDump : dump vectors from a sequence file

and you have a file core/src/main/resources/vecDump.props which looks like

  o|output = /tmp/vectorOut
  s|seqFile = /my/vector/sequenceFile

And you execute the command-line:

$MAHOUT_HOME/bin/mahout run vecDump -s /my/otherVector/sequenceFile

Then org.apache.mahout.utils.vectors.VectorDumper.main() will be called with arguments:

{"--output", "/tmp/vectorOut", "-s", "/my/otherVector/sequenceFile"}
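The way defaults and command-line over-rides combine can be sketched in a few lines of Java. This is a simplified illustration rather than the actual MahoutDriver implementation: the class DefaultArgsSketch and its merge method are hypothetical, and it assumes every command-line flag is followed by exactly one value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class DefaultArgsSketch {

    /**
     * defaults: "shortArg|longArg" -> value, as read from the .props file.
     * cli: alternating flag/value pairs from the command line
     *      (assumption: no value-less flags).
     */
    static List<String> merge(Map<String, String> defaults, String[] cli) {
        // Index the command-line flags, e.g. "-s" -> "/my/otherVector/sequenceFile".
        Map<String, String> overrides = new LinkedHashMap<>();
        for (int i = 0; i + 1 < cli.length; i += 2) {
            overrides.put(cli[i], cli[i + 1]);
        }

        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            String[] names = e.getKey().trim().split("\\|");
            String shortFlag = "-" + names[0];
            String longFlag = names.length > 1 ? "--" + names[1] : shortFlag;
            // Keep a default only if neither spelling was given on the command line.
            if (!overrides.containsKey(shortFlag) && !overrides.containsKey(longFlag)) {
                args.add(longFlag);
                args.add(e.getValue().trim());
            }
        }
        // Command-line arguments are passed through unchanged.
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            args.add(e.getKey());
            args.add(e.getValue());
        }
        return args;
    }

    public static void main(String[] unused) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("o|output", "/tmp/vectorOut");
        defaults.put("s|seqFile", "/my/vector/sequenceFile");
        String[] cli = {"-s", "/my/otherVector/sequenceFile"};
        // Prints: [--output, /tmp/vectorOut, -s, /my/otherVector/sequenceFile]
        System.out.println(merge(defaults, cli));
    }
}
```

Using the vecDump.props values from above, the default for output survives while the seqFile default is dropped in favour of the -s value given on the command line.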

You can also deploy it slightly differently: drop the jar into the Mahout home directory and give it a name starting with “mahout-”, e.g. mahout-mydriver.jar.

Hope this helps someone

Setting up Mahout in Linux

A few simple steps to get Mahout running on Linux. This is mostly about the bash script that makes it easy to run.

You’ll need to install Java first, then download and unpack the Mahout distribution.

I then placed it in /usr/local/mahout.

To be able to run Mahout from the path, I placed the following bash script in /usr/local/bin.

Update the paths to match your system:

#!/bin/bash
export MAHOUT_JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/
export MAHOUT_HOME=/usr/local/mahout
export MAHOUT_HEAPSIZE=4000
export MAHOUT_LOCAL=y