July 24, 2012

Getting started with examples from "Mahout in Action"

I decided to write this post because I saw several similar questions about how to get started with the examples from the "Mahout in Action" book (I was a technical proofreader for it, and am familiar with the examples ;-).

Preparations

The complete source code of the examples from the book is available in a separate repository on Github, together with short instructions on how to use them.  Please note that the book was written and tested for Mahout 0.5, the stable release that existed at the time of publishing, and the master branch of the repository contains code for this version.  There are also separate branches with code that was modified to work with Mahout versions 0.6 and 0.7 - they are named accordingly.  To obtain the code, you can either use Git or use Github's "download source" functionality.  Here are links for all existing versions: 0.5 (master), 0.6, 0.7 - download and unpack the archive to some location.
To work with the examples, you need to have Apache Maven installed (on Mac OS X or Linux systems it's better to install it from the system's package repository). Maven is used to compile the source code and to create packages. The Maven project can also be imported into your favorite Java IDE - Eclipse, Netbeans, or Idea (I will explain how to use Eclipse, but for other IDEs the process is similar). To use Maven with Eclipse, you need to have the m2eclipse plugin installed - it provides the import and build functionality.
To run the examples from chapter 16, you'll also need to have Apache Zookeeper installed - see the instructions in the README file in the repository - they're pretty detailed.
You also need to download the Mahout distribution to run some examples (usually the ones that involve running the mahout script). Download the file mahout-distribution-<version>.tar.gz and unpack it. You can also download the file mahout-distribution-<version>-src.tar.gz, although this isn't necessary (it contains Mahout's source code).
I just want to mention that Mahout works best on Unix-based systems -- all examples were tested on Mac OS X & Linux. This also applies to Hadoop, so if you're using Windows, it may be better to install Linux in a virtual machine and use it for all work.

Build examples

To be able to run the examples, you need to build packages (jar files). From the directory where the source code of the examples is located (you should have a pom.xml file in this directory), execute the following command:
   mvn package
It will compile the source code and create packages. The compiled packages are stored in the target directory. Several files are created:
  • mia-<version>.jar contains only the examples; to run them you need to specify all dependencies yourself;
  • mia-<version>-jar-with-dependencies.jar contains the examples plus all dependencies - this jar can be run without specifying additional classpath elements;
  • mia-<version>-job.jar contains the examples plus all dependencies, excluding Hadoop -- it should be used for Hadoop jobs.
Use the corresponding package when the book refers to it.
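If you run code with the plain mia-<version>.jar and aren't sure whether a dependency really ended up on the classpath, a small probe like the one below may help. It's only a sketch: the class name ClasspathProbe is made up for this post and isn't part of the book's code, and org.slf4j.LoggerFactory is just a sample class name to look for.

```java
// Hypothetical helper, not part of the MiA repository.
class ClasspathProbe {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isAvailable(String className) {
        try {
            // 'false' means: don't run static initializers, we only test visibility.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample class name; pass your own class names as arguments instead.
        String[] names = args.length > 0 ? args : new String[] {"org.slf4j.LoggerFactory"};
        for (String name : names) {
            System.out.println(name + " -> " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Compile it with javac and run it with the same -cp value that you plan to use for the example itself; a class reported as MISSING would lead to NoClassDefFoundError at run time.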

Importing the examples' source code into Eclipse

Importing the code into Eclipse is very easy - go to the File menu, select the Import... item, then unfold Maven, select Existing Maven Projects from the list and press Next.  Eclipse will ask you where the source code is located - point it to the directory where you unpacked the examples.  Eclipse will analyze pom.xml and display a string like /pom.xml com.manning:mia:0.5:jar; after that you can press Finish.
After the import, the project will be opened in Eclipse, and you can look into the source code, modify the examples if you need to, and execute them (see below).
If you need to, you can also import the source code of Mahout itself into Eclipse. The procedure is similar, but it may not work for all releases - in some cases it will give you an error saying that some plugins aren't covered by m2eclipse. In that case you can select the Ignore item in the Quick fix menu (available via the right mouse button).

How to run examples

You can run the examples either from the command line or directly from Eclipse.

Run from Eclipse

To run an example from Eclipse, select the needed class in the browser on the left, click the right mouse button on it, select Run as..., and select Java Application from the sub-menu.
Take into account that some classes need additional parameters - you can set them by selecting the Run configurations item from the Run as... sub-menu.
For example, the code from chapter 2 expects the file intro.csv to be located in the current directory (the top of the project), while it's actually located together with the source code, so running it without explicit configuration will lead to an error. To fix this problem, you need to specify that the working directory for these examples is in a non-default place - go to Run configurations, and select the Arguments tab in the dialog window. Then change the Working directory parameter from Default to Other, press the Workspace... button, and select the src/main/java/mia/recommender/ch02 directory in the tree view. After that you can press the Run button, and your example will be executed without error.
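To see why the working directory matters, here is a minimal sketch of how a relative file name is resolved. It assumes that the example opens the file by a plain relative name such as "intro.csv"; the class WorkingDirCheck is made up for this post and isn't part of the book's code.

```java
import java.io.File;

// Hypothetical helper, not part of the MiA repository.
class WorkingDirCheck {

    // A relative file name is resolved against the JVM's current working
    // directory, which is exposed through the "user.dir" system property.
    static File resolve(String relativeName) {
        return new File(System.getProperty("user.dir"), relativeName);
    }

    public static void main(String[] args) {
        File csv = resolve("intro.csv");
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Looking for:       " + csv.getPath());
        System.out.println("File exists:       " + csv.isFile());
    }
}
```

If you run it with the same run configuration as the example, it shows which directory Eclipse actually uses and whether intro.csv can be found there.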

Run from command-line

You can run the examples from the command line either by using java directly, or by using Maven's exec plugin.
To run an example with java, you need to put the package with all dependencies on the classpath and specify the name of the class to execute, like this:
   java -cp target/mia-0.5-jar-with-dependencies.jar mia.recommender.ch02.IREvaluatorIntro
But when you run examples this way, you need to rebuild the package whenever you change the code. From this perspective, Maven's exec plugin is handier - it automatically recompiles changed code and executes it without packaging everything once again.  To execute your class, issue the following command (for this example, you need to copy the intro.csv file to the top-level directory, or it will fail):
   mvn exec:java -Dexec.mainClass="mia.recommender.ch02.IREvaluatorIntro"
If your class accepts command-line parameters, you can specify them using the plugin's exec.args parameter:
   mvn exec:java -Dexec.mainClass="mia.recommender.ch02.IREvaluatorIntro" -Dexec.args="src"
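
To illustrate how such parameters reach your code, here is a tiny sketch. The class ArgsEcho is hypothetical (it isn't one of the book's examples); it only prints what main() receives. As far as I know, the exec plugin splits the exec.args string on whitespace, so -Dexec.args="src out" should arrive as two separate elements.

```java
// Hypothetical class, only to show how command-line parameters arrive in main().
class ArgsEcho {

    // Formats the arguments one per line, together with their index.
    static String describe(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < args.length; i++) {
            sb.append("args[").append(i).append("] = ").append(args[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(args));
    }
}
```

Running the same class with java directly and passing the parameters after the class name should give the same output.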

Conclusion

So, I hope that this article helped you to get started with the Mahout in Action examples. Most of the examples should work as described here; some require more work, but you can find instructions for them in the README file in the source code repository.
If you still have questions, I'll try to answer them ;-)

19 comments:

Rutu Mulkar said...

Thanks for your post. Great writeup. This might be a silly question, but where is the output printed when you execute using Maven? I am running the first example in Ch2, and am new to Maven.

Alex Ott said...

It will print everything to the console, although it can print a lot of unnecessary information, like "preparing exec:java", etc. But you can make it quiet with the -q option.
This isn't always handy, though - for example, if there is an error during execution, you'll see only the "Build failed" message. In this case, you'll need to re-run the code without the -q option to see the backtrace.

P.S. For the ch02 examples, you need to copy the intro.csv file into the top-level directory, or the examples won't find it.

Rutu Mulkar said...

I see it embedded in the verbose output now, thanks! I didn’t know where to look for it! :-)

Lakshmana Swamy said...

Thanks. It's helpful.

johnpaul said...

Where do we get the intro.csv file?

Alex Ott said...

it's in repository, in src/main/java/mia/recommender/ch02/intro.csv

Dr Frederic Stahl said...

Thank you Alex, I successfully executed some of the mia examples thanks to your blog!

However, could you please explain how mia knows where to find the mahout libraries? I can't find them anywhere in the mia eclipse project.

Also how would I go about if I wanted to create my own application using Mahout?

many thanks

Fred

Alex Ott said...

Eclipse knows where to find the Mahout libraries because it uses information from Maven, and Maven automatically downloads all necessary dependencies.

If you want to create a new project, create a new Maven project and add the Mahout library as a dependency. You can use org.apache.mahout as groupId, mahout-examples as artifactId, and 0.7 as version.

You can find more information about maven at http://www.sonatype.com/books/mvnref-book/reference/

Fred said...

Hi Alex,

thanks for your help! I have now set up my own maven project and I used the code from listing 2.1 just to see if I can set up a recommender myself. Maven seems to pick up all the dependencies needed to compile the code, however there seem to be some runtime dependencies that are not picked up by maven. I get the following exception at runtime:

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory

Do you have any suggestions?

I really appreciate your help. Mahout seems like a powerful tool.

Fred

Alex Ott said...

It looks like org.slf4j wasn't added to the dependencies - can you check that you have the same dependencies as in the pom.xml of the MiA examples?

ASHOK AGARWAL said...

Hi Alex, I want to run the Chapter 6 mappers and reducers from MiA, but I am not able to. It would be great if you could help with that.

Alex Ott said...

What kind of problems do you have? All commands in the book were checked against Hadoop...

Phani said...

I am trying to run the SimpleKMeansClustering from chapter 7 in windows and I get the following error.
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
I can run the same program fine in linux. But I am not able to run it in windows.
The error is at line
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
path, Text.class, Cluster.class);
I am guessing it is because of the path. I installed cygwin and also added its bin path to the system variables as per someone's suggestion, but still have no luck. I am able to run chapter 2 without any problems. Any help is appreciated.

Alex Ott said...

Hi

Sorry, but I can't help you with Windows - I don't have one. And it's not generally recommended to run Hadoop on Windows (at least version 0.22). It's better to grab a ready-to-use VM with Linux and Hadoop from Cloudera and use it.

Umer said...

I'm trying to run this:

   public static void main(String[] args) throws Exception {
     long startTime, endTime;

     // Load the data set and measure how long it takes
     startTime = System.currentTimeMillis();
     DataModel model = new GroupLensDataModel(new File("ratings.dat"));
     endTime = System.currentTimeMillis();
     System.out.println("Took " + (endTime - startTime) / 1000 + "seconds to build Datamodel");

     RecommenderEvaluator evaluator =
         new AverageAbsoluteDifferenceRecommenderEvaluator();

     // User-based recommender with Pearson similarity and 100 nearest neighbours
     RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
       @Override
       public Recommender buildRecommender(DataModel model) throws TasteException {
         UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
         UserNeighborhood neighborhood =
             new NearestNUserNeighborhood(100, similarity, model);
         return new GenericUserBasedRecommender(model, neighborhood, similarity);
       }
     };

     // Evaluate the recommender and measure how long it takes
     startTime = System.currentTimeMillis();
     double score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05);
     endTime = System.currentTimeMillis();
     System.out.println("Took " + (endTime - startTime) / 1000 + "seconds to evaluate");
     System.out.println(score);
   }

but my print statements don't work, any idea why?

sergio gomez said...

Hello Alex
I need to create a recommender using Mahout and an Oracle table that has the columns id number2 (10), title varchar2 (40) and information varchar2 (2000). Can you help me to create a project that supports this? It should be an item-based recommender.

Alex Ott said...

Hi Sergio

I'm currently not involved in any Mahout activity, so it's better to ask on the mahout-user mailing list...