Getting started in standalone mode

This tutorial will get you started with Cloud9 in standalone mode. In standalone mode, you run Hadoop directly on your local machine. Of course, you don't get the benefit of distributing your code across multiple machines, but it's a good starting point for learning about Hadoop. This tutorial assumes you've already downloaded Cloud9 and set it up. See also the companion tutorial on getting started with EC2.

For Windows users: use Cygwin. That's what I mean when I say "open up a shell".

Step 1: Configure Hadoop for standalone mode

This tutorial assumes Hadoop 0.20.1. Make sure you've downloaded and unpacked the Hadoop distribution somewhere. Open up a shell and go to /path/to/hadoop/conf/. Make sure the file core-site.xml doesn't actually specify any configuration parameters; it should look like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Verify the same for hdfs-site.xml and mapred-site.xml; this should already be the case for a clean distribution. With empty configuration files, Hadoop runs in standalone mode.

Later on, you will specify configuration parameters here to connect to a cluster. In that case, you can still override those parameters and force standalone mode from the command line, like this:

hadoop fs -D mapred.job.tracker=local -D fs.default.name=file:/// -ls .

The above example performs a directory listing in standalone mode (which corresponds to a directory listing of the local disk).
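
For reference, connecting to a cluster later amounts to filling in these same two properties in the configuration files. A sketch with placeholder hostnames and ports (adjust for your own cluster) looks like this, with fs.default.name in core-site.xml and mapred.job.tracker in mapred-site.xml:

<!-- core-site.xml (placeholder hostname and port) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml (placeholder hostname and port) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.org:9001</value>
  </property>
</configuration>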

Step 2: Run pi

Open a shell and go to /path/to/hadoop/. Now run the pi demo:

$ bin/hadoop jar hadoop-0.20.1-examples.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
[...]
09/11/18 19:51:57 INFO mapred.JobClient: Job complete: job_local_0001
09/11/18 19:51:57 INFO mapred.JobClient: Counters: 13
09/11/18 19:51:57 INFO mapred.JobClient:   FileSystemCounters
09/11/18 19:51:57 INFO mapred.JobClient:     FILE_BYTES_READ=1725357
09/11/18 19:51:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1926195
09/11/18 19:51:57 INFO mapred.JobClient:   Map-Reduce Framework
09/11/18 19:51:57 INFO mapred.JobClient:     Reduce input groups=20
09/11/18 19:51:57 INFO mapred.JobClient:     Combine output records=0
09/11/18 19:51:57 INFO mapred.JobClient:     Map input records=10
09/11/18 19:51:57 INFO mapred.JobClient:     Reduce shuffle bytes=0
09/11/18 19:51:57 INFO mapred.JobClient:     Reduce output records=0
09/11/18 19:51:57 INFO mapred.JobClient:     Spilled Records=40
09/11/18 19:51:57 INFO mapred.JobClient:     Map output bytes=180
09/11/18 19:51:57 INFO mapred.JobClient:     Map input bytes=240
09/11/18 19:51:57 INFO mapred.JobClient:     Combine input records=0
09/11/18 19:51:57 INFO mapred.JobClient:     Map output records=20
09/11/18 19:51:57 INFO mapred.JobClient:     Reduce input records=20
Job Finished in 2.625 seconds
Estimated value of Pi is 3.14800000000000000000

Okay, so the value of pi is a bit off... but at least Hadoop works!
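
Why is the estimate rough? The pi example scatters sample points in a unit square and counts how many land inside a circle inscribed in it; that ratio approximates pi/4, and with only 10 maps times 100 samples each, you get just 1,000 points. The sketch below is not the example's actual code (the real job splits the sampling across map tasks and uses a more careful quasi-random sampling scheme), but it shows the basic idea:

import java.util.Random;

// Rough sketch of the Monte Carlo idea behind the pi example
// (principle only; the real example distributes the sampling
// across map tasks).
public class PiSketch {
  public static void main(String[] args) {
    long samples = 1000;   // 10 maps x 100 samples, as in the run above
    long inside = 0;
    Random r = new Random();
    for (long i = 0; i < samples; i++) {
      double x = r.nextDouble();   // random point in the unit square
      double y = r.nextDouble();
      if (x * x + y * y <= 1.0) {  // inside the quarter circle of radius 1
        inside++;
      }
    }
    // Quarter-circle area / square area = pi/4, so scale the ratio by 4.
    System.out.println("Estimated pi = " + 4.0 * inside / samples);
  }
}

Monte Carlo error shrinks only with the square root of the number of samples, so increasing the number of maps or samples per map tightens the estimate.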

Step 3: Unpack some data and build the job jar

Now we're getting ready to run the word count demo. Open a shell and go to Cloud9/data/. Uncompress the sample text collection (the Bible and the complete works of Shakespeare):

$ gunzip bible+shakes.nopunc.gz

Now let's build a job jar for running the word count demo. Open a shell and go to Cloud9/. Build the library using Ant with the simple command:

$ ant

You should now see cloud9.jar in your current directory.
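
If you want to double-check that the demo class made it into the jar, you can list its contents with the standard jar tool:

$ jar tf cloud9.jar | grep DemoWordCount

You should see edu/umd/cloud9/example/simple/DemoWordCount.class (and possibly its inner classes) in the listing.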

Step 4: Build and run the word count demo

Once you have created the jar with ant, you should be able to run the word count demo in standalone mode. Run the class without arguments to find out its command-line usage:

$ hadoop jar cloud9.jar edu.umd.cloud9.example.simple.DemoWordCount
usage: [input-path] [output-path] [num-reducers]
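
If you're curious what the demo does internally, it follows the standard Hadoop word count pattern: the mapper tokenizes each input line and emits (word, 1) pairs, and the reducer sums the counts for each word. The sketch below uses the Hadoop 0.20 API and is not necessarily identical to DemoWordCount's actual source, but it shows the shape of the code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Standard word count pattern (sketch only; see DemoWordCount in the
// Cloud9 source for the actual implementation).
public class WordCountSketch {

  // Emits (word, 1) for every token in each input line.
  public static class MyMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word.
  public static class MyReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      sum.set(total);
      context.write(key, sum);
    }
  }
}

The non-zero combine counters in the job output below suggest the demo also runs a combiner, which for word count is conventionally the reducer class reused to pre-aggregate counts on the map side.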

Now run the code on the sample text collection:

$ hadoop jar cloud9.jar edu.umd.cloud9.example.simple.DemoWordCount data/bible+shakes.nopunc wc 1
10/07/11 22:25:42 INFO simple.DemoWordCount: Tool: DemoWordCount
10/07/11 22:25:42 INFO simple.DemoWordCount:  - input path: data/bible+shakes.nopunc
10/07/11 22:25:42 INFO simple.DemoWordCount:  - output path: wc
10/07/11 22:25:42 INFO simple.DemoWordCount:  - number of reducers: 1
[...]
10/07/11 22:25:48 INFO mapred.JobClient: Counters: 12
10/07/11 22:25:48 INFO mapred.JobClient:   FileSystemCounters
10/07/11 22:25:48 INFO mapred.JobClient:     FILE_BYTES_READ=22907000
10/07/11 22:25:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=5867160
10/07/11 22:25:48 INFO mapred.JobClient:   Map-Reduce Framework
10/07/11 22:25:48 INFO mapred.JobClient:     Reduce input groups=41788
10/07/11 22:25:48 INFO mapred.JobClient:     Combine output records=128253
10/07/11 22:25:48 INFO mapred.JobClient:     Map input records=156215
10/07/11 22:25:48 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/07/11 22:25:48 INFO mapred.JobClient:     Reduce output records=41788
10/07/11 22:25:48 INFO mapred.JobClient:     Spilled Records=170041
10/07/11 22:25:48 INFO mapred.JobClient:     Map output bytes=15919397
10/07/11 22:25:48 INFO mapred.JobClient:     Combine input records=1820763
10/07/11 22:25:48 INFO mapred.JobClient:     Map output records=1734298
10/07/11 22:25:48 INFO mapred.JobClient:     Reduce input records=41788
10/07/11 22:25:48 INFO simple.DemoWordCount: Job Finished in 5.345 seconds

There should now be a new sub-directory in your current directory called wc/ that contains the output of the word count demo:

$ head wc/part-r-00000
&c      70
&c'     1
''all   1
''among 1
''and   1
''but   1
''how   1
''lo    2
''look  1
''my    1

$ tail wc/part-r-00000
zorites 1
zorobabel       3
zounds  20
zuar    5
zuph    3
zur     5
zuriel  1
zurishaddai     5
zuzims  1
zwaggered       1

$ wc wc/part-r-00000 
   41788   83576  447180 wc/part-r-00000
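
The output consists of tab-separated (word, count) pairs, sorted by word. A quick way to see the most frequent words is to re-sort by the count column using standard Unix sort:

$ sort -k2,2 -n -r wc/part-r-00000 | head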

And that's it! Now you're ready to run a real MapReduce cluster.