Hadoop again, and then Spark

I wanted to go down the road of compiling Hadoop as a way of installing it, which is why I had all these extra tools like Docker that I'd never used (though they look cool). But as I struggled to get it to compile, I realized installing Hadoop is much simpler: just download a release and edit the configs. Simple as that is, I still got snagged on a few issues, especially once I moved to YARN. I'm using Hadoop 3.0.0-alpha4 (a beta has been released since). Anyway, some of the gotchas:

  1. Port number changes (the NameNode web UI is now on 9870, not 50070).
  2. The infamous failure to replicate data. Part of this was learning how to parse the Hadoop error logs as I tested adding data and running the grep example.
  3. Learned how to turn off IPv6 (even though it wasn't the issue).
  4. How to specify which disk Hadoop uses for its data store (the dfs.namenode.name.dir and dfs.datanode.data.dir settings).
  5. Giving the jobs more memory.
  6. Making sure to create HDFS directories appropriately.
  7. Adding the whitelist entries for environment variables to yarn-site.xml.
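For reference, gotchas 2 and 4 come down to hdfs-site.xml. A sketch of the relevant properties — the /data/hadoop paths are placeholders for wherever your disk space actually is, and dfs.replication of 1 only makes sense on a single-node setup like mine:

```xml
<!-- hdfs-site.xml: point the NameNode and DataNode at the right disk.
     The /data/hadoop paths below are placeholders, not recommendations. -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/datanode</value>
  </property>
  <!-- With a single node, a replication factor above 1 guarantees
       "failed to replicate" complaints. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```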
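Gotchas 5 and 7 are yarn-site.xml and mapred-site.xml territory. Roughly this — the env-whitelist value is the one the Hadoop 3 single-node docs suggest, and the memory numbers are just examples to tune for your machine:

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

<!-- mapred-site.xml: example values only, size to your hardware -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
```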
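And gotcha 6, plus the grep example from gotcha 2, boils down to the sequence from the Hadoop single-node docs — roughly this, assuming you're sitting in the install directory:

```
# one-time, before first start
bin/hdfs namenode -format
sbin/start-dfs.sh && sbin/start-yarn.sh

# create the per-user HDFS home directory first, or jobs fail confusingly
bin/hdfs dfs -mkdir -p /user/$USER

# the grep example: feed it the config files, search for dfs.* properties
bin/hdfs dfs -put etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha4.jar \
    grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*
```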

When 3.0 becomes an official release, I'll give it another try. For now I note that I can't easily add Hive for 3.0.0 (unless I have the energy to give it a whirl at the compilation level).

Then I added Spark. One issue there: when I ran spark-shell, I got lots of errors that stem from not having an existing metastore; usually that's supposed to be something like Hive. If you don't have Hive, Spark creates a directory called metastore_db and uses it, effectively, as its database. I had somehow created a version of it that stayed locked when Spark tried it again, and it clobbered itself. The solution was to delete the metastore_db directory.
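If you hit the same thing, the cleanup is just this, run from wherever you launched spark-shell (derby.log is the companion file the embedded Derby database leaves behind):

```shell
# remove the stale embedded Derby metastore left by a previous spark-shell run
rm -rf metastore_db derby.log
```

The next spark-shell run recreates a fresh metastore_db automatically.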

So for now: I have Hadoop, I can run Hadoop jobs, and I have Spark and can access its toolset.
