I originally wanted to install Hadoop by compiling it from source. That route pulled in extra tooling like Docker, which I'd never used (though it looks cool), but as I struggled to get the build working I realized installing Hadoop is much simpler: just download a release and edit the configs. Simple as that is, I still got snagged on a few issues, especially once I moved to YARN. I'm using Hadoop 3.0.0-alpha4 (a beta has been released since). Anyway, some of the gotchas:
- Port number changes (the NameNode web UI moved to 9870 in Hadoop 3, not 50070).
- The infamous failure to replicate data. Much of fixing this was learning to parse the Hadoop error logs while testing by adding data and running the grep example.
- Learned how to turn off IPv6 (even though it turned out not to be the issue).
- How to specify which disk space Hadoop uses for its data store (the namenode and datanode directory settings in hdfs-site.xml).
- Adding memory to the jobs.
- Making sure to create directories appropriately.
- Adding the whitelist entries for environment variables to yarn-site.xml.
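Several of the gotchas above boil down to a handful of config entries. A minimal sketch (the property names come from the Hadoop 3 docs; the paths and the memory value are just examples to tune per machine):

```xml
<!-- hdfs-site.xml: point the NameNode and DataNode at real disk space -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hadoop/namenode</value><!-- example path -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop/datanode</value><!-- example path -->
</property>

<!-- yarn-site.xml: whitelist environment variables for containers -->
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

<!-- yarn-site.xml: give containers more memory -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value><!-- example value -->
</property>
```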
When 3.0 becomes an official release, I'll give it another try. For now I note that I can't easily add Hive for 3.0.0 (unless I find the energy to give it a whirl at the compilation level).
Then I added Spark. One issue there was that when I ran spark-shell, I got lots of errors stemming from not having an existing metastore; usually that's supposed to be something like Hive. If you don't have Hive, Spark will create a directory 'metastore_db' and use it, effectively, as its database. I had somehow created a version of it that was still held when spark-shell tried again, and it clobbered itself. The solution was to delete the metastore_db directory.
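The fix is a one-liner, run from whichever directory spark-shell was launched in (that's where Spark drops the Derby-backed metastore_db; a derby.log usually lands next to it, so I clear that too):

```shell
# metastore_db is a local Derby database Spark creates when no Hive
# metastore is configured; removing it lets spark-shell start clean.
rm -rf metastore_db derby.log
```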
So for now, I have Hadoop, I can run Hadoop jobs, and I have Spark and can access its toolset.
