So on Muskegon, my playspace for tech stuff, I now have:
- Hadoop 3.0.0-alpha4 in pseudo-distributed mode using YARN.
- Spark 2.2.0 for Hadoop 2.7.0
- TensorFlow
- RStudio
- Jupyter
Missing from the stack are:
- Hive (I think I should wait for the stable Hadoop 3.0.0 release)
- Neo4J
I’d also like to repeat the Hadoop install and then extend it to a two-node cluster.
I’d also like to play with Docker.
I think for now, I’m going to focus on exploring Spark even without the full big-data infrastructure; that way I can get familiar with the API more generally. Second priority is using RStudio and Spark together to tackle some of the Kaggle problems and data I’ve espied.
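As a starting point, here’s a minimal sketch of what that local exploration might look like: a PySpark session running in `local[*]` mode, so no HDFS or YARN is needed. The `train.csv` filename and the `label` column are just placeholders standing in for whatever Kaggle dataset I end up grabbing.

```python
# Minimal local-mode Spark session for API exploration (no cluster required).
# Assumes Spark 2.2.0 with pyspark available; file/column names are placeholders.
from pyspark.sql import SparkSession

# local[*] runs Spark on all cores of this one machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-api-exploration")
         .getOrCreate())

# Read a CSV straight from the local filesystem (hypothetical Kaggle download).
df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Poke at the DataFrame API: schema, a quick aggregation, and a SQL query.
df.printSchema()
df.groupBy("label").count().show()   # assumes the dataset has a 'label' column

df.createOrReplaceTempView("train")
spark.sql("SELECT COUNT(*) AS n FROM train").show()

spark.stop()
```

The same session works just as well inside a Jupyter notebook, which is probably where most of this poking around will happen.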
While doing this, I’ll keep an eye on when the stable Hadoop 3.0.0 release emerges and look for an opportunity to get Neo4J or another similar system up and running.
