Status

So on Muskegon, my playspace for tech stuff, I now have:

  1. Hadoop 3.0.0 alpha4 pseudo-distributed mode using Yarn.
  2. Spark 2.2.0, prebuilt for Hadoop 2.7
  3. TensorFlow
  4. RStudio
  5. Jupyter

Missing from the stack are:

  1. Hive (I think I should wait for a stable Hadoop 3.0.0 release)
  2. Neo4J

I’d also like to repeat the Hadoop install and then extend it to a cluster-of-2 install.

I’d also like to play with Docker.

I think for now, I’m going to aim at exploring Spark even without the whole large-data infrastructure. This way I can get familiar with the API more generally. Second priority is using both RStudio and Spark to tackle some of the Kaggle problems and data I’ve espied.

While doing this, I’ll keep an eye out for the 3.0.0 release of Hadoop and look for an opportunity to get Neo4J or another similar system put up.


Hadoop again, and then Spark

I originally wanted to go down the road of compiling Hadoop as a way of installing it; hence all of the extra tools like Docker, which I’d never used before (though they look cool). But as I struggled to get it to compile, I realized installing Hadoop from the binary release is much simpler: just download and edit the configs. Simple as that is, I still got snagged on a few issues, especially once I moved to YARN. I’m using Hadoop 3.0.0-alpha4 (a beta has been released since). Anyway, some of the gotchas:

  1. Port number changes (the NameNode web UI is now on 9870, not 50070).
  2. The infamous failure to replicate data. Part of this was learning how to parse the Hadoop error logs as I went through adding data and running the grep example.
  3. Learned how to turn off IPv6 (even though it wasn’t the issue).
  4. How to specify where the disk space is that Hadoop uses for its data store (the datanode and namenode dir settings).
  5. Adding memory to the jobs.
  6. Making sure to create directories appropriately.
  7. Adding the whitelist entries for environment variables to yarn-site.xml.
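
For the record, items 4 and 7 boil down to config entries like the following. This is a sketch from my notes; the property names match the Hadoop 3 docs, but the /data/hadoop paths are just example locations:

```xml
<!-- hdfs-site.xml: point the data store at real disk space
     (the /data/hadoop paths are example locations, not my actual ones) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/hadoop/datanode</value>
</property>

<!-- yarn-site.xml: whitelist the environment variables YARN passes to containers -->
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
```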

When 3.0 becomes an official release, I’ll give it another try. For now I note that I can’t easily add Hive for 3.0.0 (unless I have the energy to give it a whirl at the compilation level).

Then I added Spark. One issue there: when I ran spark-shell, I got lots of errors that stem from not having an existing datastore; usually that’s supposed to be something like Hive. If you don’t have Hive, it will create a directory ‘metastore_db’ and use it, effectively, as its db. I had somehow created a version of it that was left behind from an earlier run, and when spark-shell tried again it clobbered itself. The solution was to delete the metastore_db directory.
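
A minimal cleanup sketch of that fix (my own helper, not part of Spark; it assumes you run it from the directory where spark-shell was launched):

```python
import os
import shutil

def clear_stale_metastore(workdir="."):
    """Remove the leftover Derby metastore that spark-shell creates when
    no Hive metastore is configured. Returns the paths it removed."""
    removed = []
    db_dir = os.path.join(workdir, "metastore_db")
    if os.path.isdir(db_dir):
        shutil.rmtree(db_dir)       # the embedded Derby database directory
        removed.append(db_dir)
    log = os.path.join(workdir, "derby.log")
    if os.path.isfile(log):
        os.remove(log)              # Derby's log file, also left behind
        removed.append(log)
    return removed
```

After this, spark-shell just recreates a fresh metastore_db on the next run.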

So for now, I have Hadoop, I can run Hadoop jobs, and I have Spark and can access its toolset.


Hadoop

Going through the steps to install Hadoop in pseudo-distributed mode. The idea is to run on a single server but behave like a cluster. I’ve done it before, maybe 3 years ago. I want to do it again, aiming at playing with Apache Spark. Lots of other items need to be installed:

  • docker: I was unsure at first as there’s no typical deb package out there, but it looks like the Docker folks have their own packages with good instructions. I had also found similar instructions at the linuxbabe site. Given that they largely agree and no other search result ranked higher, I started with linuxbabe’s instructions and then finished off with Docker’s. (Hers had what seemed to me a better set of apt-get prep steps and checks of the OS, etc.) First step, docker-ce, looks good. Neat. I could even use Docker to download an Ubuntu image and start up a shell within it, kind of like a VM. It was not clear to me whether I had to install docker-compose. In the end, I decided to go ahead. It can be pulled from GitHub and the install directions seem pretty clear.
  • maven
  • node.js: a pain to install. Used the instructions at digitalocean (setup_6.x), which provide a script that gets the packages and builds it.
  • libprotobuf9
  • bats
  • cmake
  • zlib
  • java 1.8 (used openjdk-8-jre and -jdk, which needed ca-certificates-java; added jessie-backports, then apt-get install -t jessie-backports openjdk-8-jdk, ca-certificates, etc., forcing the use of the backports versions). Then run update-alternatives --config java to pick the Java one wants.

Now run the start-build-env.sh script within the Hadoop package. This relies on Docker (and apparently installs it if needed). Looking at the BUILDING.txt file shipped with Hadoop, it’s a little murky which steps are needed and how deeply I need to know and work with Maven, or whether I can mostly cut and paste…
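
For my future self, the steps I pieced together boil down to something like this (a sketch based on my reading of BUILDING.txt, skipping the test suite to keep the build time sane):

```shell
# Start the Docker-based build environment shipped with the Hadoop source
./start-build-env.sh

# Inside that environment, build a binary distribution tarball
mvn package -Pdist -DskipTests -Dtar
```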


TF

Continuing to play with TensorFlow. The more complex NN I ran, based on the TF tutorial/example, got to 99.05% accuracy. It’s clear I need a somewhat deeper understanding of the pieces, but it’s coming together a bit. It’s also clear that these examples are very dependent on the convolution kernels, which provide the numeric input to the nodes in each layer. Sweeping the convolution across the image gives a global sense of what the kernel says about the image. So, for example, if a kernel is sensitive to finding a dot surrounded by blank pixels, and the image consists of a bunch of dots, then I’d expect equal weights given to all pixels to correctly label it. Each kernel corresponds to a different feature being assessed across the image, and the weights also get at where in the image the feature from a kernel is found.
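
To make the kernel idea concrete, here’s a toy sketch in plain Python (my own illustration, not from the tutorial): a 3 x 3 “dot detector” kernel swept over a tiny 5 x 5 image. The feature map responds strongly exactly where a bright pixel is surrounded by blank ones.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in TF)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out

# Kernel sensitive to a dot surrounded by blank pixels:
# strong positive center, negative surround (sums to zero on flat regions)
dot_kernel = [[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]]

# 5 x 5 image with a single dot in the middle
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

feature_map = convolve2d(image, dot_kernel)
# The response peaks where the kernel sits centered on the dot,
# and is zero on uniform regions -- that's the "feature being assessed"
```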

My plan is to go onto the next piece of the TF tutorial, focused on mechanics.

Thinking ahead to applying this to graph data.

More Sophisticated TF Example

Started last weekend, continuing now. This example was a 2 layer ReLU.

Took another day to finish it. The final construct is still running (20,000 iterations; on iteration 3200 after maybe 10 minutes, so maybe another hour or so), but it’s performing as advertised. The deeper example forced me to think about some of the constructs and additional topics such as pooling and dropout. I’m not sure what drives the dimensionality of the outputs from each layer. This example used 2 convolutional layers. The first one creates 32 features for a 14 x 14 image, if I’m interpreting that right. It actually creates 32 for the full 28 x 28 image, but pooling then picks the maximum value of each 2 x 2 sub-array. The second layer creates 64 features on that 14 x 14 output, pooled again down to 7 x 7. Then it creates 1024 features from the last array, which are then used to create a 10-feature readout layer. I admit to confusion about the role of the densely connected layer (the one which creates the 1024 final features). But mechanically, the package works very nicely.
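
The shape bookkeeping that confused me, traced in a small sketch (plain Python, following the tutorial’s architecture as I understand it):

```python
def pool2x2(size):
    """2 x 2 max-pooling halves each spatial dimension."""
    return size // 2

size, features = 28, 1                 # MNIST input: 28 x 28 grayscale
size = pool2x2(size); features = 32    # conv1 (SAME padding keeps 28 x 28), pool -> 14 x 14 x 32
size = pool2x2(size); features = 64    # conv2 keeps 14 x 14, pool -> 7 x 7 x 64
flat = size * size * features          # flattened input to the dense layer
dense = 1024                           # densely connected layer: flat -> 1024
readout = 10                           # readout layer: 1024 -> one output per digit
# flat works out to 3136, which is why the dense layer's weight
# matrix in the tutorial is 3136 x 1024
```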


TensorFlow

Plan is to continue playing with TensorFlow today. But also one more backfilling note: I’ve also gained access to Kaggle and GitHub, and started making more disciplined use of Google+. Found an interesting fraud (I think transactional) dataset that another person was playing with on Kaggle (https://www.kaggle.com/dalpozz/creditcardfraud).

Another interesting site was Jesus Barrasa’s WordPress site (“Graph-backed thoughts”), relevant to Neo4J efforts. Speaking of Neo4J, there was a nice post on RNeo4j and visNetwork from Sep 2 on the Neo4j blog:

https://neo4j.com/blog/visualize-graph-with-rneo4j/

When am I going to get around to Apache Spark? Well for now, on to the MNIST TensorFlow tutorial. Maybe PySpark a little later today.

I’m keeping another notebook (TensorFlow with MNIST) to track my efforts to follow along.

TensorFlow

I’ve installed TensorFlow for both python and python3 (I am intimate with python2, not so much with python3, but this gives me a little motivation to use 3). A little fumbling around with the install packages: I had to uninstall backports.weakref and then reinstall it to get a version with a function needed by the python2 version. Then I started going through the tutorial and getting-started pieces; those are pretty well written on TensorFlow’s website. Basically kept a record of their notebook in my own Jupyter notebook (TensorFlow Basics). Went through building and testing a linear regression, then jumped ahead to check that their TensorBoard functionality is working. Work with TensorFlow is being kept in ~/TensorFlowSandbox.
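
The linear regression from the getting-started material, sketched in plain Python just to remind myself what the training loop is doing (no TensorFlow here; the data and learning rate are made up):

```python
# Fit y = w*x + b by gradient descent on mean squared error --
# the same idea TF's getting-started linear regression automates.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]    # made-up data with true w = 2, b = 1

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    # gradients of mean((w*x + b - y)^2) with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
# w and b should now be close to 2 and 1
```

TF replaces the hand-written gradients with automatic differentiation, which is the part worth appreciating.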

Other notes (backfilling here):

Installed Jupyter, which I start at the commandline as root with “jupyter notebook”.

Installed R and RStudio. Historically, I’m more comfortable with CLI interfaces and have shied away from RStudio, but I’m giving it a chance. Used the opportunity to play with RMarkdown (actually kind of nice) and therein documented playing with their neural net package. RMarkdown allows you to ‘publish’ notebooks either to RStudio Connect ($) or to RPubs (free). So now I have a site (http://rpubs.com/stevevejcik) to keep my RMarkdown notebooks. Letting RStudio connect to RStudio Connect/RPubs required libssl-dev (via apt-get) and PKI (via R), and then reconnecting.

Also playing with Shiny. Created a Shiny App and uploaded to https://www.shinyapps.io/.