Staying with Spark and Python

Having installed a version of Spark (albeit one built for Hadoop 2.7, not 3.0), I want to continue playing with it and leveraging PySpark. I'm thinking this could be a way of deepening my familiarity with the ML side of Python and pandas (which I've never used much). So, some quick findings from yesterday's playing:

You can import the Spark Python API directly into Python, but I don't quite have the environment set up correctly for that yet. I think I can get there, but it was simpler to follow the section of the Spark getting-started guide that deals with running PySpark as an application. That way you use the spark-submit script, which handles the environment for you. The downside was that some aspects of the tutorial were (I think) laid out sort of sloppily. When you use spark-submit, you specify the master, i.e. how to connect to the cluster. I have Hadoop set up pseudo-distributed with YARN as the resource manager, so I can use master=local or master=yarn. The tutorial just left it as 'master', implying that master was a variable set up somewhere, not that I was to replace it with one of the above. So that was one mystery to solve.

Mystery two was/is that when I start my Hadoop cluster, HDFS still comes up in 'safe mode', which I have to leave manually (hdfs dfsadmin -safemode leave). I seem to recall that has to do with a heartbeat of some kind having been received. Have to follow up on that later.


Mystery three: I have to export SPARK_LOCAL_IP=127.0.0.1. There's still something weird going on with my network config, where muskegon.com points to a different address, and I'm unsure what the relationship is between the local network (192.168.3…) and the loopback address (127.0.0.1, or sometimes 127.0.1.1). Have to follow up on that later too.

So I could run a basic set of operations (I uploaded a fraud CSV file, found on Kaggle I think, to my Hadoop input area). That was also something that should've been obvious but that I didn't catch right away: paths default to the Hadoop filesystem, not the local one. I got to the point where I was able to get the CSV file into a DataFrame. I then wasted time thinking there was some relationship between a Spark DataFrame and a pandas DataFrame. There really isn't one.

But in figuring this out, I ran across a neat project called "Sparkling Pandas", which had a YouTube video that helped me get a little better acquainted with how the whole environment works. They aren't active any more and suggest people look at Ibis or the now-better-developed Spark DataFrame API. Also of interest is the Blaze project (continuum.io/blog/blaze). The Sparkling Pandas duo talked about Spark providing a 'directed acyclic graph' (DAG) operation framework, which makes more sense when I think about the TensorFlow package, and even relates to the graph-database and graph-theory stuff I've been dabbling in for the past year. Key point: use Spark to parallelize your operations, i.e. sc.parallelize(…).

Anyway, for now, I'm going to dive into the Spark API for a bit, then back to pandas (I'd like to play with its time-series features). Eventually I want to look at Spark's MLlib (pyspark.mllib), scikit-learn, etc.