virtualenv

virtualenv is advertised as another type of cordoned-off environment for Python, similar to Anaconda environments.

commands:

  1. virtualenv
    1. Creates dir
  2. virtualenv --clear
  3. virtualenv -p python3
  4. virtualenv testvirtualenv
  5. virtualenv --system-site-packages r-tensorflow
    1. Creates directory r-tensorflow with virtualenv. The python in this environment will have access to the system python packages.
  6. virtualenv --system-site-packages mytensorflow
  7. virtualenv --no-pip
  8. virtualenv --extra-search-dir=SEARCH_DIRS
    1. Directory to use to look for pip/setuptools/distribute
  9. activate
  10. source testvirtualenv/bin/activate
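The create-a-directory step above can also be scripted. A minimal sketch using the stdlib venv module, virtualenv's close cousin (virtualenv itself adds extras like --system-site-packages; the target path here is hypothetical):

```python
import os
import tempfile
import venv

# Create a throwaway environment, roughly "virtualenv --no-pip <dir>".
target = os.path.join(tempfile.mkdtemp(), "testvirtualenv")
venv.create(target, with_pip=False)

# The environment gets its own python under bin/ (Scripts/ on Windows).
bindir = "Scripts" if os.name == "nt" else "bin"
print(os.path.isdir(os.path.join(target, bindir)))  # True
```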

Maintaining Ancient Machines

I needed to move one of my old machines (an HP Laptop) to a new hard drive. And although there were hiccups, it went smoothly enough that I did not have to re-install the OS. This is a machine that I do not have dedicated to any particular role at the moment. It’s come in handy when I’ve had work on other machines that needed to be brought down or had to come off of the network. A few months ago I started getting notices that a hard drive failure was imminent. So, here are the steps I took:

  1. use dd to directly copy everything from current drive to new drive. (Took a day).
  2. use uuidgen to get new uuid values for the partitions on the new drive.
  3. use tune2fs -U uid1 /dev/sdaX to set the uuid value for partition X – some details:
    • Note that this failed at first because there were errors in sda1. I had to run e2fsck /dev/sda1. I liberally used lsuid and fsck -l  to check that the new partitions were cool.
    • There was some confusion at first as the original drive was still connected. So I had to play games with forcing the laptop to boot from the USB-mounted new drive; that was fine but I did this when I had not yet changed any UUID which made things confusing as you had no way of knowing whose disk was actually mounted.
    • I almost started, actually I did start, down the rabbit hole of trying to understand how to mount two bootable disks with linux OS’s. This would be the equivalent of a dual boot system which I used to do back in the day with Linux + Windoze. Then good sense came to me and I remembered I didn’t want the original disk.
    • I concluded that the only changes I should need to make would be to grub.cfg and to /etc/fstab, to reflect the new UUID values. I did this from within gparted live. I also think I performed the tune2fs from the same system – I don't recall if I had already removed the old hard drive. I like gparted, but a couple of odd tricks:
      • running e2fsck is done as sudo e2fsck /dev/sdc1 for example.
      • Instead of tune2fs for the swap partition (which is not an ext2/ext3/ext4 fs), you literally recreate the swap space: mkswap /dev/sdbX
      • Then become root on gparted live (use sudo su)
      • mount /dev/sda1 /mnt
      • edit /mnt/boot/grub/grub.cfg
      • edit /mnt/etc/fstab
      • use blkid to find the relevant UUID values.
  4. In the end, it worked like a charm. Some scary moments when I thought I would clobber myself, but it seemed good. Also, I had to install uuidgen on gparted, and I had to do an update to tune2fs (or maybe it was e2fsck) on gparted.
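The grub.cfg/fstab edits in step 3 boil down to swapping the old UUID strings for the new ones (blkid supplies the new values). A toy sketch of that substitution, with made-up UUIDs:

```python
import re

# Hypothetical old/new UUIDs; in practice blkid reports the new values.
OLD_UUID = "11111111-aaaa-4bcd-9ef0-123456789abc"
NEW_UUID = "22222222-bbbb-4bcd-9ef0-123456789abc"

fstab = "UUID=11111111-aaaa-4bcd-9ef0-123456789abc / ext4 errors=remount-ro 0 1\n"
updated = re.sub(re.escape(OLD_UUID), NEW_UUID, fstab)
print(updated, end="")
```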

All in all, some good practice.


xonsh and shell selections

Acting on one of a billion news tips from co-workers, I'm kicking the tires on Xonsh. It is an attempt to weld Python onto a shell, so you get a lot of pythonic computational goodness as though you were in a Python interpreter, but with a more shell-like view into the OS. Hard to tell my thoughts yet – it feels like a reasonable, clever extension of bash, with some thought given to not clobbering itself. For example, bash treats $X and ${X} the same way. Xonsh uses the redundancy to make ${X} an extension that leverages Python's dict notation while still treating the result as an environment variable. Using ${X} in xonsh means X can be any expression that evaluates to a string, which then names the variable. For instance, ${a + b} where a='PYTHON' and b='PATH' returns $PYTHONPATH. This could be useful for iterating over files in a programmatic way. It may seem trivial, but I do like essentially having a full command-line calculator at my fingertips without entering a Python or some other interpreter. I'm still working through the tutorial, so more thoughts later. Oh, and !(command) is nice too: it returns an object with lots of detail about the command that was run, and that object can be evaluated directly as a truth value. So you can do things like if !(ls mybooks), which proceeds if 'mybooks' exists. Compare to the $? idiom in bash.
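The plain-Python equivalent of that ${a + b} lookup is just composing a key for os.environ (the variable value here is made up for illustration):

```python
import os

# Hypothetical value; xonsh's ${a + b} does this same lookup natively.
os.environ["PYTHONPATH"] = "/opt/libs"

a, b = "PYTHON", "PATH"
print(os.environ[a + b])  # /opt/libs
```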


Success at installing TensorFlow into RStudio

So I hit the right GitHub thread (https://github.com/rstudio/keras/issues/434#issuecomment-403200040) that led to being able to install TF in RStudio. The issue had been around the pip maintainers' decision to no longer expose the main() function. When installing Keras or TF in RStudio aimed at a virtualenv, it automatically downloads the latest version of pip and uses that to install the Python packages (TF/Keras in this case). A reader provided a zip file with modified code for 'utils.R' and 'install.R', which get sourced prior to invoking install_tensorflow. The author also indicated that you need to specifically use TF 1.5 (current is 1.6, I think). The new install also accepts Keras as an option to the install_tensorflow command rather than installing it from a separate install_keras command.

What's less clear to me is why this wouldn't be incorporated into the keras or tensorflow packages in R. They closed the issue, but I saw no evidence that the fix became part of the released package.

I did run into one other item which was interesting. Once the above problem was addressed, I was failing due to lack of space. It turns out that install_tensorflow uses temp space to download needed packages, and it was the temp directory which was running out of space. To get around that, I created an .Renviron file which pointed TMPDIR at a directory in my home area (which I created). The R command tempdir() confirmed (after restarting the R session) that this was the temp directory in use, and it all worked thereafter. On accessing the session, I did get this cryptic statement though…

2019-01-20 12:07:11.999849: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1
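Python's tempfile behaves the same way as R's tempdir() here – it consults TMPDIR when picking a temp directory. A sketch, with a hypothetical scratch location:

```python
import os
import tempfile

# Point TMPDIR at a directory we control (hypothetical location).
scratch = os.path.join(os.path.expanduser("~"), "r-scratch")
os.makedirs(scratch, exist_ok=True)
os.environ["TMPDIR"] = scratch

tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
print(tempfile.gettempdir() == scratch)  # True
```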

Plan for a winter weekend

So I want to make sure I understand what the failure was for keras and tensorflow on R again. Then I have options – I think I'll get back to two items: 1) the earthquake/volcano project meant to exercise various things, and 2) the salt-image-classification Kaggle challenge. I also want to re-assess disk usage on my main (single) box, as I've had issues with disk space (oddly, /var). I did get a new disk last year but never bothered to install it…

Here are the steps I'm using.

library(keras)
use_virtualenv('testvirtualenv')
install_keras('virtualenv')
help('virtualenv-tools')
library('reticulate')
virtualenv_list()
virtualenv_root()
virtualenv_remove()

The issue is that (as I recall) at step 3 (install_keras('virtualenv')), RStudio goes out and re-installs pip in the virtualenv. This gets the latest version of pip, which no longer exports the function main().

Along the way, I’m using this as an opportunity to play with virtualenv and reticulate.


A bit of a roadblock with TensorFlow and Keras

Motivated by some of the places I land, I decided to play with using TF and Keras from R. I've already walked through the Python tutorials but want to stretch a bit. This brought me to the land of the Python–R interface, virtualenv, and reticulate. The TF and keras packages in R seem to function by accessing Python either through an assumed Anaconda installation or through virtualenv. I don't want to install Anaconda at this point, so I went the virtualenv route, which was interesting. There is a problem, however: the TensorFlow R package always creates its own virtualenv; I've had no success getting it to use one I pre-set. Also, it always installs the latest version of pip. It does this so that *it* can control the Python modules brought into the virtualenv. However, it wants to directly invoke pip's main() function (I think that was the function), as in "from pip import main". That function is no longer exposed in the latest pip release. Bummer, for the moment. I'd like to try looking at the R code and see if I can prevent it from always installing the latest version of pip. That could solve it.


I also want to go back to two unfinished projects: I've built two web-scraping apps in Python to get volcano and earthquake data. I thought it would be an interesting dataset to test some deep-learning software on. Very unfinished, but I did put them up in Git as well.

Another project came via a Kaggle competition to classify geological (salt) content in rasterized images. I have some ideas there – not in time to compete in the competition, but I feel it's a good playground.

Then I want to go back to installing hadoop and spark on the local machines here.


Playing with Keras via Rstudio

Spent a day or two this weekend working through a credit-card fraud example to have a go at running Keras via RStudio. Along the way I dug into virtualenv, which I hadn't used before.

The example I was working with came from here:

https://tensorflow.rstudio.com/blog/keras-autoencoder.html

The same site had some other blogs that looked possibly interesting. In the example, you use a credit-card fraud data set (transactional, I think). From a distance, the steps are pretty easy, and I've documented them via R Markdown. You first normalize the data so the values all run from 0 to 1. That got me acquainted with purrr and its map functions. The biggest challenge ended up being linking keras and tensorflow to RStudio. These are really Python packages, and R offers a way to link to Python packages. However, you need to be using Anaconda or virtualenv. Although I use Anaconda at work quite a bit, I'm not eager to go to it at home just yet. I prefer to be more of a micromanager at this point, just because it helps me learn things at a deeper level. But I hadn't heard of virtualenv. It appears to be similar to the env functionality of Anaconda. You can create a 'local' (a.k.a. 'virtual') environment with a standalone Python binary and whatever packages you want. It comes with its own pip installer as well. So I installed it,

apt-get install python-virtualenv

You create a virtual environment as follows:

virtualenv testvirtualenv

or

virtualenv --system-site-packages testvirtualenv

The second form lets the virtualenv access the system Python packages as well.

Now in Rstudio, you do

install_keras('virtualenv')

Problem of the day now occurs: RStudio proceeds to create its own virtual environment and then installs tensorflow and keras into it. During installation, it loads pip via an import statement and calls pip's main function. However, as of pip 10.0, main is no longer exposed, and the install fails. Solution: do the install via pip in my own virtualenv, then tell R which virtualenv to use:

virtualenv testvirtualenv

source testvirtualenv/bin/activate

easy_install -U pip

pip install --upgrade tensorflow

use_virtualenv('/home/steve/myvirtualenv')

I can’t remember if I also installed keras there.

Afterwards, the rest of the Keras autoencoder example went reasonably well – after accounting for some rookie mistakes on my part, some of which looked bad until I figured out what was actually wrong. For example, in normalizing the data I had produced a dataset with no data, and I didn't realize it until much later, when creating the model. I need to check that things are going as expected more aggressively. purrr offers a million map functions, which is what I'm using for normalization, and you have to be a little careful about which one to use – not just one that doesn't cause an error.
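The normalization step itself is simple. A plain-Python sketch of the 0-to-1 (min–max) scaling, with a guard against the constant-column case that can silently leave you with a useless dataset:

```python
def min_max(values):
    """Scale a numeric sequence linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column would divide by zero here,
        # producing exactly the kind of empty/NaN data described above.
        raise ValueError("column has no spread; cannot normalize")
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([2, 4, 6]))  # [0.0, 0.5, 1.0]
```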

A section at the end of the tutorial also talks about using CloudML and Google's cloud infrastructure to support tuning of 'hyperparameters' (which drive the training of the net – e.g. the number of epochs). I hadn't approached Google's cloud infrastructure yet, but (not surprisingly) it looks pretty neat too.

Still couldn't get a sensible AUC at the end, though.


Getting Data and Webscraping

Getting back to semi-regular playing with different analytic and data tools out there. This weekend I spent time using Python (3!) to build some data sets and start playing with some Python tools for data handling and modeling (Pandas, scikit-learn). This was inspired by a link I've had up in my browser for a while on web-scraping to get data. So I started by looking for volcanoes (is that the right pluralization of volcano?). I actually started with the Python 2 version and went through using urllib and BeautifulSoup, and got them to play with a nice site from Oregon State listing volcanoes.

Full disclosure: motivating post is

http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/

Volcanoes:

http://volcano.oregonstate.edu

This went pretty well – got data and had it in a Pandas dataframe. Found nice docs on Pandas, their 'cookbook', so I started figuring out how to work with their dataframe. Pretty similar to R dataframes. After getting the data into that format, I started exploring what exactly Pandas could get me (I'm a Python fanatic but oddly never really needed Pandas). That edged me toward playing with scikit-learn and reminded me to go look at these 'regression' methods that somehow never came up in HEP – things like ridge and lasso. They just seem so…'non-fundamental'. Sort of like the deep need in physics to tie a parameter – especially one from a fit – to a physical entity. So the need to rely on 'imposing a penalty' seems like creating a new game that at a deep level is not reality. Physicists are used to dealing with the idea of 'truth', not approximations (pace ansatz).

I did hit on one place where I can see how the fork in the road between science and data science comes into play. In data science, people are using data in any way they can, not really trying to understand it. Thus, getting into trouble with things like collinearity is a hurdle. A physicist would want to understand it, spending time figuring out the collinearity and then doing the appropriate fit, basically orthogonalizing the system. Data scientists figure out ways to get a fit that 'works'. So the idea of a cost – really a kludge – comes into play, because the end-game is a good/predictive fit, not understanding.

Then I thought that to get something interesting, I wanted to bump this against some other data, so I went looking for earthquake data. They seem related.

I found the USGS data for earthquakes. My first attempt to just imitate what I did with the volcano website failed miserably. I spent a day or more trying to figure it out. I could use urllib and BeautifulSoup to access the website, but what I wanted was to access the results of a search button on the site. Because I didn't understand BS well enough, I spent time thinking that it was the source of my troubles. Lots of pointless googling led me to understand that it was the embedded JavaScript features (I think, still not completely sure) which posed too much of a challenge for the combination of libraries I was using. My focus went to urllib (good, but not enough) and then to a package called Selenium, which did great. Selenium lets you interact with the browser right within Python. Literally – its use brings up a real browser when invoked. So I culminated things by finding out how to run a 'headless' version of Chrome. In the process I found a company really dedicated to scraping web data for customers. Neat idea.

So now I have two scripts to get volcano and earthquake data. The link between the two is location (lat/long). Am hoping to use this as a basis for playing more with pandas and scikit-learn.
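At heart, the volcano scrape is pulling text out of HTML tables. A stdlib-only sketch of the idea (BeautifulSoup makes this much more pleasant; the page fragment below is made up):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text found inside <td> cells of a page."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

# Hypothetical fragment of a volcano listing page.
page = "<table><tr><td>Mt. Hood</td><td>Oregon</td></tr></table>"
parser = CellExtractor()
parser.feed(page)
print(parser.cells)  # ['Mt. Hood', 'Oregon']
```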


I will have one more try at using PySpark with YARN.

spark-submit --master yarn test.py

vs

spark-submit --master local test.py

If I submit with yarn but specify local within the code, it works fine.

If I submit with master=local and specify local within the code, it works fine.

If I submit with master=yarn and yarn in the code, it fails, this time with an out-of-space error:

diagnostics: Application application_1510537474959_0005 failed 2 times due to AM Container for appattempt_1510537474959_0005_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2017-11-24 20:02:13.899]No space left on device
java.io.IOException: No space left on device

at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.hadoop.fs.FileUtil.unZip(FileUtil.java:608)
at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:279)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerL


Now, if I run with local as master and yarn in the code, I also get the following message about the name node.

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /user/steve/.sparkStaging/application_1510537474959_0006. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:localhost


But local/local still works…

Now I run with yarn in the submit command (local still in the code), but this time I specify executor memory and the number of executors, and it works.


Staying with Spark and Python

Having put in a version of Spark (albeit aimed at Hadoop 2.7, not 3.0), I want to continue playing with it and leveraging PySpark. I'm thinking this could be a way of upping my familiarity with more of the ML pieces of Python and Pandas (never used it much). So some quick findings from yesterday's playing:

You can import the Spark Python API directly into Python – but I don't quite have the environment set up correctly yet. I think I can, but it was simpler to follow the section of the Spark getting-started guide that deals with running PySpark as an application. That way you use the spark-submit script, which handles the environment for you. The downside was understanding some aspects of the tutorial which were (I think) sort of sloppily laid in there. When you use spark-submit, you specify the master, as in with respect to the cluster. Now, I have Hadoop set up pseudo-distributed with YARN as the resource manager. I can use master=local or master=yarn. The tutorial left it as 'master', with an implication that master is a variable set up somewhere, not that I was to replace it with one of the above. So that was one mystery to solve.

Mystery two was/is that when I start my Hadoop cluster, it still goes into 'safe mode', which I have to leave manually. I recall that has to do with a heartbeat of some kind having been received. Have to follow up on that later.


Mystery 3: I have to export SPARK_LOCAL_IP=127.0.0.1. There's something still weird going on with my network config, where muskegon.com points to a different address, and I'm unsure what the relationship is between the local network (192.168.3…) and the loopback address (127.0.0.1, or sometimes 127.0.1.1). Have to follow up on that later, too.

So I could run a basic set of operations (I uploaded a fraud CSV file – found on Kaggle, I think – to my Hadoop input area). That was also something that should've been obvious but that I didn't catch right away: the filesystem defaults to my Hadoop environment. I got to the point where I was able to get the CSV file into a DataFrame. I then wasted time thinking that there was some relationship between a Spark DataFrame and a Pandas DataFrame. There really isn't one. But in figuring this out, I ran across a neat project called "Sparkling Pandas", which had a YouTube video that helped me get a little better acquainted with how the whole environment works. They aren't active any more and suggest people look at Ibis or the now-better-developed Spark DataFrame. Also of interest is the continuum.io/blog/blaze project. The Sparkling Pandas duo talked about Spark providing a directed-acyclic-graph operation framework, which makes more sense thinking about the TensorFlow package and even the relationship to the graph DB and theory stuff I've been dabbling in for the past year. Key point: use Spark to parallelize your operations. sc.parallelize(…).
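You don't need a cluster to see the parallelize idea; the stdlib does the same split-apply-collect on one machine. A sketch (in Spark, the equivalent would be sc.parallelize(range(5)).map(square).collect(), run across the cluster instead of a thread pool):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Fan the work out over a pool, then collect the results in order.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(square, range(5))))  # [0, 1, 4, 9, 16]
```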

Anyway, for now, I'm going to dive into the Spark API for a bit. Then back to Pandas (I'd like to play with its time-series module). Eventually I want to look at Spark's MLlib, scikit-learn, etc.