Seq2Seq

With all the attention being paid to ChatGPT, I've found my own interest drifting back to neural networks and related topics, sometimes connected to work, sometimes not. Recent chats with people led me back to transformers, LSTMs, and the like, and finally landed me on some material on RNN techniques and Seq2Seq in particular. In the back of my head I think about playing out certain chronologies of a life (address history, for instance) and wondering whether models like this would be useful to, say, predict where people will live next. So, back to basic Seq2Seq: this post is just a quick review of a Medium post I came across on the topic:

Encoder-Decoder Seq2Seq Models, Clearly Explained!!

That’s the title of the Medium article by Kriz Moses. It really was very nicely written, and he includes some useful links, in particular to an arXiv preprint by three Google researchers who wrote a seminal paper on the topic. This article, like the arXiv one, is built around understanding Seq2Seq models in their use for translating from language to language (English to French in their example). The article nicely de-mystifies the whole thing. There are a couple of minor details I’m not sure about, and one semi-major aspect that was never articulated but is, I think, pretty important (assuming I’m correct about it). At the end, they point to an interesting variant where you can use these models to do image-to-caption ‘translation’ too.

The key to these models is that they describe model-based transformations of sequences as a set of connected LSTMs and say this is ‘equivalent to unrolling an RNN’, but I think they need to make this point more strongly, since it was the only way I could get around how they handled variable-length sequences. That is, an RNN is a ‘state-preserving’ NN: what it predicts next depends on its current state, which preserves a memory of what it has done before. Assuming some sort of ‘framing’ device for a sequence, a set of records presented can be seen as a sequence. The state-preserving nature of an RNN means that the stream of records handled within a sequence can be drawn, equivalently, as a chain of NN cells (they use LSTMs). With time going from left to right, the first token enters the LSTM at the left, and the second token enters the next LSTM to the right. That’s the same as the RNN: token 1 enters the RNN, the state is preserved and presented to the same RNN at the time the second token enters, and so on. Paying attention to what demarcates the start and stop of a sequence is key. I’m not completely certain, but I think either the start or the stop probably needs to be treated as equivalent to a ‘reset’ button (though if you want to connect sets of sequences, that’s probably not right).

As I said above, there were a couple of minor technical points where the article felt either in error or I missed something. In particular, the designation of the ‘true’ and ‘predicted’ outputs of the RNNs (there are two in this type of model, one referred to as the ‘encoder’ and one as the ‘decoder’). During training, the loss function compares the ‘true’ target sequence with the sequence ‘predicted’ by the decoder, and I think the author mangled some of the specific content there.

I’m quite eager to try this out, because I think there are many sequenced things beyond literal sentences that would be of interest here (I’m curious what it would make of the digits of irrational numbers like pi). This also motivated me to look at word2vec, because the seq2seq framework he demonstrates relies on it, or something similar, as a preprocessing step to handle the input. The original input is a (presumably) massive vector of one-hot encodings of all possible words you want considered.
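Tangentially, the one-hot setup is conceptually just assigning every distinct word an index (a one-hot vector is then all zeros with a 1 at that index). Here's a toy sketch of the vocabulary-indexing step; the corpus line is invented for the demo:

```shell
#!/bin/sh
# Toy vocabulary build for one-hot encoding: list the distinct
# words in a corpus and number them. A word's one-hot vector is
# then a vector of zeros with a 1 at that word's index.
set -e
CORPUS="the cat sat on the mat the cat slept"

# one word per line, distinct, numbered from 0
printf '%s\n' $CORPUS | sort -u | nl -v0 -w1 -s' '
```

On a real corpus the vocabulary is enormous, which is exactly why the article reaches for word2vec-style dense embeddings rather than feeding raw one-hot vectors to the encoder.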

Google Cloud Platform Exploration Day 1

Circumstances and personal interest have led me to explore Google Cloud Platform, and I want to document that experience here. I started with the free trial and then paid for a subscription before I did anything substantial.

First hour: I had signed up for the free trial and of course it timed out before I could do anything more than start up a VM. So I went ahead, bit the bullet, and signed up for the paid account, keeping my instance. GCP has a console from which you can launch things. I started with Cloud Shell, which is apparently its own small VM. It comes with the gcloud package of tools, so I could do things like list my instances and connect to my compute instance. One gotcha: I restarted Cloud Shell (really, re-initialized it), and when you do that it asks what ‘zone’ you want to use. I chose one of the three in US Central. When I then looked at my instance from within Cloud Shell, I struggled to get details on it or connect to it. The issue was that the VM had been started in a different zone, and you need to specify that zone when using gcloud.

Now inside my VM (vanilla Debian): nothing pre-installed but python3 (no symlink to python). I was able to install Jupyter using apt-get, though, and started it up in console mode. I’ll need to recall details like how to work out the right URL to get a notebook up and going.

Graphs Anew

I recently went through an intense three-month ‘spike’ playing with graph data frameworks, specifically looking at Neo4j, Spark’s GraphX, and igraph, for a multi-dimensional benchmarking effort. I’ll have a lot more to say about it, but I wanted to get out there that Neo4j has really come a long way in the past several years: its maturity, performance, capability, and stability all put it ahead of the others, and for now it looks like a real tool instead of a visualization engine for marketing. Tip of my hat to them.

Analytics in a Big Place and a Small Place

I’m old enough that I’ve done ‘analytics’ in many venues and many contexts. The first half of my life was in academia doing research. Within that context, the first half was essentially FORTRAN to the max, with most innovation coming in the form of the hardware supporting it: mainframes to *nix boxes to PCs. The second half of my life has been in essentially commercial venues, and that started in a period of significant change in computing paradigms and software: hello to the web, PHP, R, friendlier service-based paradigms, Hadoop, and on to today’s world of cloud-based platforms, Spark, etc. Some places I’ve been have been ‘small’, some ‘big’, and some (academia) a weird mix of both.

I’ve been wondering lately about the pros and cons of each. The big places are easy to target and make fun of, but I think the usual criticisms about being too inflexible are a little too simplistic. The fact of where I am now is that it really does support a huge variety of ways to get work done. There’s a price, though: you have to get the work done, and not always the work you want to do.

There’s still the serious matter of outsourcing the maintenance of ‘the system’ to specialists, even as one has doubts about their abilities. It’s really not fair to call it a doubt about ability, though. These systems are designed around support specialists who must be agnostic about the individuals and systems they support. They are not part of your team.

While there are many tools to play with at a big place (at least my big place), there’s still an ambient pressure to conform to a specific way of doing things, which can chafe depending on your personality. Maybe it’s my history, or maybe it’s just my personality, but that friction is pretty irritating to me. For example, I’ve had my .bashrc file replaced out from under me, because the system is designed to support people who want to ‘do data science’ without having to handle the day-to-day details of running a computer.

Neo4J at Home, datasets, and Kaggle

I’ve decided to put in occasional time at home building some proficiency with Neo4j, the graph DB server. It’s trivially easy to install. I first played with it maybe 3-4 years ago and have revisited it every few months since. I would not say I am particularly good at using it yet.

For home use, I’m going to start with public datasets and see if I can use those. I got access to the Panama Papers sandbox, but the usage period expired before I could actually kick the tires on it. Hopefully I can re-up that. The sandbox environments appear to be web-based (cloud-hosted?) DB instances that you can use without having to run the server on your own machine.

My first attempt with a new dataset was based off of Kaggle, which hosts a dataset repository to which folks can contribute. (In passing, getting more deeply familiar with Kaggle generally is something I’d also like to do.) A search of the Kaggle dataset page for Neo4j turns up something called ‘awesome-datascience’. Without reading too closely, I downloaded it (it’s a zip), set up a directory for datasets, and unzipped it. The authors claim it is a ‘fully functional database’. I’m not sure what that means, in the sense that it is not functional the way a server is. It does have data, though. It also has a lot of metadata that appears to describe the origin of the data, but it’s not cleanly formatted. There appear to be a bazillion zlib files about which I can tell nothing.

Now, one of the things Neo4j does not do so well, in my opinion, is allow for easy use of multiple datasets and swapping between them. You appear to always have to tell Neo4j via a config file where the data you want to use lives. I’ve played with schemes using symbolic links and the like, but it all feels home-brewed and clunky. For now, I’ve decided to keep a dedicated directory for storing the datasets and just copy one into the default Neo4j database location when I want to use it. With this dataset (‘awesome-datascience’) I was immediately met with complaints that it had been created with an older version of Neo4j. If I understand Kaggle correctly, this is the ‘hottest’ graph dataset they have, so that’s a bit worrying. So I altered the Neo4j config file to allow Neo4j to upgrade the data structures to match my version of the server, and away we go: I was able to get it up and running. Looking at the metagraph and going back to read the docs, it appears that the creators built this graph off of a dataset used for a book called ‘Data Science Solutions’, which has an associated website: https://github.com/awesomedata/awesome-public-datasets. The graph is based on a scraping of that site.
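My copy-a-dataset-in scheme can be sketched as a tiny shell routine. To keep it safe to run, this demo uses throwaway temp directories; on a real install the data directory would be something like /var/lib/neo4j/data/databases/graph.db, and the library directory name is my own invention:

```shell
#!/bin/sh
# Sketch: keep datasets in a library dir and copy one into the
# active Neo4j data directory on demand. All paths here are demo
# stand-ins, not anything Neo4j mandates.
set -e
LIB_DIR=$(mktemp -d)/datasets      # stand-in for ~/neo4j-datasets
DATA_DIR=$(mktemp -d)/graph.db     # stand-in for the active Neo4j db dir

# fake a stored dataset for the demo
mkdir -p "$LIB_DIR/awesome-datascience"
echo "neostore" > "$LIB_DIR/awesome-datascience/neostore.db"

use_dataset() {
    name=$1
    rm -rf "$DATA_DIR"             # clobber whatever was active
    mkdir -p "$DATA_DIR"
    cp -R "$LIB_DIR/$name/." "$DATA_DIR/"
    echo "activated $name"
}

use_dataset awesome-datascience
ls "$DATA_DIR"
```

It's still home-brew, but at least it keeps the pristine copies out of harm's way when Neo4j upgrades the store format in place.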

So the graph describes relationships between things like categories, access, datasets, catalogs, lists, and tools found on that site. Now to start working with it.

Email and another Thing

I left my work computer at work this weekend. I needed to go back in on a very blustery, nearly-snowy/rainy day, but even so, I chose to leave it there. I’m trying to work out ways to balance the time I spend on different things: I want to get back to my private scientific and electronics projects (a home-brew ADC and a small plasma study setup), explore some of the modern scientific programming paradigms more deeply, and play with abstract algebra and algebraic geometry. How can I do that if I’m always going back to get ahead on work? So I left the laptop at work.

When I got home, I spent some time tossing lots of email in the garbage in an OCD-like effort to streamline life. In doing so, I realized that maybe 70% of the unread mail in my gmail account is blog posts from a place called Knoldus. Cool emails about a variety of programming stuff, but some of it is on the edge of how relevant or interesting it is to me, even though it indirectly touches topics I care about. So, to be disciplined, I decided I do not have the time to work through their paradigm of understanding Scala. I was sorely tempted not to do this, as I feel Scala is really the modern language to learn if you want to pick one. I can come back to it later (right?).

Since my sandbox machine now has a decent amount of disk space (1 Tb, with another 1 Tb coming), I can add a bit, so I’m going to install Neo4j (overlap with work, but interesting in its own right) and get back to installing Hadoop and Spark, especially given that I have one more machine that is perfectly functional, also with 1 Tb. It currently runs Windows, but I think I’m going to spin it into a NixOS machine. Can you build a cluster with different flavors of *nix? Guess I’ll find out!

Gary Marcus and Ancient Critiques of Deep learning

I finally got around to reading a paper (for which I do not have a proper reference; maybe that suggests something?) that had been lying about for over a year, from an NYU professor named Gary Marcus. Given the volume of references he makes to himself, it’s shocking that I haven’t blundered into him before.

I liked the way the paper is organized: motivation, points, discussion. The content, though, perhaps because of the sloppy and ego-driven style, made me want to push back on every point. I think his larger point is that Deep Learning has broad weaknesses that have kept it from producing a new race of intelligent robots. Each point has some substance, but it’s not clear to what end; they are not technical problems (and he admits as much). As an example, he argues that there is insufficient integration of other domain expertise into solving a problem. While this is correct, it is really a criticism of how I organize my tools to solve a problem, not of the efficacy of a particular tool. It’s unfortunate, because that particular paragraph is well written and is one of the few extended portions with a pleasant logic and flow. I feel that if he had raised his head a bit and organized the intent of each criticism, it would have been a more interesting and engaging read.

A minor note, but one that caught my eye, was his latching onto the relationship between correlation and causation. It was, again, not well written in a prosaic sense, but the gist was clear: that this distinction is somehow ‘fundamental’. And again, there are missing words. Fundamental to what? To reproducing intelligence? Is that the real goal here?

What struck me is that I know of at least one other recent piece from the past year hitting this same idea, that a weakness of Deep Learning is not mastering causation vs. correlation, and it bothered me there too. That distinction is barely understood among humans, and I’m not convinced it is understood among any random selection of the conscious, living creatures I am otherwise inclined to label ‘intelligent’.

Clones

I spent a little time this weekend playing with cloning hard drives. One of my computers has been saddled with a little 80 Gb drive from day 1, even as its replacement 1 Tb drive has sat in an unopened package for several months. Like many people, the holidays are a time when I think about dotting some ‘i’s and crossing some ‘t’s, and replacing that drive has been a goal for some time.

I had played, in a slightly messy way, with adding a drive to one of my laptops using a cloned-drive approach. That time, I used dd to clone the drive, but then realized that doing so gave the new hardware an identity identical to the existing drive’s. It seems obvious now, but I didn’t realize that the intent behind cloning a drive is typically to actually ‘replace’ it, not to run both concurrently. So I went down the rabbit hole of generating a new UUID, editing grub.cfg, etc., and eventually life worked. I documented much of it, but I’m sure I didn’t get it all down. So I knew that this time I’d probably face a learning curve again (and I wanted to try a different approach).

I ended up doing the following:

  • clone the current hard drive to a bootable USB drive.
  • clone the current hard drive to the new 1 Tb drive.

The idea of saving to a bootable USB drive was intriguing to me, even if the practical benefit seems limited. My guide for this was a page in the Debian wiki nominally on the subject of partclone, an app that lets you clone single partitions. In short, the process was as follows:

  1. On the legacy system, use gparted to create a single bootable partition on a USB drive. One ‘gotcha’ here is that you start by creating a partition table for the drive. There are multiple partition table types, and I chose BSD (as in BSD Unix) since that sounded closest to a Linux-happy system. That turned out to be a bad choice: you want msdos. This isn’t the first time I’ve seen it, but it’s strange to have to rely on something linked to Microsoft to facilitate a Linux operation. Per the recommendation on the wiki, the filesystem type for the partition was ext2. I used a 29 Gb drive.
  2. Mount the usb drive.
  3. Run blkid to identify the UUID values for all devices, notably the USB drive. The USB drive’s UUID is used later to modify the new fstab.
  4. Use dd to create swap space (1 Gb from /dev/zero)
  5. Use rsync to copy everything from / on the legacy system to the USB drive (at its mount point), with options ‘-auv’ (archive, update, verbose).
    1. The rsync was done excluding the contents of many of the legacy system directories (not the directories themselves however):
      1. /mnt/*
      2. /media/*
      3. /sys/*
      4. /tmp/*
      5. /etc/fstab
      6. /proc/*
      7. /lost+found/*
      8. /dev/*
      9. /home/user/[a-zA-Z0-9]*
    2. Edit the new /etc/fstab to reflect the UUID of /
    3. update-grub
    4. grub-install --recheck --root-directory=usb_mountpoint /dev/sdx (where sdx is the USB device)

Now, I did run into some problems, notably when I attempted to run from the USB drive while the hard drive was still attached. I also learned of a surprising amount of flexibility in the current system regarding which device to boot from. As usual, you can specify the boot priority in the BIOS (F2). On boot, you can also press F12 to directly intervene and pick the boot device, overriding the BIOS default. Additionally, grub.cfg, created when you run update-grub, is aware of the boot capabilities of all attached hardware; thus, when you boot from the USB drive, it still has boot entries for the hard drive. In retrospect, I think the old LILO used to do this too.

When I booted from the USB stick, it still ended up mounting the HD. It looked like update-grub had not pushed the UUID to all the relevant places in the menu items for the USB. I edited those manually and then it sort of worked, but I kept getting an fsck error, even though I had set the last field of the drive’s fstab entry to 0, which should theoretically keep fsck from running.

Eventually, I came across some postings that got me to the answer: there’s a command called update-initramfs which needs to be run. I could do this because the fsck failure came at a stage after the disk is mounted but before the X system is fully up; thus, I could not log in at the graphical interface, but I could open a text console. There, I was able to run update-initramfs and, voila!, everything ran happily. I also re-ran update-grub, but I’m not sure when. That’s the one nagging thing that bugs me: did I run it correctly? I should not have had to edit grub.cfg by hand. Maybe if I had run update-initramfs earlier? Well, something for later. For now, yes, I have my system fully available and bootable from that USB drive.

Stage 2: move to larger drive.

Here, I went the Clonezilla route, which seems to have a lot of fans. I was troubled by the poor quality of their website and the stilted quality of their documentation, both of which felt amateurish. But I did the following:

  1. Create a ‘live’ Clonezilla USB drive. The Clonezilla website has directions for this: you use gparted to create a FAT32 partition on the drive and make it bootable. I don’t recall whether I then ran a dedicated Clonezilla app to copy Clonezilla onto the drive or not.
  2. Boot from the Clonezilla USB and run the cloning program. It has lots of options, so you need to follow along with the website as you navigate (Clonezilla supports lots of things, including restoration of backups). The first step, and the easiest to overlook, is to go directly to the menu item labeled something like ‘other’ or similarly vague. After that, the options are more intuitive.
  3. The execution of the cloning process is accompanied by a few questions which are difficult to read (yellow text on a white background). I needed to copy each question into a local emacs session just to make sure I was reading what they were asking before responding.
  4. Once it started, the cloning process took around 10 minutes. I rebooted and resized the partitions on the new drive using gparted while the legacy drive was still installed. Then I powered down, removed the original drive, and rebooted. All looks good!

Because no story can end happily, here’s what happened next. I went to put the new drive into the physical spot inside my machine holding the 80 Gb drive, and removed the existing one. I had earlier noted that the data connector on the legacy drive was weird in that the pins were ‘free-standing’, like real pins, rather than backed by a plastic support (as in USB connectors), and in fact one or two pins had been bent, though I was able to straighten them by hand. Now the SATA cable did not fit the new hard drive! I thought there had been some shift in SATA design and maybe the cable was just old, but I eventually figured out what had happened: the original drive did have a plastic backing, but it had broken off and was stuck inside the cable’s connector. That also explained the bent HD pins. So, I need a new SATA cable (the one I used for the new drive throughout is a 6″ temporary, not suitable to sit inside the box once I close it up). No hurry for that, though, and I think things are okay.

TF, Python, Virtualenv, Disk Space

With a newly upgraded Debian dist (to Stretch; yes, I have one more to go!) I began testing how the environment looks. A couple of things seem different:

  • Jupyter: I guess this is expected to be run from a virtualenv? I don’t recall that, but all the usage links online suggest that as the way to go.
  • Virtualenv. More generally, I went down the rabbit hole of trying to figure out the interplay of pip, virtualenv, and apt-get-installed Python, reading a few links, taking care to figure out which are out of date, and gleaning what’s common among them. Seems like the way to go is to:
    1. Have a distribution-based Python environment (i.e. installed with apt-get).
    2. sudo -H pip3 install --upgrade pip
    3. sudo -H pip3 install virtualenv
    4. From then on, rely on virtualenvs.
  • I’m still a little murky on the -H, which ties the install to a user home directory. Also, at least one link (good but a bit dated) suggested using
    • pip install --user
  • This installs things under a directory like ‘/home/steve/.local/…’. It’s unclear what benefit that brings other than having a local ‘base’ Python from which to extend a virtualenv with “virtualenv --system-site-packages”.
  • For now, I’m just going with vanilla virtualenv.
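As a sanity check of the isolated-environment workflow, here's a minimal runnable sketch. I use the stdlib venv module as a stand-in for the virtualenv tool (same idea, ships with Python 3); --without-pip sidesteps Debian's missing-ensurepip complaint when python3-venv isn't installed:

```shell
#!/bin/sh
# Minimal virtual-environment demo using python3 -m venv as a
# stand-in for virtualenv. --without-pip avoids needing the
# python3-venv/ensurepip package on Debian.
set -e
ENV_DIR=$(mktemp -d)/tf-env

python3 -m venv --without-pip "$ENV_DIR"

# the env gets its own python under its bin/, with its own prefix
"$ENV_DIR/bin/python3" -c 'import sys; print(sys.prefix)'
```

With a real virtualenv (and pip inside it), the usual dance is `. $ENV_DIR/bin/activate` followed by `pip install <package>`, which is where the TensorFlow attempt below picks up.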

Now having happily done this, I look to start up a virtualenv and add tensorflow.

Issues:

  1. Installing tensorflow with pip3 crashes with the disk full. Admittedly I have a small drive (75 Gb), but I know that earlier this year I had TF installed. I thought maybe it was time to bite the bullet and clone my drive onto the 1 Tb drive I’ve had around for this purpose, and spent some time reviewing how I recently cloned a drive on my Flint machine. Then I slept, thought about it over breakfast, and instead decided to proceed as follows:
    1. Figure out exactly where and why disk space was dying (suspicion was root partition). If not curable, then:
    2. Use Gparted Live usb drive to see if I can manipulate unused space to get big enough to work. If not curable there, then:
    3. Mount the new drive, use dd to clone the original to the bigger drive, run update-grub on the new drive from GParted Live (I think), and reboot, hoping I don’t have to manually edit grub.cfg and /etc/fstab as I needed to with Flint.
  2. Well it turns out that the issue was just TMPDIR. I could install just fine by:
  3. TMPDIR=… pip3 install -vvv tensorflow
  4. Next issue: starting up python and importing tensorflow core-dumps with an illegal instruction. Some well-written posts (on GitHub, bookmarked) show via good detective work that since TF 1.6, binaries have been compiled with CPU optimizations that assume a capability called ‘AVX’, which older CPUs don’t support. The solution is to compile from source or use tensorflow==1.5.
  5. I’d like to try the compile from source, so that will be my next effort. This exercise has me thinking about giving a GPU a try. But should I do the source compile first or up my disk space?
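Before pinning versions, you can check the AVX situation up front. The version-picking logic here is my own heuristic (TF publishes no such check), and the helper takes the flags line as an argument so it can be exercised on any machine:

```shell
#!/bin/sh
# Decide whether this CPU can run the post-1.6 TF binaries by
# looking for the avx flag in the kernel's CPU report.
pick_tf() {
    case " $1 " in
        *" avx "*) echo "tensorflow" ;;          # modern wheel should run
        *)         echo "tensorflow==1.5" ;;     # last pre-AVX binary build
    esac
}

# fall back to an empty flag set where /proc/cpuinfo doesn't exist
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
pick_tf "$flags"
```

Combined with the TMPDIR fix above, the no-AVX install line becomes `TMPDIR=… pip3 install -vvv "$(pick_tf "$flags")"`.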

Postscript:

Playing with it a bit more, it looks like I have a site-installed version of TensorFlow (v1.3) for python2.7. So maybe my memory failed me, and I had been working primarily with Python 2; it’s only at work that I’ve fully embraced Python 3, and I’m sure I just confused myself. So if I create and start up a virtualenv with site-packages, I get the site-installed version of TensorFlow for python2.

 

Upgrading an Ancient Debian Install

What better place to keep notes on it.

First, I made a variety of updates to my debconf settings, which had become outdated since my current version (jessie) is so old it’s only part of the Debian ‘archives’ project.

On Aug 11, I started some backups:

  1. /etc and /home tarballed, gzipped, and sent over to traversecity (/home/muskegon/backups/08012019/).
  2. Spent some time reviewing current disk usage on my tiny 75 Gb hard drive. Surprisingly, all is reasonable. If the upgrades go well, I’ll finally put in that 500 Gb drive.
    1. I used gparted as well as various command-line tools, and that sent me down the rabbit hole of figuring out how to use sudo; I’d never had a compelling need, since I’m on a single-user system and I’m not afraid to become root as needed. Sudo is an interesting topic, but no time to go deep on it.
  3. Reviewed and bookmarked open links in my browsers. I’m running Konqueror and Chrome. For the record:
  4. Chrome:
  5. On Konqueror
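Backup item 1 above can be sketched as follows. The demo tars a throwaway directory so it's safe to run; the real run targeted /etc and /home and shipped the result to the other machine, per the path noted above:

```shell
#!/bin/sh
# Demo of the tar-gzip-and-ship backup against a temp dir.
set -e
WORK=$(mktemp -d)
mkdir -p "$WORK/etc"
echo "some-config" > "$WORK/etc/hosts"

STAMP=$(date +%m%d%Y)
BACKUP="$WORK/etc-$STAMP.tar.gz"

# real version: tar czf etc-$STAMP.tar.gz /etc  (and likewise /home)
tar czf "$BACKUP" -C "$WORK" etc

# real version then ships it off-box, e.g.:
#   scp "$BACKUP" traversecity:/home/muskegon/backups/$STAMP/
tar tzf "$BACKUP"
```

Listing the archive afterward (tar tzf) is a cheap sanity check that the tarball actually contains what you think it does before you lean on it for an upgrade.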

Most of these are bookmarked accordingly. The next step is to continue with whichever steps look reasonable, as obtained from here:

https://www.howtoforge.com/tutorial/how-to-upgrade-debian-8-jessie-to-9-stretch/

apt-get update (some errors on older repositories)

apt-get upgrade

Edit /etc/apt/sources.list to point at stretch.

apt-get update (failed on thomas.net (tuxboot))

apt-get dist-upgrade

I saw quite a few warnings about updates that could break things. Most seem okay; I’m a touch concerned about xorg, glibc, and the new Linux kernel packages, with the implication that I need a 3.2 kernel or newer, but I think I have that.
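Retargeting the apt sources at stretch is just a one-word swap. A sketch of that edit, run here against a demo file (the real target is /etc/apt/sources.list, and the repo lines are typical examples, not my actual list):

```shell
#!/bin/sh
# Demo: retarget apt sources from jessie to stretch with sed.
set -e
SOURCES=$(mktemp)   # stand-in for /etc/apt/sources.list
cat > "$SOURCES" <<'EOF'
deb http://deb.debian.org/debian jessie main
deb http://security.debian.org/ jessie/updates main
EOF

sed -i 's/jessie/stretch/g' "$SOURCES"
cat "$SOURCES"

# then, for real:
#   apt-get update && apt-get upgrade && apt-get dist-upgrade
```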

 

Now I need to upgrade glibc, but kdm is running, so I think I need to shut it down first.

Got a notice about a modified sysctl.conf. I kept my mod, which had disabled ipv6; I think that was related to getting Hadoop up and running.

Also with sane.d/dll.conf. I think this has to do with scanners, and I have no idea why that would be different!

ssh_conf

Looks like the Linux kernel will be going from 3.16 to 4.9 (so the earlier note about needing 3.2 or newer is fine).

And at the end:

Errors were encountered while processing:

apt-listchanges
E: Sub-process /usr/bin/dpkg returned an error code (1)

Googling got me to a very fine answer here:

https://unix.stackexchange.com/questions/365899/apt-get-error-apt-listchanges-and-debconf?newreg=d25889e4c8f948d4971a7d4d739b179c

The relevant comment:

I started getting this error when I upgraded my computation server from Debian jessie to Debian stretch.

My problem was that I had (foolishly) manually installed Python 3.5 system-wide before the upgrade to stretch, and that version of Python was ‘masking’ the default stretch Python 3 install. In particular, these factors were at play:

  • My manual v3.5 install had put its python3 symlink into /usr/local/bin
  • The Debian python3 system package had installed symlinks into /usr/bin
  • /usr/local/bin was earlier in my $PATH than was /usr/bin

So, to fix this specific problem, all I had to do was rename the /usr/local/bin/python3 symlink to, e.g., /usr/local/bin/python3-local, and then import debconf worked fine after a python3 invocation.

A more complete solution would probably be a total uninstall of the system-wide manual version of Python 3.5, and re-installing it sandboxed.

So, I did the same, verified that I could import debconf once I ‘unhid’ the Debian python3, and set about doing the upgrade again.
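The shadowing mechanism from that answer is easy to reproduce. This self-contained sketch uses two temp dirs standing in for /usr/local/bin and /usr/bin, with stub scripts in place of the real interpreters:

```shell
#!/bin/sh
# Demo of $PATH shadowing: a python3 in the "local" dir masks the
# "system" one, and renaming it (the fix from the answer) unmasks it.
set -e
LOCAL=$(mktemp -d)   # stand-in for /usr/local/bin
SYSTEM=$(mktemp -d)  # stand-in for /usr/bin

printf '#!/bin/sh\necho manual-3.5\n' > "$LOCAL/python3"
printf '#!/bin/sh\necho debian-3.5\n' > "$SYSTEM/python3"
chmod +x "$LOCAL/python3" "$SYSTEM/python3"

PATH="$LOCAL:$SYSTEM:$PATH"

python3              # resolves to the "manual" install: prints manual-3.5

# the fix: rename the shadowing binary so lookup falls through
mv "$LOCAL/python3" "$LOCAL/python3-local"
hash -r 2>/dev/null || true   # drop any cached command lookup

python3              # now resolves to the "system" copy: prints debian-3.5
```

Same story as the real machine: nothing was broken in either install; the earlier $PATH entry simply won.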

This time, the upgrade popped through with no changes. I was a bit suspicious but rebooted.

When I came back, I had a couple issues:

  1. No automatic graphical environment. This is due to KDE switching its display manager from kdm to something called sddm. Fix:
    • dpkg-reconfigure sddm
  2. Jupyter had issues because it had previously been installed manually (I think). Now the easiest thing is to install the jupyter-notebook package.

I’ll continue noting as I find issues.