Neo4j at Home, datasets, and Kaggle

I’ve decided to put in occasional time at home building some proficiency with Neo4j, the graph database server. It’s trivially easy to install. I first played with it maybe 3-4 years ago and have revisited it every few months since, but I would not say I am particularly good at using it yet.

For home use, I’m going to start with public datasets and see how far I can get with those. I got access to the Panama Papers sandbox, but the usage period expired before I could actually kick the tires on it. Hopefully I can re-up that. The sandbox environments appear to be web-based (cloud hosted?) database instances that you can use without having to run the server on your own machine.

My first attempt with a new dataset came from Kaggle, which has a dataset repository to which folks can contribute. In passing, getting more deeply familiar with Kaggle generally is something I’d also like to do. A search on the Kaggle dataset page for Neo4j turned up something called ‘awesome-datascience’. Without reading it too closely, I downloaded it (it’s a zip), set up a directory for datasets and unzipped it. The authors claim it is a ‘fully functional database’. I’m not sure what that means, since it is not functional the way a server is. It does have data, though. It also has a lot of metadata that appears to describe the origin of the data, but it’s not cleanly formatted. There also appear to be a bazillion zlib files about which I can tell nothing.

One of the things Neo4j does not do so well, in my opinion, is make it easy to use multiple datasets and swap between them. You appear to always have to tell Neo4j, via the config file, where the data you want to use lives. I’ve played with schemes using symbolic links and the like, but it all feels home-brew and clunky. For now, I’ve decided to keep a dedicated directory for storing the datasets and just copy one over the default Neo4j database when I want to use it.

With this dataset (‘awesome-datascience’) I immediately ran into the problem that it was created with an older version of Neo4j. If I understand Kaggle correctly, this is the ‘hottest’ graph dataset they have, so that’s a bit worrying. I altered the Neo4j config file to allow Neo4j to upgrade the data structures to match my version of the server, and with that I was able to get it up and going.

Looking at the metagraph and going back to read the docs, it appears that the creators built this graph from a dataset used for a book called ‘Data Science Solutions’, which has an associated website: https://github.com/awesomedata/awesome-public-datasets The graph is based on a scrape of that site.
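The copy-a-dataset-over-the-default-database routine described above could be scripted roughly as follows. This is only a minimal sketch: the function name `activate_dataset` and both paths are made up for illustration, the location of the default database directory depends on the Neo4j version and install, and the server should be stopped before anything is copied.

```python
import shutil
from pathlib import Path


def activate_dataset(dataset_dir, neo4j_db_dir, backup=True):
    """Copy a stored dataset over the directory Neo4j reads its database from.

    Both arguments are placeholders for wherever the datasets and the
    default database live on a given machine. Stop the server first.
    """
    source = Path(dataset_dir)
    target = Path(neo4j_db_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")

    if target.exists():
        if backup:
            # Keep the previously active database next to the new one,
            # replacing any older backup with the same name.
            backup_dir = target.with_name(target.name + ".bak")
            if backup_dir.exists():
                shutil.rmtree(backup_dir)
            target.rename(backup_dir)
        else:
            shutil.rmtree(target)

    # copytree requires that the target does not exist, which holds here.
    shutil.copytree(source, target)
    return target


# Hypothetical usage; neither path comes from a real install:
# activate_dataset("/home/me/datasets/awesome-datascience",
#                  "/path/to/neo4j/default-database-dir")
```

It is still home-brew, but it keeps the stored copy of each dataset untouched, so any upgrade Neo4j applies only ever changes the active copy.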

So the graph describes relationships between things like categories, access, datasets, catalogs, lists, tools, etc. found on that site. Now to start working with it. Note..
