Getting Data and Webscraping

Getting back to semi-regular playing with the different analytic and data tools out there. This weekend I spent time using Python (3!) to build some data sets and start playing with some Python tools for data handling and modeling (pandas, scikit-learn). This was inspired by a link on web scraping to get data that I'd had open in my browser for a while. So I started by looking for volcanoes (is that the right pluralization of volcano?). I actually started with the Python 2 version and went through using urllib and BeautifulSoup, and got them to play with a nice site from Oregon State listing volcanoes.

Full disclosure: motivating post is

http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/

Volcanoes:

http://volcano.oregonstate.edu
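The scraping step itself is simple enough to sketch. A minimal version of the urllib + BeautifulSoup approach, where the inline HTML snippet below is a made-up stand-in for the real site's markup (the table class and column layout are assumptions):

```python
from urllib.request import urlopen  # only needed for a live fetch
from bs4 import BeautifulSoup

# For the live page you would do something like:
#   html = urlopen("http://volcano.oregonstate.edu").read()
# Here a small inline snippet stands in for the page, since the real
# markup is an assumption.
html = """
<table class="volcano-list">
  <tr><th>Name</th><th>Latitude</th><th>Longitude</th></tr>
  <tr><td>Mount Hood</td><td>45.37</td><td>-121.70</td></tr>
  <tr><td>Crater Lake</td><td>42.94</td><td>-122.11</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="volcano-list").find_all("tr")

# Skip the header row, pull out the cell text for each volcano
volcanoes = []
for row in rows[1:]:
    name, lat, lon = (td.get_text(strip=True) for td in row.find_all("td"))
    volcanoes.append({"name": name, "lat": float(lat), "lon": float(lon)})

print(volcanoes)
```

A list of dicts like this drops straight into a pandas DataFrame with `pd.DataFrame(volcanoes)`.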

This went pretty well – got the data and had it in a pandas DataFrame. Found nice docs on pandas (their 'cookbook'), so started figuring out how to work with their DataFrame, which is pretty similar to R's data frames. After getting the data into that format, I started exploring what exactly pandas could get me (I'm a Python fanatic but oddly never really needed pandas). That edged me toward playing with scikit-learn and reminded me to go look at those 'regression' methods that somehow never came up in HEP – things like ridge and lasso. They just seem so... 'non-fundamental'. Sort of like the deep need in physics to tie a parameter – especially one from a fit – to a physical entity. So the need to rely on 'imposing a penalty' seems like creating a new game that, at a deep level, is not reality. Physicists are used to dealing with the idea of 'truth', not approximations (pace ansatz).
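For a concrete picture of what those penalties do, here's a small scikit-learn sketch on made-up data (the data and alpha values are arbitrary, just for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 5 features, only two of which actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: can zero coefficients out entirely

print(ols.coef_.round(2))
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))
```

The lasso typically drives the irrelevant coefficients exactly to zero, which is the 'imposing a penalty' game in action: the penalty is a modeling choice, not a statement about the underlying physics.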

I did hit on one place where I can see how the fork in the road between science and data science comes into play. In data science, people use data in any way they can, without really trying to understand it. Thus, running into things like collinearities is a hurdle. A physicist would want to understand the collinearity, spend time figuring it out, and then do the appropriate fit, basically orthogonalizing the system. Data scientists figure out ways to get a fit that 'works'. So the idea of a cost – really a kludge – comes into play, because the end game is a good, predictive fit, not understanding.
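To see the collinearity point concretely, here is a toy example (entirely made-up data) where two features are nearly copies of each other. Ordinary least squares splits the signal between them in an unstable way, while the ridge penalty keeps the coefficients small and balanced:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-4, size=200)  # nearly collinear copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)    # the truth depends only on x1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS can assign huge offsetting coefficients to the two columns;
# ridge settles on roughly 0.5 for each, summing to about 1.
print(ols.coef_)
print(ridge.coef_)
```

The physicist's move would be to notice x1 and x2 are the same thing and fit one combined variable; the penalty gets a usable answer without that understanding.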

Then I thought that to get something interesting, I wanted to bump this against some other data, so I went looking for earthquake data. The two seem related.

I found the USGS data for earthquakes. My first attempt to just imitate what I did with the volcano website failed miserably. I spent a day or more trying to figure it out. I could use urllib and BeautifulSoup to access the website, but what I wanted was to access the results of a search button on the site. Because I didn't understand BeautifulSoup well enough, I spent time thinking that it was the source of my troubles. Lots of pointless googling led me to understand that it was the javascript-embedded features (I think – still not completely sure) that posed too much of a challenge for the combination of libraries I was using. My focus went to urllib (good, but not enough) and then to a package called selenium, which did great. Selenium lets you interact with the browser right within Python. Literally – as in, its use brings up a real browser when invoked. So I culminated things by finding out how to run a 'headless' version of Chrome. In the process I found a company really dedicated to scraping web data for customers. Neat idea.

So now I have two scripts to get volcano and earthquake data. The link between the two is location (lat/long). I'm hoping to use this as a basis for playing more with pandas and scikit-learn.
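A sketch of what that lat/long link might look like in pandas (the column names and toy rows are assumptions, not the real scraped data), tagging each earthquake with its nearest volcano by great-circle distance:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two scraped data sets
volcanoes = pd.DataFrame({
    "volcano": ["Mount Hood", "Crater Lake"],
    "lat": [45.37, 42.94],
    "lon": [-121.70, -122.11],
})
quakes = pd.DataFrame({
    "magnitude": [2.1, 4.5, 3.0],
    "lat": [45.40, 40.00, 42.90],
    "lon": [-121.60, -120.00, -122.20],
})

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def nearest_volcano(q):
    """Return the name of the volcano closest to one earthquake row."""
    d = haversine_km(q["lat"], q["lon"], volcanoes["lat"], volcanoes["lon"])
    return volcanoes.loc[d.idxmin(), "volcano"]

quakes["nearest"] = quakes.apply(nearest_volcano, axis=1)
print(quakes)
```

From there it's easy to add a distance column and start asking whether quake activity clusters near the volcanoes.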

