a guide to memory usage in R

If you are using R and it is choking on your large dataset, you may want to read this. It’s a chapter from Hadley Wickham’s forthcoming book, “Advanced R Programming”.

webscraping with Selenium - part 1

If you are webscraping with Python, chances are that you have already tried urllib, httplib, requests, etc. These are excellent libraries, but some websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work. Hence what the website “sees” is Chrome or Firefox or IE; it does not see Python or Selenium. That makes it a lot harder for the website to tell your bot from a human being.

In this tutorial I will show you how to webscrape with Selenium. This first post covers the basics: locating HTML elements and interacting with them. Later posts will cover things like downloading, error handling, dynamic names, and mass webscraping.

There are Selenium bindings for Python, Java, C#, Ruby, and JavaScript. All the examples in this tutorial will be in Python, but translating them to those other languages is trivial.

installing Selenium

To install the Selenium bindings for Python, simply use pip:

pip install selenium

You also need a “driver”, which is a small program that allows Selenium to, well, “drive” your browser. This driver is browser-specific, so first we need to choose which browser we want to use. For now we will use Chrome (later we will switch to PhantomJS). Download the latest version of the chromedriver, unzip it, and note where you saved the unzipped file.

choosing our target

In this tutorial we will webscrape LexisNexis Academic. It’s a gated database but you are probably in academia (just a guess) so you should have access to it through your university.

(Note: LexisNexis Academic is set to have a new interface starting December 23rd, so if you are in the future the code below may not work. It will still help you understand Selenium though. And adapting it to the new LexisNexis interface will be a nice learning exercise.)

opening a webpage

Now on to coding. First we start the webdriver:

from selenium import webdriver

path_to_chromedriver = '/Users/yourname/Desktop/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

When you run this code you’ll see a new instance of Chrome magically launch.
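Newer Selenium releases changed how the driver path is handed over. Here is a minimal sketch, assuming Selenium 4 (which is newer than this post): the path goes through a `Service` object instead of the `executable_path` argument. The function name `start_chrome` is my own, not part of Selenium.

```python
def start_chrome(path_to_chromedriver=None):
    """Start Chrome in the Selenium 4 style (an assumed version, newer than this post)."""
    # Imported inside the function so this file can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if path_to_chromedriver is None:
        # Assumption: recent 4.x releases can usually locate a driver on their own.
        return webdriver.Chrome()
    # Otherwise point Selenium at the chromedriver file you unzipped earlier.
    return webdriver.Chrome(service=Service(executable_path=path_to_chromedriver))


# Usage (opens a real browser window, so it is not run here):
# browser = start_chrome('/Users/yourname/Desktop/chromedriver')
```

If the code from the post fails with a complaint about `executable_path`, this is probably the reason.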

Now let’s open the page we want:

url = 'https://www.lexisnexis.com/hottopics/lnacademic/?verb=sf&sfi=AC00NBGenSrch'
browser.get(url)

The page looks like this:

[Screenshot: the LexisNexis Academic search page]

locating page elements

Before we fill out forms and click buttons we need to locate these elements. This step is going to be easier if you know some HTML, but that is not a prerequisite (you will end up learning some HTML on the fly as you do more and more webscraping).

A page element usually has a few attributes, such as a name and an id, and it can also be addressed by a CSS selector or an xpath. (Don’t worry if you’ve never heard of these things before.) We can use any of these to locate the element we want.

How can we find what these attributes are for a given element? Simple: just right-click it and choose “Inspect Element”. Your browser will then show you the corresponding HTML code. For instance, if you do this with the “Search Terms” form on the page we opened above you’ll see something like this:

[Screenshot: Chrome’s element inspector with the “Search Terms” HTML highlighted]

The HTML code of the element you selected appears highlighted in blue. Let me copy and paste it below, so you can have a better look at it:

<textarea id="terms" style="vertical-align: top;" name="terms"></textarea>

Ha! Now we know two attributes of the “Search Terms” form: its name is “terms” and its id is (also) “terms”.
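If you would rather let Python read the attributes off the HTML, here is a small sketch that uses only the standard library (no Selenium involved). The class name `AttributeCollector` is made up for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; a dict is easier to read.
        self.found.append((tag, dict(attrs)))


# The HTML we copied from the inspector.
snippet = '<textarea id="terms" style="vertical-align: top;" name="terms"></textarea>'

collector = AttributeCollector()
collector.feed(snippet)
tag, attributes = collector.found[0]
print(tag, attributes["id"], attributes["name"])  # textarea terms terms
```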

We are not ready to locate the element though. Some HTML pages are split into multiple “frames” and our element may be inside one of these frames. We need to know which one. To find out, start on that blue-highlighted line we saw before and keep scrolling up until you find <frame. You’ll eventually find this line:

<frame src="" name="mainFrame" id="mainFrame" title="mainFrame">

That means our “Search Terms” form is inside a frame named “mainFrame”. Now keep scrolling up to see if “mainFrame” is inside some other frame. Here it is not, but that is always a possibility and you need to check.

The next thing we do is go to that frame. Here is how we do it:

browser.switch_to_frame('mainFrame')

Once we are in the correct frame we can finally search for the element. Let’s search for it using its id:

browser.find_element_by_id('terms')

And that’s it. We have located the element.
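Two things are worth sketching here. First, when frames are nested you have to enter them one at a time, outermost first. Second, Selenium 4 (an assumption; it is newer than this post) spells these calls `switch_to.frame(...)` and `find_element(by, value)`. The helper names `enter_frames` and `find_by_id` are mine, not Selenium’s.

```python
def enter_frames(browser, frame_names):
    """Walk from the top of the page down through nested frames, outermost first."""
    # Start from the top-level document so the list is always read the same way.
    browser.switch_to.default_content()
    for name in frame_names:
        browser.switch_to.frame(name)


def find_by_id(browser, element_id):
    """Locate an element by its id inside the frame we are currently in."""
    # "id" is the string behind selenium.webdriver.common.by.By.ID.
    return browser.find_element("id", element_id)


# Usage with a live browser (not run here):
# enter_frames(browser, ["mainFrame"])
# search_box = find_by_id(browser, "terms")
```

On our page the list has a single entry; if “mainFrame” were inside another frame, that outer frame would come first in the list.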

see the beauty?

As the code above shows, Selenium is very intuitive. To switch frames we use switch_to_frame. To find an element by its id we use find_element_by_id. And so on.

Another great feature of Selenium is that it’s very similar across all languages it supports. In Java, for instance, this is how we switch frames and find elements by id:

browser.switchTo().frame("frameName");
browser.findElement(By.id("elementId"));

So even if you first learn Selenium in Python it’s very easy to use it in other languages later.

interacting with page elements

Now that we’ve found the “Search Terms” form we can interact with it. First we want to make sure that the form is empty:

browser.find_element_by_id('terms').clear()

Now we can write on the form. Here we are interested in all occurrences of the word “balloon” in the news today. We start by writing “balloon” on the form:

browser.find_element_by_id('terms').send_keys('balloon')
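The two lines above look the element up twice. A small sketch of an alternative: locate it once, keep it in a variable, and reuse it. The helper name `replace_text` is hypothetical.

```python
def replace_text(element, text):
    """Empty a text field, then type new text into it."""
    element.clear()
    element.send_keys(text)
    return element


# Usage with a live browser (not run here), with the lookup written as in this post:
# search_box = browser.find_element_by_id('terms')
# replace_text(search_box, 'balloon')
```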

Next we need to specify the date. There is a “Specify Date” drop-down menu. Let us locate it. As usual we start by right-clicking the element and selecting “Inspect Element”. That gives us the following HTML code:

<select class="input" id="dateSelector1" style="vertical-align: top;" name="dateSelector1">
  <option value="">All available dates</option>
  <option value="0:DY">Today</option>
  <option value="is">Date is…</option>
  <option value="before">Date is before…</option>
  <option value="after">Date is after…</option>
  <option value="from">Date is between…</option>
  <option value="1:WK">Previous week</option>
  <option value="1:MO">Previous month</option>
  <option value="3:MO">Previous 3 months</option>
  <option value="6:MO">Previous 6 months</option>
  <option value="1:YR">Previous year</option>
  <option value="2:YR">Previous 2 years</option>
  <option value="5:YR">Previous 5 years</option>
  <option value="previous">Previous…</option></select>

We can see the element’s name and id but here we will use neither. This is a drop-down menu and we will need to select one of its options (“All available dates”, “Today”, etc.), so here we will use the element’s xpath. How do you get it? We are using Chrome here, so this is really simple: we just right-click the blue-highlighted line that corresponds to the element’s HTML code and select “Copy XPath”. Like this:

[Screenshot: the “Copy XPath” option in Chrome’s right-click menu]

That gives us the following xpath:

//*[@id="dateSelector1"]

Now, as usual, scroll up from the blue-highlighted line until you find out which frame contains the element. Here that is the same frame as “Search Terms” (i.e., “mainFrame”), so we are already there, no need to move.

If all we wanted was to locate the element, we would do this:

browser.find_element_by_xpath('//*[@id="dateSelector1"]')

But we want to open that drop-down menu and select “Today”. So we do this instead:

browser.find_element_by_xpath('//*[@id="dateSelector1"]/option[contains(text(), "Today")]').click()
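To see what these xpaths match without opening a browser, here is a rough sketch using Python’s built-in ElementTree on a trimmed copy of the menu’s HTML. This is a stand-in, not what Selenium does internally: ElementTree understands only a subset of XPath and has no `contains(text(), ...)`, so the example uses the exact-match form `[.='Today']` instead (this needs Python 3.7 or later). The wrapping `<form>` tag is added only so the menu has a parent to search from.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the drop-down menu from the page (most options left out).
snippet = """<select class="input" id="dateSelector1" name="dateSelector1">
  <option value="">All available dates</option>
  <option value="0:DY">Today</option>
  <option value="1:WK">Previous week</option>
</select>"""

root = ET.fromstring("<form>" + snippet + "</form>")

# Same idea as //*[@id="dateSelector1"]: any element whose id is dateSelector1.
menu = root.find(".//*[@id='dateSelector1']")

# Same idea as the longer xpath: the option inside that menu whose text is "Today".
today = root.find(".//*[@id='dateSelector1']/option[.='Today']")

print(menu.tag, today.get("value"))  # select 0:DY
```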

Now we’ve entered our search term (balloon) and selected our date (today). Next we need to select our news sources. That’s another drop-down menu, a bit further down the page. You know the drill: right-click the element, retrieve relevant attributes, scroll up to find out the frame. There isn’t anything new to learn here (and we haven’t left “mainFrame” yet), so I’ll just give you the code (let’s say we want to select all news sources in English):

browser.find_element_by_xpath('//*[@id="byType"]/option[text()="All News (English)"]').click()
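Selenium also ships a `Select` helper class for drop-down menus, which some people find easier to read than an xpath. A minimal sketch, assuming the menus can be located by the ids seen above (“dateSelector1” and “byType”); the function name `choose_option` is my own.

```python
def choose_option(browser, menu_id, visible_text):
    """Pick an option from a drop-down menu by the text the user sees."""
    # Imported inside the function so this file can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    menu = browser.find_element(By.ID, menu_id)
    Select(menu).select_by_visible_text(visible_text)


# Usage with a live browser already switched to "mainFrame" (not run here):
# choose_option(browser, "dateSelector1", "Today")
# choose_option(browser, "byType", "All News (English)")
```

Unlike the xpath with `contains(...)`, `select_by_visible_text` looks for the option text itself rather than a substring, so “Today” has to be spelled as it appears in the menu.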

Finally, we need to click the “Search” button (next to the “Search Terms” form) to submit the search. Same drill: right-click element, get attributes, scroll up to find frame. Except that here there is no id or name:

<input type="submit" value="Search" />

So we need to use xpath again, even though this is not a drop-down menu:

browser.find_element_by_xpath('//*[@id="searchForm"]/fieldset/ol/li[2]/span/span/input').click()

Now that is one ugly-looking xpath. Our code will look better if we use the element’s CSS selector instead:

browser.find_element_by_css_selector('input[type="submit"]').click()

I don’t know of any “copy-and-paste” way to get an element’s CSS selector, but if you stare at the line above long enough you can see how it derives from the element’s HTML code.
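To make that derivation explicit, here is a tiny sketch that builds such a selector from a tag name and its attributes. The function `attribute_selector` is hypothetical, and it does no escaping, so it only suits simple attribute values like the ones on this page.

```python
def attribute_selector(tag, **attributes):
    """Build a CSS selector of the form tag[name="value"][name="value"]..."""
    parts = ['[{}="{}"]'.format(name, value) for name, value in attributes.items()]
    return tag + "".join(parts)


# <input type="submit" value="Search" /> becomes:
print(attribute_selector("input", type="submit"))  # input[type="submit"]
print(attribute_selector("input", type="submit", value="Search"))
```

Keep in mind that Selenium returns the first element that matches, so if a frame has more than one submit button the second, more specific selector is the safer choice.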

That’s it. You should now see Chrome leaving the search page and opening the results page.

There is a lot more to cover, but that will have to wait.

(Part 2)

pandas' shortcomings

If you use pandas with big data you may want to check the presentation below, by Wes McKinney (pandas’ creator). He discusses why pandas doesn’t scale well and what he is doing about it (he is creating a new library - ‘badgers’; the benchmarks look promising).

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013) from wesm

hurry up if you are webscraping LexisNexis Academic

LexisNexis Academic will have a new interface on December 23rd. So if you’re webscraping them your code will probably stop working and you’ll need to rewrite it completely. Better hurry.

Python 2.7.6 released today

Here.