I’ve assumed that you know a bit of programming, so you are probably familiar with loops and conditional expressions. I won’t cover these (or any) general programming concepts, but I want to discuss two specific points. The first one is the importance of pacing your bot. The second one is how to iterate over searches on LexisNexis Academic. This second point is really about LexisNexis, not about webscraping in general, so you can safely skip it if that’s not the site you want to webscrape.
the importance of pacing
Your computer can fetch online content much faster than you can, so it’s tempting to just release the beast (i.e., your bot) into the wild and let it move full speed ahead. But that’s a dead giveaway. You want your bot to pass for a human, but if it moves at blazing-fast speeds that may set off all kinds of alarms with the administrators of the website (or with the bots they’ve built to detect enemy bots).
Hence you need to pace things. To do that, just insert a time.sleep(seconds) statement between iterations of the loop. Do a few searches manually first, see how long they take, and use that information to set seconds in a way that slows your bot down to human speed.
Better yet: make seconds partially random. Something like this:
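```python
import random
import time

# sleep for 5 seconds plus a random fraction of 5 seconds
time.sleep(5 + 5 * random.random())
```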
random.random() will generate a random number between 0 and 1. So we are randomizing the delay between searches, which will vary between 5 and 10 seconds. That looks a lot more like human activity than a uniform delay. Try doing 100 searches with exactly 5 seconds between them. You can’t. And if you can’t do it then you don’t want your bot to do it.
“Then why would I want to webscrape in the first place? If the bot can’t go faster than I can then what’s the point? I could simply manually fetch all the content I want.” You could and you should, if that’s at all feasible. Building a webscraping bot can take a couple of weeks, depending on the complexity of the website and on whether you have done this before. If fetching everything manually would take only a couple of minutes then there is little reason to do it programmatically.
Webscraping is for when fetching everything manually would take days or weeks or months. But even then you won’t necessarily be done any faster. It may still take weeks or months or years for your bot to do all the work (well, hopefully not years). The key point is: webscraping is not about finishing faster, it’s about freeing you to work on other, more interesting, tasks. While your bot is hard at work on LexisNexis or Factiva or any other site you are free to work on other parts of your dissertation, finish a conference paper, or binge-watch House of Cards on Netflix.
Also, fetching online content manually is error-prone. If you are doing it programmatically everything is transparent: you have the code, hence you know exactly what searches were performed. You can also log any errors, as we saw in part 3, so if something went wrong you will know all about it: day, hour, search expression, button clicked, etc.
But if you’re doing things manually how can you be sure that you did search for Congo Brazzaville and not for Congo Kinshasa instead? Imagine how tired and bored you will be by day #10. Do you really trust yourself not to make any typos? Or not to skip a search? You can hire undergrads to do the work, but if you can make mistakes while doing it then imagine people who have no stake whatsoever in your research results.
So, even if your bot doesn’t go any faster than you would, you will still be better off with it.
All that said, in part 5 (coming soon) we will see that you can actually make things go faster - if you have multiple bots. But that’s dangerous in a number of ways, and you need to know about all the dangers first. So hang in there.
looping over searches on LexisNexis Academic
Back in part 1 we submitted a search on LexisNexis Academic. We searched for all occurrences of the word “balloon” in the news that day. In part 2 we went to the results page and saw that there were 121 results. We then wrote some code to retrieve those 121 results.
That was all fine and dandy for introductory purposes but the thing is, that code only works when the number of results is between 1 and 500. If there are 0 results we don’t get the results page, we get this page instead:
Selenium will look for the ‘fr_resultsNav…’ frame (remember that?), won’t find it and will throw a NoSuchElementException.
Conversely, if there are over 3000 results we get this page instead:
Same as before: Selenium will look for the ‘fr_resultsNav…’ frame, won’t find it and will throw a NoSuchElementException.
Finally, if the number of results is between 501 and 3000 the code from part 2 will work fine up to the point where the “Download” or “Send” button is clicked (depending on whether you are downloading the results or having LexisNexis email them to you). Then LexisNexis will give you an error message.
Yep, we can only retrieve 500 results at a time. The code from part 2 tries to download/email “All Documents”. But here we have 587 results, so we can’t do that.
You can see where this is going: you will need to branch your loop in order to account for those different scenarios.
Selenium-wise there is nothing new here so I won’t give you all the code, just pieces of it.
Scenario #1: no results
We need to locate the “No Documents Found” message that we get when there are no results. You already know how to find page elements (see part 1 if you don’t). But we can’t simply use browser.find_element_by_id. If we do, and we are not on the “no results” page, Selenium will fail to find the “No Documents Found” message and the code will crash. Hence we need to encapsulate browser.find_element_by_id inside a try/except statement. If the “No Documents Found” element is found then we click “Edit Search” (top of the page) and move on to the next search. Otherwise we have one or more results and hence we need to move on to the results page.
Here’s some pseudocode for that (say we know the id of the element).
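(The ids below are made up; use the ones you find when you inspect the page.)

```python
from selenium.common.exceptions import NoSuchElementException

# inside the loop over searches:
try:
    browser.find_element_by_id('noResultsMessage')         # made-up id of the "No Documents Found" message
    browser.find_element_by_id('editSearchLink').click()   # made-up id of the "Edit Search" link
    # ...then move on to the next search
except NoSuchElementException:
    pass  # one or more results: proceed to the results page
```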
This works. It’s not the most elegant solution though. Not hitting the “no results” page is not exactly an error. So it feels weird to treat it as such.
Selenium doesn’t have a “check if element exists” method, but we can emulate one. Something like this:
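(The function name is mine; call it whatever you like.)

```python
from selenium.common.exceptions import NoSuchElementException

def exists_by_id(element_id):
    """Check whether an element with the given id is on the current page."""
    try:
        browser.find_element_by_id(element_id)
        return True
    except NoSuchElementException:
        return False
```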
I know, we didn’t get rid of the try/except statement. But at least it is now contained inside a function and we don’t have to see it every time we need to check for some element’s existence based on its id.
You can write similar functions for other identifiers (name, xpath, etc). You can also write a more general function where you pass the identifier as an argument. Whatever suits your stylistic preferences.
You may want to log any searches that yield no results.
Scenario #2: more than 3000 results
This is similar to the “no results” scenario, with only two differences. First, we need to look for the “More than 3000 Results” message (rather than the “No Documents Found” message); as before, we need to look for that message in a “safe” way, with a try/except statement or a user-defined function. Second, we have the option to go back and edit the search or proceed to the results page.
We can choose the latter by clicking the “Retrieve Results” button. But caution: when there are more than 3000 results the results page will only give us 1000 results. I don’t know what criteria LexisNexis uses to select those 1000 results (I asked them but they never bothered to reply to my email). Depending on what you intend to do later you may want to consider issues of comparability and selection bias.
Scenario #3: 1-3000 results
If we are neither on the “no results” page nor on the “3000+ results” page then our first step is to retrieve the total number of results.
That number is contained in the totalDocsInResult object, as attribute “value”. Here is the object’s HTML code:
totalDocsInResult, in turn, is contained inside the fr_resultsNav... frame that we already saw in part 2, which in turn is inside ‘mainFrame’. We already know how to move into fr_resultsNav... (see part 2). Once we are there extracting the total number of results is straightforward.
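Something along these lines should work (here I locate totalDocsInResult by its name; adjust the locator if your page only exposes an id):

```python
browser.switch_to_frame('mainFrame')
browser.switch_to_frame(frame_name)  # the fr_resultsNav... frame, as in part 2
n_results = int(browser.find_element_by_name('totalDocsInResult').get_attribute('value'))
```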
(totalDocsInResult stores the number as a string, so we need to use int() to convert it to a number.)
If we have between 1 and 500 results nothing changes and we can use the code from part 2. But if we have between 501 and 3000 results that code won’t work, since we can only retrieve 500 results at a time. We need to iterate over batches of 500 results if we have 501-3000 results. Here is some starter code.
Lines 1-5 create the necessary accumulators and start the loop. Lines 16-18 fill out the “Select Items” form, which we didn’t need in part 2 (we had fewer than 500 results, so we just selected “All Documents”). Line 19 clicks the “Send” button.
As before, once we click “Send” LexisNexis will shove the results into a text file and email it. Generating that file may take a while, all the more so since we are now selecting 500 results, which is a lot. It may take a whole minute or so before the “Close Window” button finally appears on the pop-up.
That’s why we need the explicit wait you see in lines 20-25 (see part 3 if this is new to you). If the “Close Window” button takes over two minutes to appear we close the pop-up by brute force (and we hope that the file with the results was generated and sent correctly).
Lines 26-30 update the accumulators and lines 31-37 take us back to the results page.
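In case it helps, here is a condensed sketch of that inner loop - the element ids are made up and the pop-up handling is abbreviated, so adapt it to what you find when you inspect the page:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

first = 1
while first <= n_results:
    last = min(first + 499, n_results)  # batches of (at most) 500 results
    # ...open the "Send Documents" pop-up and switch to it, as in part 2...
    select_items = browser.find_element_by_id('rangeTextBox')  # made-up id of the "Select Items" field
    select_items.clear()
    select_items.send_keys('{}-{}'.format(first, last))
    browser.find_element_by_id('sendButton').click()  # made-up id of the "Send" button
    try:
        # wait up to two minutes for the "Close Window" button to show up
        WebDriverWait(browser, 120).until(
            EC.element_to_be_clickable((By.ID, 'closeButton'))).click()  # made-up id
    except TimeoutException:
        browser.close()  # brute force; hope the file was generated and sent anyway
    browser.switch_to_window(browser.window_handles[0])  # back to the results page
    first = last + 1  # update the accumulator
```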
Naturally this entire loop will be inside the big loop that iterates over your searches. I won’t give you any code for that outer loop, but really it’s simpler than the inner loop above.
Here I only used an explicit wait for the “Close Window” button but of course you will want to sprinkle explicit waits whenever you feel the need (i.e., whenever your code crashes while trying to locate an element or interact with it). Review part 3 if needed.
That’s about it for now. In the next post we will cover headless browsing and parallel webscraping.
In part 2 we learned how to handle dynamic names and how to download content with Selenium. Here we will learn how to make our code robust to network flukes.
handling errors
When you run a regression multiple times the result is always the same, provided that the data and code you are using are the same. You run it a million times and there it is, same result. In other words, the result is deterministic.
With webscraping, however, the result is probabilistic. Sometimes a page element doesn’t load properly. Sometimes the servers are too busy to respond to a click. Sometimes your own internet connection flickers for a millisecond. And so on.
In LexisNexis, for instance, sometimes you get this:
In these cases Selenium will fail to find the elements you want and will crash, throwing errors like NoSuchElementException or NoSuchFrameException. If you’ve tried the code from parts 1 and 2 you may have encountered these errors already. It’s not that the code is wrong, it’s just that it is incomplete; we haven’t prepared it for network flukes. Let’s do that now.
One thing we can do is ensure that Selenium waits for a few seconds before it gives up on finding elements. There are different ways to do that. First there is the implicit wait statement:
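```python
browser.implicitly_wait(30)  # wait up to 30 seconds when looking for any element
```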
This statement makes Selenium wait up to 30 seconds before throwing an exception. You set the time limit once in your code and it is valid for the entire session.
Alternatively, you can set individual wait parameters for each action. To do that we first need to import a bunch of other stuff from the Selenium bindings:
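```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
```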
Now suppose that we want to wait for up to two minutes before we declare an element “missing”. Let’s say that the element is a button and that we know its CSS selector. We can do this:
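(The ‘#someButton’ selector below is just a placeholder; use your element’s actual CSS selector.)

```python
button = WebDriverWait(browser, 120).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#someButton')))  # placeholder selector
```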
Selenium will look for the element every 500 milliseconds and, as soon as the element is found, the wait is over. If 120 seconds elapse and the element hasn’t been found, Selenium throws a TimeoutException.
You need to decide what to do about the TimeoutException. Do you re-try a couple of times? Do you go back to the search page and move on to the next search? That of course depends on your particular research needs. But whatever path you choose you want your code to handle that exception gracefully. In Python that is done with try/except statements, like this:
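```python
try:
    button = WebDriverWait(browser, 120).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#someButton')))  # placeholder selector
    button.click()
except TimeoutException:
    pass  # handle it: log the error, retry the search, move on to the next one, etc.
```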
That way your code won’t crash when Selenium throws a TimeoutException. It will do whatever is inside the except statement instead.
Here we used the presence_of_element_located condition, but that is not always what we need. Sometimes the element is located but cannot be interacted with (yet). Selenium offers wait conditions for several different possibilities. For instance, sometimes the element is located but Selenium crashes and the error message says that the element is not clickable. In that case we can do something like this:
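```python
button = WebDriverWait(browser, 120).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#someButton')))  # placeholder selector
button.click()
```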
Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. Run your code without any waits first and see where it crashes. Add a wait condition for the problematic element, encapsulate the wait condition within a try/except statement, and run the code again. Repeat until your code doesn’t crash anymore.
This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
It’s always a good idea to log errors. You can create a log file in the beginning of your code, like this:
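(The file name is arbitrary.)

```python
log = open('errors.log', mode='a')  # arbitrary file name
```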
And then add an entry to that file every time you get a TimeoutException:
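(Here search_term stands for whatever variable holds the current search expression.)

```python
import time

# inside the except TimeoutException block:
log.write('TimeoutException; search: {}; time: {}\n'.format(search_term, time.ctime()))
log.flush()
```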
Once your code has finished running you can inspect the log file and see what searches you need to re-do.
In part 1 we learned how to locate page elements and how to interact with them. Here we will learn how to deal with dynamic names and how to download things with Selenium.
handling dynamic names
In part 1 we submitted a search on LexisNexis Academic. We will now retrieve the search results.
The results page of LexisNexis Academic looks like this:
Our first task is to switch to the default frame of the page.
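```python
browser.switch_to_default_content()
```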
Now we need to click the “Download Documents” button (it’s the one that looks like a floppy disk; it’s right above the search results). We already know how to do that with Selenium: right-click the element, inspect its HTML code, scroll up to see what frame contains it, use all this information to locate the element and interact with it. We’ve learned all that in part 1. By following that recipe we find that the “Download Documents” button is inside the frame named “fr_resultsNav~ResultsMaxGroupTemplate0.6175091262270153”, which in turn is inside the frame named “mainFrame”. So our first instinct is to do this:
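```python
browser.switch_to_frame('mainFrame')
browser.switch_to_frame('fr_resultsNav~ResultsMaxGroupTemplate0.6175091262270153')
# ...and then locate and click the "Download Documents" button
```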
Except it won’t work here.
Here is the problem: that fr_resultsNav~ResultsMaxGroupTemplate0.6175091262270153 frame has a different name every time you do a new search. So your code will miss it and crash (which is precisely what LexisNexis wants to happen, since they don’t care for webscrapers).
What are we to do then? Here the solution is simple. That frame name always changes, but only partially: it always begins with fr_resultsNav. So we can look for the frame that contains fr_resultsNav in its name.
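One way to do that (note that we need to be inside “mainFrame” first, since that is where the frame we want lives):

```python
browser.switch_to_frame('mainFrame')
dyn_frame = browser.find_element_by_xpath('//frame[contains(@name, "fr_resultsNav")]')
```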
Our dyn_frame object contains the full frame name as an attribute, which we can then extract and store.
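```python
frame_name = dyn_frame.get_attribute('name')
```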
Now we can finally move to that frame and click the “Download Documents” button.
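(The locator for the button below is made up; use whatever you found when you inspected it.)

```python
browser.switch_to_frame(frame_name)  # we are already inside "mainFrame", so this works
browser.find_element_by_css_selector('img[alt="Download Documents"]').click()  # made-up locator
```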
Great! We have solved the dynamic name problem.
Notice the sequence here: first we move to “mainFrame” and then we move to fr_resultsNav~ResultsMaxGroupTemplate…. The sequence is important: we need to move to the parent frame before we can move to the child frame. If we try to move to fr_resultsNav~ResultsMaxGroupTemplate… directly that won’t work.
Now, what if the entire name changed? What would we do then?
In that case we could use the position of the frame. If you inspect the HTML code of the page you will see that inside “mainFrame” we have eight different frames and that fr_resultsNav~ResultsMaxGroupTemplate… is the 6th. As long as that position remains constant we can do this:
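```python
browser.switch_to_frame('mainFrame')
browser.switch_to_frame(5)  # the 6th child frame of "mainFrame"
```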
In other words, we can switch to a frame based on its position. Here we are selecting the 6th child frame of “mainFrame” - whatever its name is. (As is usually the case in Python, indexing starts from zero, so the index of the 6th item is 5, not 6.)
switching windows
Once we click the “Download Documents” button LexisNexis will launch a pop-up window.
We need to navigate to that window. To do that we will need the browser.window_handles object, which (as its name suggests) contains the handles of all the open windows. The pop-up window we want is the second window we opened in the browser, so its index is 1 in the browser.window_handles object (remember, Python indexes from zero). Switching windows, in turn, is similar to switching frames: browser.switch_to_window(). Putting it all together:
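```python
browser.switch_to_window(browser.window_handles[1])  # the pop-up is the second window, hence index 1
```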
That pop-up window contains a bunch of forms and buttons, but all we want to do here is choose the format for our results. Let’s say we want them in a plain text file.
Finally we click the “Download” button.
So far so good.
downloading with Selenium
Once we click the “Download” button LexisNexis shoves all the search results into a file and gives us a link to it.
Now we are in a bit of a pickle. Let me explain why.
When you click that link (whether manually or programmatically) your browser opens a dialog box asking you where you want to save that file. That is a problem here because Selenium can make your browser interact with webpages but cannot make your browser interact with itself. In other words, Selenium cannot make your browser change its bookmarks, switch to incognito mode, or (what matters here) interact with dialog boxes.
I know, this sounds preposterous, but here is a bit of context: Selenium was conceived as a testing tool, not as a webscraping tool. Selenium’s primary purpose is to help web developers automate tests on the sites they develop. Now, web developers can only control what the website does; they cannot control how your computer reacts when you click a download link. So to web developers it doesn’t matter that Selenium can’t interact with dialog boxes.
In other words, Selenium wasn’t created for us. It’s a great webscraping tool - the best one I’ve found so far. I can’t imagine how you would even submit a search on LexisNexis using urllib or httplib, let alone retrieve the search results. But, yes, we are not Selenium’s target audience. Just hang in there and everything will be all right.
Ok, enough context - how can we solve the problem? There are a number of solutions (some better than others) and I will talk about each of them in turn.
Solution #1: combine Selenium with some OS command
If you are on a Linux system you can simply use wget to get the file. wget is not a Python module, it is a Linux command for getting files from the web. For instance, to download R’s source code you open the terminal and do
The trick here is to find the URL behind the link LexisNexis generates. That link is dynamically generated, so it changes every time we do a new search. It looks like this:
If you stare at this HTML code long enough you will see some structure in it. Yes, it changes every time we do a new search, but some parts of it change in a predictable way. The news source (All_English_Language_News) is always there. So are the date (“2013-11-12”) and the hour (“22-26”) of the request. And so is the file extension (“.TXT”). We can use this structure to retrieve the URL. For instance, we can use the “.TXT” extension to do that, like this:
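For example (assuming the link is an ordinary <a> element whose href contains “.TXT”):

```python
# grab the link whose URL ends in ".TXT" and extract that URL
results_link = browser.find_element_by_xpath('//a[contains(@href, ".TXT")]')
url = results_link.get_attribute('href')
```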
Now we have our URL. On to wget then. This is an OS command, so first we need to import Python’s os module.
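```python
import os
```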
Now we execute wget.
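```python
os.system('wget "' + url + '"')
```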
And voilà, the file is downloaded to your computer.
If you are on a Mac you can use curl instead (or install wget from MacPorts). There must be something similar for Windows as well, just google around a bit.
I know, platform-specific solutions are bad. I tried using urllib2 and requests but that didn’t work. What I got back was not the text file I had requested but some HTML gibberish instead.
Solution #2: set a default download folder
This one doesn’t always work. I only show it for the sake of completeness.
Here you set a default download folder. That way the browser will automatically send all downloads to that folder, without opening up any dialog boxes (in theory, at least). Here is the code:
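Something along these lines (the folder path is just a placeholder):

```python
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'download.default_directory': '/path/to/some/folder'})
browser = webdriver.Chrome(chrome_options=options)
```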
It looks like a great solution, but often it simply doesn’t work at all. I’ve had trouble with it in Chrome and I’ve also had trouble with a similar solution for Firefox.
This is not surprising. The ChromeOptions capability is an experimental feature, as the code itself tells us (check the third line). Remember: Selenium wasn’t originally conceived for webscrapers, so it can’t make the browser interact with itself. The ChromeOptions capability was not created by the Selenium folks but by the chromedriver folks. Hopefully these tools will eventually become reliable but we are not quite there yet.
You may be thinking “what if I set the browser’s preferences manually?” It doesn’t work. The preferences you set manually are saved under your user profile and they are loaded every time you launch the browser but ignored when Selenium launches the browser. So, no good (believe me, I’ve tried it).
Solution #3: improve Selenium
If you are feeling adventurous you could add download capabilities to Selenium yourself. This guy did it (he also argues that people shouldn’t download anything with Selenium in the first place but he is talking to web developers, not to webscrapers, so never mind that). He uses Java but I suppose that a Python equivalent shouldn’t be too hard to produce.
Alas, that solution has 171 lines of code whereas the wget solution has only one line of code (two if you count import os), so I never bothered trying. But just because I was happy to settle for a quick-and-dirty workaround doesn’t mean everyone will be.
Solution #4: just don’t download at all
If you happen to be webscraping LexisNexis Academic there is yet another way: just have LexisNexis email the search results to you.
Code-wise there isn’t much novelty here. These lines remain the same:
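(Same as before: find the dynamically named frame and switch into it.)

```python
browser.switch_to_frame('mainFrame')
dyn_frame = browser.find_element_by_xpath('//frame[contains(@name, "fr_resultsNav")]')
frame_name = dyn_frame.get_attribute('name')
browser.switch_to_frame(frame_name)
```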
But then we click the “Email Documents” button instead of the “Download Documents” button.
We get a pop-up window very similar to the one we saw before.
We switch to the new window.
We ask that the document be sent as an attachment and that it be in plain text format.
We enter our email address.
We create a little note to help us remember what this search is about.
And finally we send it.
That’s it. No platform-specific commands, no experimental features. The downside of this solution is that it is LexisNexis-specific.
This is it for now. In the next post we will cover error handling (if you are coding along and getting error messages like NoSuchElementException or NoSuchFrameException, just hang in there; for now you can just add a time.sleep(15) statement before each window opens and that should do it; but I will show you better solutions). I will also show you how to make your code work for any number of search results in LexisNexis (the code we’ve seen so far only works when the number of results is between 1 and 500; if there are 0 results or more than 500 the code will crash). In later posts we will cover some advanced topics, like using PhantomJS as a browser.