I just came across this PyData talk on Python tools for parallelizing machine learning applications. It’s worth watching if you have tons of texts and doing things sequentially is just not working for you.
You can find other PyData talks here. (The ones from the last PyData NYC, held last weekend, are already there.)
R is a shockingly dreadful language for an exceptionally useful data analysis environment. The more you learn about the R language, the worse it will feel. The development environment suffers from literally decades of accretion of stupid hacks from a community containing, to a first-order approximation, zero software engineers. R makes me want to kick things almost every time I use it.
I’ve assumed that you know a bit of programming, so you are probably familiar with loops and conditional expressions. I won’t cover these (or any) general programming concepts, but I want to discuss two specific points. The first one is the importance of pacing your bot. The second one is how to iterate over searches on LexisNexis Academic. This second point is really about LexisNexis, not about webscraping in general, so you can safely skip it if that’s not the site you want to webscrape.
the importance of pacing
Your computer can fetch online content much faster than you can, so it’s tempting to just release the beast (i.e., your bot) into the wild and let it move full speed ahead. But that’s a dead giveaway. You want your bot to pass for a human, but if it moves at blazing-fast speed it may set off all kinds of alarms with the administrators of the website (or with the bots they’ve built to detect enemy bots).
Hence you need to pace things. To do that just insert a time.sleep(seconds) statement between each iteration of the loop. Do a few searches manually first, see how long it takes, and use that information to set seconds in a way that slows your bot down to human speed.
Better yet: make seconds partially random. Something like this:
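import random
import time

# base delay of 5 seconds plus a random extra of up to 5 seconds before the next search
time.sleep(5 + 5 * random.random())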
random.random() will generate a random number between 0 and 1. So we are randomizing the delay between searches, which will vary between 5 and 10 seconds. That looks a lot more like human activity than a uniform delay. Try doing 100 searches with exactly 5 seconds between them. You can’t. And if you can’t do it then you don’t want your bot to do it.
“Then why would I want to webscrape in the first place? If the bot can’t go faster than I can then what’s the point? I could simply manually fetch all the content I want.” You could and you should, if that’s at all feasible. Building a webscraping bot can take a couple of weeks, depending on the complexity of the website and on whether you have done this before. If fetching everything manually would take only a couple of minutes then there is little reason to do it programmatically.
Webscraping is for when fetching everything manually would take days or weeks or months. But even then you won’t necessarily be done any faster. It may still take weeks or months or years for your bot to do all the work (well, hopefully not years). The key point is: webscraping is not about finishing faster, it’s about freeing you to work on other, more interesting, tasks. While your bot is hard at work on LexisNexis or Factiva or any other site you are free to work on other parts of your dissertation, finish a conference paper, or binge-watch House of Cards on Netflix.
Also, fetching online content manually is error-prone. If you are doing it programmatically everything is transparent: you have the code, hence you know exactly what searches were performed. You can also log any errors, as we saw in part 3, so if something went wrong you will know all about it: day, hour, search expression, button clicked, etc.
But if you’re doing things manually how can you be sure that you did search for Congo Brazzaville and not for Congo Kinshasa instead? Imagine how tired and bored you will be by day #10. Do you really trust yourself not to make any typos? Or not to skip a search? You can hire undergrads to do the work, but if you can make mistakes while doing it then imagine people who have no stake whatsoever in your research results.
So, even if your bot doesn’t go any faster than you would, you will still be better off with it.
All that said, in part 5 (coming soon) we will see that you can actually make things go faster - if you have multiple bots. But that’s dangerous in a number of ways, and you need to know about all the dangers first. So hang in there.
looping over searches on LexisNexis Academic
Back in part 1 we submitted a search on LexisNexis Academic. We searched for all occurrences of the word “balloon” in the news that day. In part 2 we went to the results page and saw that there were 121 results. We then wrote some code to retrieve those 121 results.
That was all fine and dandy for introductory purposes but the thing is, that code only works when the number of results is between 1 and 500. If there are 0 results we don’t get the results page, we get this page instead:
Selenium will look for the ‘fr_resultsNav…’ frame (remember that?), won’t find it and will throw a NoSuchElementException.
Conversely, if there are over 3000 results we get this page instead:
Same as before: Selenium will look for the ‘fr_resultsNav…’ frame, won’t find it and will throw a NoSuchElementException.
Finally, if the number of results is between 501 and 3000 the code from part 2 will work fine up to the point where the “Download” or “Send” button is clicked (depending on whether you are downloading the results or having LexisNexis email them to you). Then LexisNexis will give you an error message.
Yep, we can only retrieve 500 results at a time. The code from part 2 tries to download/email “All Documents”. But here we have 587 results, so we can’t do that.
You can see where this is going: you will need to branch your loop in order to account for those different scenarios.
Selenium-wise there is nothing new here so I won’t give you all the code, just pieces of it.
Scenario #1: no results
We need to locate the “No Documents Found” message that we get when there are no results. You already know how to find page elements (see part 1 if you don’t). But we can’t simply use browser.find_element_by_. If we do and we are not on the “no results” page Selenium will fail to find the “No Documents Found” message and the code will crash. Hence we need to encapsulate browser.find_element_by_ inside a try/except statement. If the “No Documents Found” element is found then we click “Edit Search” (top of the page) and move on to the next search. Otherwise we have one or more results and hence we need to move on to the results page.
Here’s some pseudocode for that (say we know the id of the element).
try:
    # element_id = the id of the "No Documents Found" element
    browser.find_element_by_id(element_id)
    # we are on the "no results" page:
    # click "Edit Search" and move on to the next search
except NoSuchElementException:
    # we have one or more results: move on to the results page
    pass
This works. It’s not the most elegant solution though. Not hitting the “no results” page is not exactly an error. So it feels weird to treat it as such.
Selenium doesn’t have a “check if element exists” method, but we can emulate one. Something like this:
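(The function’s name is up to you; here I’m calling it element_exists_by_id.)

from selenium.common.exceptions import NoSuchElementException

def element_exists_by_id(browser, element_id):
    # returns True if the current page has an element with this id, False otherwise
    try:
        browser.find_element_by_id(element_id)
        return True
    except NoSuchElementException:
        return False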
I know, we didn’t get rid of the try/except statement. But at least it is now contained inside a function and we don’t have to see it every time we need to check for some element’s existence based on its id.
You can write similar functions for other identifiers (name, xpath, etc). You can also write a more general function where you pass the identifier as an argument. Whatever suits your stylistic preferences.
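For instance, a more general version might look like this, where by is one of Selenium’s locator strategies:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def element_exists(browser, by, value):
    # 'by' is one of Selenium's locator strategies (By.ID, By.NAME, By.XPATH, etc.)
    try:
        browser.find_element(by=by, value=value)
        return True
    except NoSuchElementException:
        return False

# usage: element_exists(browser, By.ID, 'totalDocsInResult')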
You may want to log any searches that yield no results.
Scenario #2: more than 3000 results
This is similar to the “no results” scenario, with only two differences. First, we need to look for the “More than 3000 Results” message (rather than the “No Documents Found” message); as before, we need to look for that message in a “safe” way, with a try/except statement or a user-defined function. Second, we have the option to go back and edit the search or proceed to the results page.
We can choose the latter by clicking the “Retrieve Results” button. But caution: when there are more than 3000 results the results page will only give us 1000 results. I don’t know what criteria LexisNexis uses to select those 1000 results (I asked them but they never bothered to reply to my email). Depending on what you intend to do later you may want to consider issues of comparability and selection bias.
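If you do decide to proceed, the flow looks roughly like this; note that the XPath and the button selector below are just placeholders (inspect the actual page to get the real ones), and I’m reusing the element_exists function from above:

# placeholder locators - check the page source for the real ones
if element_exists(browser, By.XPATH, '//*[contains(text(), "More than 3000 Results")]'):
    # either click "Edit Search" and move on to the next search,
    # or proceed to the (truncated) results page:
    browser.find_element_by_css_selector('img[alt="Retrieve Results"]').click()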
Scenario #3: 1-3000 results
If we are neither on the “no results” page nor on the “3000+ results” page then our first step is to retrieve the total number of results.
That number is contained in the totalDocsInResult object, as attribute “value”. Here is the object’s HTML code:
totalDocsInResult, in turn, is contained inside the fr_resultsNav... frame that we already saw in part 2, which in turn is inside ‘mainFrame’. We already know how to move into fr_resultsNav... (see part 2). Once we are there extracting the total number of results is straightforward.
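Assuming totalDocsInResult is the element’s id (check the page’s HTML; you may need find_element_by_name instead), extracting the count looks like this:

# we are already inside the fr_resultsNav... frame at this point
total = int(browser.find_element_by_id('totalDocsInResult').get_attribute('value'))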
(totalDocsInResult stores the number as a string, so we need to use int() to convert it to a number.)
If we have between 1 and 500 results nothing changes and we can use the code from part 2. But if we have between 501 and 3000 results that code won’t work, since we can only retrieve 500 results at a time; we need to iterate over batches of 500. Here is some starter code.
if total > 500:
    initial = 1
    final = 500
    batch = 0
    while final <= total and final >= initial:
        batch += 1
        browser.find_element_by_css_selector('img[alt="Email Documents"]').click()
        browser.switch_to_default_content()
        browser.switch_to_window(browser.window_handles[1])
        browser.find_element_by_xpath('//select[@id="sendAs"]/option[text()="Attachment"]').click()
        browser.find_element_by_xpath('//select[@id="delFmt"]/option[text()="Text"]').click()
        browser.find_element_by_name('emailTo').clear()
        browser.find_element_by_name('emailTo').send_keys(email)
        browser.find_element_by_id('emailNote').clear()
        browser.find_element_by_id('emailNote').send_keys('balloon')
        browser.find_element_by_id('sel').click()
        browser.find_element_by_id('rangetextbox').clear()
        browser.find_element_by_id('rangetextbox').send_keys('{}-{}'.format(initial, final))
        browser.find_element_by_css_selector('img[alt="Send"]').click()
        try:
            element = WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'img[alt="Close Window"]')))
        except TimeoutException:
            log_errors.write('oops, TimeoutException when searching for balloon' + '\n')
        time.sleep(30)
        browser.close()
        initial += 500
        if final + 500 > total:
            final = total
        else:
            final += 500
        backwindow = browser.window_handles[0]
        browser.switch_to_window(backwindow)
        browser.switch_to_default_content()
        browser.switch_to_frame('mainFrame')
        framelist = browser.find_elements_by_xpath("//frame[contains(@name, 'fr_resultsNav')]")
        framename = framelist[0].get_attribute('name')
        browser.switch_to_frame(framename)
Lines 1-5 create the necessary accumulators and start the loop. Lines 16-18 fill out the “Select Items” form, which we didn’t need in part 2 (we had fewer than 500 results, so we just selected “All Documents”). Line 19 clicks the “Send” button.
As before, once we click “Send” LexisNexis will shove the results into a text file and email it. Generating that file may take a while, all the more so since we are now requesting 500 results at a time, which is a lot. It may take a whole minute or so before the “Close Window” button finally appears on the pop-up.
That’s why we need the explicit wait you see in lines 20-25 (see part 3 if this is new to you). If the “Close Window” button takes over two minutes to appear we close the pop-up by brute force (and we hope that the file with the results was generated and sent correctly).
Lines 26-30 update the accumulators and lines 31-37 take us back to the results page.
Naturally this entire loop will be inside the big loop that iterates over your searches. I won’t give you any code for that outer loop, but really it’s simpler than the inner loop above.
Here I only used an explicit wait for the “Close Window” button but of course you will want to sprinkle explicit waits whenever you feel the need (i.e., whenever your code crashes while trying to locate an element or interact with it). Review part 3 if needed.
That’s about it for now. In the next post we will cover headless browsing and parallel webscraping.
In part 2 we learned how to handle dynamic names and how to download content with Selenium. Here we will learn how to make our code robust to network flukes.
handling errors
When you run a regression multiple times the result is always the same, provided that the data and code you are using are the same. You run it a million times and there it is, same result. In other words, the result is deterministic.
With webscraping, however, the result is probabilistic. Sometimes a page element doesn’t load properly. Sometimes the servers are too busy to respond to a click. Sometimes your own internet connection flickers for a millisecond. And so on.
In LexisNexis, for instance, sometimes you get this:
In these cases Selenium will fail to find the elements you want and will crash, throwing errors like NoSuchElementException or NoSuchFrameException. If you’ve tried the code from parts 1 and 2 you may have encountered these errors already. It’s not that the code is wrong, it’s just that it is incomplete; we haven’t prepared it for network flukes. Let’s do that now.
One thing we can do is ensure that Selenium waits for a few seconds before it gives up on finding elements. There are different ways to do that. First there is the implicit wait statement:
browser.implicitly_wait(30)
This statement makes Selenium wait up to 30 seconds before throwing an exception. You set the time limit once in your code and it is valid for the entire session.
Alternatively, you can set individual wait parameters for each action. To do that we first need to import a bunch of other stuff from the Selenium bindings:
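# these cover the explicit waits and exceptions used below
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException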
Now suppose that we want to wait for up to two minutes before we declare an element “missing”. Let’s say that the element is a button and that we know its CSS selector. We can do this:
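# wait up to 120 seconds for the button to show up on the page
some_object = WebDriverWait(browser, 120).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'img[alt="Some Button"]')))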
Selenium will look for the element every 500 milliseconds and, as soon as the element is found, the wait is over. If 120 seconds elapse and the element hasn’t been found, Selenium throws a TimeoutException.
You need to decide what to do about the TimeoutException. Do you re-try a couple of times? Do you go back to the search page and move on to the next search? That of course depends on your particular research needs. But whatever path you choose you want your code to handle that exception gracefully. In Python that is done with try/except statements, like this:
try:
    some_object = WebDriverWait(browser, 120).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'img[alt="Some Button"]')))
except TimeoutException:
    # do something (retry, move on, exit, curse your internet provider, etc)
    pass
That way your code won’t crash when Selenium throws a TimeoutException. It will do whatever is inside the except statement instead.
Here we used the presence_of_element_located condition, but that is not always what we need. Sometimes the element is located but cannot be interacted with (yet). Selenium offers wait conditions for several different possibilities. For instance, sometimes the element is located but Selenium crashes and the error message says that the element is not clickable. In that case we can do something like this:
try:
    some_object = WebDriverWait(browser, 120).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'img[alt="Some Button"]')))
except TimeoutException:
    # do something
    pass
Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. Run your code without any waits first and see where it crashes. Add a wait condition for the problematic element, encapsulate the wait condition within a try/except statement, and run the code again. Repeat until your code doesn’t crash anymore.
This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
It’s always a good idea to log errors. You can create a log file in the beginning of your code, like this:
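# the file name is arbitrary; mode 'a' (append) keeps entries from previous runs
log_errors = open('log_errors.txt', mode='a')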
And then add an entry to that file every time you get a TimeoutException:
try:
    some_object = WebDriverWait(browser, 120).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'img[alt="Some Button"]')))
except TimeoutException:
    log_errors.write('couldnt locate button XYZ when searching for "balloon"' + '\n')
    # do something
Once your code has finished running you can inspect the log file and see what searches you need to re-do.