This is the final part of our Selenium tutorial. Here we will cover headless browsing and parallel webscraping.
why switch to headless browsing
While you are building and testing your Selenium script you want to use a conventional browser like Chrome or Firefox. That way you can see whether the browser is doing what you want it to do. You can see it clicking the buttons, filling out the text fields, etc. But once you are ready to release your bot into the wild you should consider switching to a headless browser. Headless browsers are browsers that you don't see. All the action happens in the background; there are no windows.
Why would anyone want that? Two reasons.
First, you decrease the probability of errors. Your bot will be doing only one thing: interacting with the HTML code behind the website. Opening windows and rendering content can always result in errors, so why do these things now that your script is functional? This is especially the case if you are parallelizing your webscraping. You don’t want five or six simultaneous Chrome windows. That would make it five or six times more likely that you get a graphics-related error.
Second, it makes it easier to use remote computers, like Amazon EC2, Google Compute Engine, or your university's supercomputing center. Remote computers have no displays; the standard way to use them is via the command-line interface. To use software that has a graphical user interface, like Chrome or Firefox, you need workarounds like X11 forwarding and virtual displays. Since a headless browser doesn't have a graphical user interface you don't need to worry about any of that. So why complicate things?
I know, it sounds weird to not see the action. But really, you’ve built and tested your script, you know it works, so now it’s time to get rid of the visual. In the words of Master Kenobi, “stretch out with your feelings”.
using a headless browser
There are several headless browsers, like ZombieJS, EnvJS, and PhantomJS (they tend to be written in JavaScript - hence the JS in the names). PhantomJS is the most popular, so that’s my pick (the more popular the tool the more documented and tested it is and the more people can answer your questions on StackOverflow).
Download PhantomJS and save the binary to the same folder where you downloaded chromedriver before. That binary contains both the browser itself and the Selenium driver, so there is no need to download a “phantomjsdriver” (actually they call it ghostdriver; it used to be a separate thing but now it’s just embedded in the PhantomJS binary, for your convenience).
You can start PhantomJS the same way you start Chrome.
path_to_phantomjs = '/Users/yourname/Desktop/phantomjs'  # change path as needed
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs)
But you don’t want that. Websites know which browser you are using. And they know that humans use Chrome, Firefox, Safari, etc. But PhantomJS? ZombieJS? Only bots use these. Hence many websites refuse to deal with headless browsers. We want the website to think that we are using something else. To do that we need to change the user-agent string.
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
                                             "(KHTML, like Gecko) Chrome/15.0.87")
Now we start PhantomJS but using the modified user-agent string.
path_to_phantomjs = '/Users/yourname/Desktop/phantomjs'  # change path as needed
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
And that’s it. The rest of your code doesn’t change. The only difference is that when you run your script you won’t see any windows popping up.
parallel webscraping
If you have tons of searches to do you may launch multiple instances of your code and have them run in parallel. Webscraping is an embarrassingly parallel problem, since no communication is required between the tasks. You don't even need to write a parallel script: you can simply copy and paste your script into n different files and have each do 1/n of the searches. In principle the math is simple: divide the work by two and you get done in roughly half the time, divide it by 10 and you get done in roughly a tenth of the time, and so on.
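For instance, here is a minimal sketch of the copy-and-paste approach. It assumes you have a list called searches and a do_search() function wrapping your Selenium routine from the previous parts (both names are just placeholders); each copy of the script gets a different bot_number and works only on its own slice of the searches:

n_bots = 4       # how many copies of this script you are running
bot_number = 0   # change this in each copy: 0, 1, 2, ..., n_bots - 1

searches = ['balloon', 'zeppelin', 'dirigible', 'blimp']  # your full list of searches

# each copy handles every n-th search, starting at its own offset
my_searches = searches[bot_number::n_bots]

for search in my_searches:
    do_search(search)  # your Selenium routine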
That is all very tempting and our first instinct is to parallelize as much as possible. We want to launch hundreds or thousands of simultaneous searches and be done in only a fraction of the time. We want to build and deploy a big, scary army of bots.
But you shouldn’t do that, both for your own sake and for the sake of others.
First of all you don't want to disrupt the website's operation. Be especially considerate with websites that normally don't attract a lot of visitors. They don't have that many servers, so if you overparallelize you can cause the whole thing to shut down. This is really bad karma. Remember that other human beings also need that website - to make a living (if they are the ones running it) or for their own research needs (people like you). Be nice.
Second, if you overparallelize your bots may lose their human façade. Humans can only handle a couple of simultaneous searches, so if the webadmins see 300 simultaneous searches coming from the same IP address they will know that something is off. So even if you don’t bring the website down you can still get in trouble. You worked hard to put your Selenium script together, you don’t want your bots to lose their cover now.
So, proceed gently. How many parallel bots is safe? I have no idea. That depends entirely on the website, on how many IPs you are using, on whether you have proper delays between your searches (see part 4), etc. As a general rule I wouldn’t have more than three or four parallel bots per IP address (because that’s what a human being could manage).
Multiple IPs don't mean you are safe though. If those IP addresses are all from the same place - say, your campus - then you are not fooling anyone. Suppose your university subscribes to LexisNexis and that, on average, about 500 people use it every day on campus. Then you deploy 500 parallel bots. The LexisNexis webadmins will see a sudden 100% increase in traffic coming from your campus. It doesn't matter whether you are using 1 or 500 IP addresses.
If you get caught things can get ugly. LexisNexis, for instance, may cut off your entire university if they catch even one person webscraping. Imagine being the person who caused your entire university to lose access to LexisNexis. Or to Factiva. Or to any other major database on which thousands of people in your campus rely for their own research needs. Bad, bad karma.
By the way, LexisNexis’ terms of use explicitly prohibit webscraping; check item 2.2. It’s safe to have up to 3 or 4 parallel bots; as long as you have proper delays between searches (see part 4) no one can tell your 3-4 bots from a human being. Remember: Selenium doesn’t connect to the site directly but through the browser, so all that LexisNexis sees is Chrome or Firefox or Safari. But if you overparallelize then you may become the person who caused LexisNexis to cut off your entire university.
(A quick rant: item 2.2 is immoral. Universities pay millions of dollars to LexisNexis in subscriptions but then when we really need to use the content they say we can’t? That’s like selling 1000 gallons of water but delivering only one drop at a time. This is especially outrageous now that we have the hardware and software to text-mine millions of texts. The optimal answer here is buyers’ collusion but I don’t see that happening. So I see our webscraping LexisNexis as a legitimate act of civil disobedience.)
In sum, parallelizing can speed things up but you need to be careful and mindful of others when doing it.
This is it. I hope you have found this tutorial useful. There is a lot I haven’t covered, like interacting with hidden elements and executing arbitrary JavaScript code, but at this point you can figure out anything by yourself.
I wrote this tutorial because even though Selenium is perhaps the most powerful webscraping tool ever made it’s still somewhat unknown to webscrapers. If you found it useful as well then help spread the word.
I’ve assumed that you know a bit of programming, so you are probably familiar with loops and conditional expressions. I won’t cover these (or any) general programming concepts, but I want to discuss two specific points. The first one is the importance of pacing your bot. The second one is how to iterate over searches on LexisNexis Academic. This second point is really about LexisNexis, not about webscraping in general, so you can safely skip it if that’s not the site you want to webscrape.
the importance of pacing
Your computer can fetch online content much faster than you can, so it's tempting to just release the beast (i.e., your bot) into the wild and let it move full speed ahead. But that's a dead giveaway. You want your bot to pass for a human, but if it moves at blazing-fast speeds that may set off all kinds of alarms with the administrators of the website (or with the bots they've built to detect enemy bots).
Hence you need to pace things. To do that, just insert a time.sleep(seconds) statement between iterations of the loop. Do a few searches manually first, see how long they take, and use that information to set seconds in a way that slows your bot down to human speed.
Better yet: make seconds partially random. Something like this:
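import random
import time

# delay of 5 to 10 seconds between searches
time.sleep(5 + 5 * random.random())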
random.random() will generate a random number between 0 and 1. So we are randomizing the delay between searches, which will vary between 5 and 10 seconds. That looks a lot more like human activity than a uniform delay. Try doing 100 searches with exactly 5 seconds between them. You can’t. And if you can’t do it then you don’t want your bot to do it.
“Then why would I want to webscrape in the first place? If the bot can’t go faster than I can then what’s the point? I could simply manually fetch all the content I want.” You could and you should, if that’s at all feasible. Building a webscraping bot can take a couple of weeks, depending on the complexity of the website and on whether you have done this before. If fetching everything manually would take only a couple of minutes then there is little reason to do it programmatically.
Webscraping is for when fetching everything manually would take days or weeks or months. But even then you won’t necessarily be done any faster. It may still take weeks or months or years for your bot to do all the work (well, hopefully not years). The key point is: webscraping is not about finishing faster, it’s about freeing you to work on other, more interesting, tasks. While your bot is hard at work on LexisNexis or Factiva or any other site you are free to work on other parts of your dissertation, finish a conference paper, or binge-watch House of Cards on Netflix.
Also, fetching online content manually is error-prone. If you are doing it programmatically everything is transparent: you have the code, hence you know exactly what searches were performed. You can also log any errors, as we saw in part 3, so if something went wrong you will know all about it: day, hour, search expression, button clicked, etc.
But if you're doing things manually how can you be sure that you did search for Congo Brazzaville and not for Congo Kinshasa instead? Imagine how tired and bored you will be by day #10. Do you really trust yourself not to make any typos? Or not to skip a search? You can hire undergrads to do the work, but if you yourself can make mistakes while doing it, then imagine people who have no stake whatsoever in your research results.
So, even if your bot doesn't go any faster than you would, you will still be better off with it.
All that said, in part 5 (coming soon) we will see that you can actually make things go faster - if you have multiple bots. But that's dangerous in a number of ways, and you need to know about those dangers first. So hang in there.
looping over searches on LexisNexis Academic
Back in part 1 we submitted a search on LexisNexis Academic. We searched for all occurrences of the word “balloon” in the news that day. In part 2 we went to the results page and saw that there were 121 results. We then wrote some code to retrieve those 121 results.
That was all fine and dandy for introductory purposes but the thing is, that code only works when the number of results is between 1 and 500. If there are 0 results we don’t get the results page, we get this page instead:
Selenium will look for the ‘fr_resultsNav…’ frame (remember that?), won’t find it and will throw a NoSuchElementException.
Conversely, if there are over 3000 results we get this page instead:
Same as before: Selenium will look for the ‘fr_resultsNav…’ frame, won’t find it and will throw a NoSuchElementException.
Finally, if the number of results is between 501 and 3000 the code from part 2 will work fine up to the point where the "Download" or "Send" button is clicked (depending on whether you are downloading the results or having LexisNexis email them to you). Then LexisNexis will give you an error message.
Yep, we can only retrieve 500 results at a time. The code from part 2 tries to download/email “All Documents”. But here we have 587 results, so we can’t do that.
You can see where this is going: you will need to branch your loop in order to account for those different scenarios.
Selenium-wise there is nothing new here so I won’t give you all the code, just pieces of it.
Scenario #1: no results
We need to locate the “No Documents Found” message that we get when there are no results. You already know how to find page elements (see part 1 if you don’t). But we can’t simply use browser.find_element_by_. If we do and we are not on the “no results” page Selenium will fail to find the “No Documents Found” message and the code will crash. Hence we need to encapsulate browser.find_element_by_ inside a try/except statement. If the “No Documents Found” element is found then we click “Edit Search” (top of the page) and move on to the next search. Otherwise we have one or more results and hence we need to move on to the results page.
Here’s some pseudocode for that (say we know the id of the element).
try:
    # element_id = id of the "No Documents Found" element
    browser.find_element_by_id(element_id)
    # click "Edit Search"
    # move on to the next search
except NoSuchElementException:
    # move on to the results page
    pass
This works. It’s not the most elegant solution though. Not hitting the “no results” page is not exactly an error. So it feels weird to treat it as such.
Selenium doesn’t have a “check if element exists” method, but we can emulate one. Something like this:
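from selenium.common.exceptions import NoSuchElementException

def exists_by_id(browser, element_id):
    # returns True if the page has an element with this id, False otherwise
    # (the function name is arbitrary - call it whatever you like)
    try:
        browser.find_element_by_id(element_id)
        return True
    except NoSuchElementException:
        return False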
I know, we didn’t get rid of the try/except statement. But at least it is now contained inside a function and we don’t have to see it every time we need to check for some element’s existence based on its id.
You can write similar functions for other identifiers (name, xpath, etc). You can also write a more general function where you pass the identifier as an argument. Whatever suits your stylistic preferences.
You may want to log any searches that yield no results.
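For instance, something like this (assuming log_errors is the log file object you opened in part 3 and search_term holds the current search expression - adjust the names to your own code):

log_errors.write('no results for: ' + search_term + '\n')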
Scenario #2: more than 3000 results
This is similar to the "no results" scenario, with only two differences. First, we need to look for the "More than 3000 Results" message (rather than the "No Documents Found" message); as before, we need to look for that message in a "safe" way, with a try/except statement or a user-defined function. Second, we have the option to go back and edit the search or to proceed to the results page.
We can choose the latter by clicking the "Retrieve Results" button. But caution: when there are more than 3000 results the results page will only give us 1000 of them. I don't know what criteria LexisNexis uses to select those 1000 results (I asked them but they never bothered to reply to my email). Depending on what you intend to do later you may want to consider issues of comparability and selection bias.
Scenario #3: 1-3000 results
If we are neither on the “no results” page nor on the “3000+ results” page then our first step is to retrieve the total number of results.
That number is contained in the totalDocsInResult object, as attribute “value”. Here is the object’s HTML code:
totalDocsInResult, in turn, is contained inside the fr_resultsNav... frame that we already saw in part 2, which in turn is inside ‘mainFrame’. We already know how to move into fr_resultsNav... (see part 2). Once we are there extracting the total number of results is straightforward.
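Something along these lines (here I'm assuming we can locate the element by its name; if not, use whatever identifier works for you):

total = int(browser.find_element_by_name('totalDocsInResult').get_attribute('value'))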
(totalDocsInResult stores the number as a string, so we need to use int() to convert it to a number.)
If we have between 1 and 500 results nothing changes and we can use the code from part 2. But if we have between 501 and 3000 results that code won't work, since we can only retrieve 500 results at a time; in that case we need to iterate over batches of 500 results. Here is some starter code.
if total > 500:
    initial = 1
    final = 500
    batch = 0
    while final <= total and final >= initial:
        batch += 1
        browser.find_element_by_css_selector('img[alt="Email Documents"]').click()
        browser.switch_to_default_content()
        browser.switch_to_window(browser.window_handles[1])
        browser.find_element_by_xpath('//select[@id="sendAs"]/option[text()="Attachment"]').click()
        browser.find_element_by_xpath('//select[@id="delFmt"]/option[text()="Text"]').click()
        browser.find_element_by_name('emailTo').clear()
        browser.find_element_by_name('emailTo').send_keys(email)
        browser.find_element_by_id('emailNote').clear()
        browser.find_element_by_id('emailNote').send_keys('balloon')
        browser.find_element_by_id('sel').click()
        browser.find_element_by_id('rangetextbox').clear()
        browser.find_element_by_id('rangetextbox').send_keys('{}-{}'.format(initial, final))
        browser.find_element_by_css_selector('img[alt="Send"]').click()
        try:
            element = WebDriverWait(browser, 120).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'img[alt="Close Window"]')))
        except TimeoutException:
            log_errors.write('oops, TimeoutException when searching for balloon' + '\n')
        time.sleep(30)
        browser.close()
        initial += 500
        if final + 500 > total:
            final = total
        else:
            final += 500
        backwindow = browser.window_handles[0]
        browser.switch_to_window(backwindow)
        browser.switch_to_default_content()
        browser.switch_to_frame('mainFrame')
        framelist = browser.find_elements_by_xpath("//frame[contains(@name, 'fr_resultsNav')]")
        framename = framelist[0].get_attribute('name')
        browser.switch_to_frame(framename)
Lines 1-5 create the necessary accumulators and start the loop. Lines 16-18 fill out the "Select Items" form, which we didn't need in part 2 (we had fewer than 500 results, so we just selected "All Documents"). Line 19 clicks the "Send" button.
As before, once we click “Send” LexisNexis will shove the results into a text file and email it. Generating that file may take a while. The more so since we are now selecting 500 results, which is a lot. It may take a whole minute or so before the “Close Window” button finally appears on the pop-up.
That’s why we need the explicit wait you see in lines 20-25 (see part 3 if this is new to you). If the “Close Window” button takes over two minutes to appear we close the pop-up by brute force (and we hope that the file with the results was generated and sent correctly).
Lines 26-30 update the accumulators and lines 31-37 take us back to the results page.
Naturally this entire loop will be inside the big loop that iterates over your searches. I won’t give you any code for that outer loop, but really it’s simpler than the inner loop above.
Here I only used an explicit wait for the “Close Window” button but of course you will want to sprinkle explicit waits whenever you feel the need (i.e., whenever your code crashes while trying to locate an element or interact with it). Review part 3 if needed.
That's about it for now. In the next post we will cover headless browsing and parallel webscraping.