installing statistical tools on a remote Linux machine

If you have lots of data to analyze you may need to use a remote machine to get the computing power you need. Amazon Web Services is a popular choice, but their machines usually come “naked” - they have the operating system and not much else. Amazon does offer some software bundles, called Amazon Machine Images (AMIs), but these are more about business tools (SQL and NoSQL applications mainly) than about scientific computing tools. So normally you need to install all the software yourself. That can be a pain; you have to rely on the command line (no clicking on .dmg or .exe files), packages have complex dependency relations, you need to figure out what compilers you need, etc. So here is a step-by-step. I’m assuming a few things: you know (or can figure out on your own) how to fire up a remote machine (say, an Amazon EC2 instance); you know how to SSH into your remote machine; you know what a terminal is; you are not afraid of Linux; you will use Python for your scientific computing.

So you have fired up your remote machine and it is running some sort of Linux - say, Ubuntu. The first step then is to update and upgrade the pre-installed packages, which are almost surely old. SSH into the instance and run the following on the terminal:

sudo apt-get update
sudo apt-get upgrade

If you skip this step you may get in trouble later on: you will try to install your packages but you will get error messages saying that the packages couldn’t be found. That’s because of outdated repository information. So, make sure you always update and upgrade everything first.

If you have never seen these commands before, don’t worry. “apt-get” is a package manager that comes pre-installed on Ubuntu. You use it to install, update, and uninstall packages. Now, Ubuntu doesn’t just let anyone do whatever they want on the machine. If you try simply “apt-get update”, without the “sudo”, you get an error message saying that you don’t have permission to do that. Installing software is a potentially risky task, as it can break the system, create security holes, etc, so only a “super user” can do it. That’s why you need to put “sudo” before “apt-get”: “sudo” stands for “super user do”; it’s a way to tell Ubuntu to treat you as a “super user” and let you run the command. On a local machine you would be prompted for your password, but on Amazon instances there is no such thing. So that’s all that’s going on here, nothing mysterious. If you are using a different Linux distribution - say, Fedora, Red Hat, or CentOS - the equivalent of “apt-get” is probably “yum”.

Next you need to install some low-level stuff that will be required by other, higher-level packages later on.

sudo apt-get install build-essential libc6 gcc gfortran

You now have the basic tools to compile packages from source. “Compile from source…?” Not to worry if you’ve never heard of this before. Many packages are easy to install: you just do “sudo apt-get install XXXXX” and that’s it, package XXXXX is installed; the binary files - i.e., the executable files - are generated automatically. Other packages, however, can’t be installed via “apt-get” and require instead that you get the source code and generate the binaries yourself. That’s what compiling is: transforming source code into executable files. To do that we need to install a compiler first. That’s what you just did above: you installed gcc, which can compile C code, and gfortran, which can compile Fortran code (and a few other helpful tools as well).

Now on to some statistical tools: BLAS and LAPACK. These are linear algebra packages. They have routines for solving linear equations, matrix factorization, least squares, etc. You won’t interact with BLAS and LAPACK directly though. They are low-level tools, intended to be used by other programs - like Python or C or R - not by humans. (Well, you can call BLAS and LAPACK routines directly, but you probably don’t want to.) You want to install BLAS and LAPACK because the programs you do interact with, like Python or R, can run faster if BLAS and LAPACK are present.

You also want to install ATLAS. ATLAS is a “bundle” of BLAS and (parts of) LAPACK but optimized for your specific machine. In other words, it’s BLAS and LAPACK but faster. Thus, in theory, we either install BLAS and LAPACK or we install ATLAS. In practice, however, some packages look specifically for BLAS/LAPACK while others look specifically for ATLAS, so it’s best to install all three.

BLAS and LAPACK first:

sudo apt-get install libblas3gf liblapack3gf

Then ATLAS. This one is actually scattered across different packages:

sudo apt-get install libatlas-base-dev
sudo apt-get install libatlas-dev
sudo apt-get install libatlas-doc
sudo apt-get install libatlas-test
sudo apt-get install libatlas3gf-base

That’s it.

If you really want to make the most out of your machine you may consider installing Intel’s Math Kernel Library (MKL) as well. It’s built on top of BLAS and LAPACK, like ATLAS, but adds many routines of its own. The benchmarks look really good, especially on multi-core machines; MKL gets increasingly faster than ATLAS the more cores you have (though there are lies, damn lies, and software performance benchmarks…).

Unlike everything else in this tutorial MKL is not open source. It costs $499 by itself, but to get it to work you also need Intel’s C++ compiler, so you will need a bundle like Intel Composer, which includes everything and costs $1,199. Licenses for non-commercial use are free though. If that applies to you then what you need to do is download the installer, run “sudo bash install.sh”, and follow the instructions.

Installing Intel Composer is straightforward, but by itself it won’t do you any good. You need to link your high-level tools - R, NumPy, etc - to MKL. I.e., you need to tell your statistical packages that they are supposed to use MKL. And there is no generic formula for that, it differs for every package and system. The only constant is that the process always involves banging your head against the wall. You need to figure out how to set the environment variables correctly so that the system can find your Intel compilers. You need to figure out what flags to use during the build phase. You need to read through dozens of pages of installation logs to figure out what went wrong. It may take days and it is not for the faint of heart.

Moving on.

Here are a few tools needed to build some Python packages:

sudo apt-get install python-dev python-nose cython

In case you are curious: python-dev contains the header files and static library needed to build C extensions for Python (several of the packages we install below include C code); python-nose is a package for testing Python code; and cython lets you write C extensions for Python in a Python-like language and mix C code into your Python code.

Now on to HDF5. HDF5 is a suite of packages that let you store data in h5 format. The h5 format is optimized for loading speed: it takes more disk space than other formats (say, csv), but on the other hand it loads from disk to memory much faster. I think the trade-off is worth it; disk space is cheap (under $0.10/GB/month on Amazon), but powerful computers aren’t (up to $4.80/hour on Amazon), so it’s better to minimize computing time than to minimize disk usage.

Moreover, as Francesc Alted alerts us, the bottleneck in scientific computing is no longer CPU power but input/output (IO) speed. Our CPUs have become much faster but our disks only a bit faster; hence there is often a huge gap between what the CPU can do and what the disk allows it to do (the “starving CPU” problem). In my own research switching to h5 files cut loading time in half. That is immaterial if your model runs in milliseconds, but in the world of big data your model may take hours or days to run, so halving the loading time makes a lot of difference.

To get HDF5 all you need to do is download the files and move them to the appropriate folders. No need to “sudo apt-get…” anything or compile anything from source. Just do this:

wget http://www.hdfgroup.org/ftp/HDF5/current/bin/linux-x86_64/hdf5-1.8.12-linux-x86_64-shared.tar.gz
tar xvfz hdf5-1.8.12-linux-x86_64-shared.tar.gz
cd hdf5-1.8.12-linux-x86_64-shared
cd bin
sudo cp * /usr/bin
cd ..
cd lib
sudo cp * /usr/lib
cd ..
cd include
sudo cp * /usr/include
cd ..
cd share
sudo cp -a * /usr/share
cd ..

In case you’ve never seen it before, the “tar” command is similar to the “unzip” command on Windows. The .tar.gz extension indicates that the file is a gzip-compressed “tarball”, and the “tar” command unpacks it. As for that “xvfz” option: x means extract, v means verbose (list the files as they are extracted), f means read from the file whose name follows, and z means decompress the gzip layer.


That’s it. There is really no actual installation involved with HDF5 here; you just download the files and move them into place. You won’t use HDF5 directly. You will install PyTables, which is the Python module that communicates with HDF5 and enables Python to handle h5 files. But PyTables is still a bit low-level, so on top of that you will install pandas, which uses PyTables to let you read and write h5 files in an easy way. More on these modules later.
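Just to give you a taste of where this is headed, here is a minimal sketch of writing and reading a DataFrame in h5 format with pandas, once pandas and PyTables are installed (the file name mydata.h5 and the key mydf are placeholders I made up):

import numpy as np
import pandas as pd

# a random DataFrame, just for illustration
df = pd.DataFrame(np.random.randn(1000000, 3), columns=['x', 'y', 'z'])

store = pd.HDFStore('mydata.h5')   # opens (or creates) an h5 file on disk
store['mydf'] = df                 # writes the DataFrame to the file
df2 = store['mydf']                # reads it back into memory
store.close()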

(Take a moment to appreciate how lucky you are. If you were on a Mac you would need to compile the source code manually, as there are no HDF5 binaries for Mac, and you would need to read through eight single-spaced pages of instructions about which compilers to use, which flags are appropriate in each case, etc.)

Let’s move on to installing Python’s own packages. The first one needs to be setuptools, as some packages will rely on it.

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
sudo python ez_setup.py

Now on to NumPy, which gives us basic math (matrix and vector operations mainly), and SciPy, which gives us advanced math (regression, optimization, distributions, advanced linear algebra, etc). There are multiple ways to install these two libraries. If you are in a hurry you can simply do

sudo apt-get install python-numpy python-scipy

But that’s bad: you get old versions of NumPy and SciPy. NumPy is currently at version 1.8.0 but with apt-get you get version 1.6.1. SciPy is currently at version 0.13 but with apt-get you get version 0.9. You miss bug fixes and new functionality, and you risk conflicts with other packages that rely on newer versions of NumPy and SciPy.
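If you are not sure which versions you currently have, you can check from the command line (this works for any package that exposes a __version__ attribute, which NumPy and SciPy both do):

python -c "import numpy; print(numpy.__version__)"
python -c "import scipy; print(scipy.__version__)"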

So what I recommend is downloading the source code for NumPy and SciPy and using setup.py to install them. In fact you can install practically any Python package this way. Here is the generic formula:

wget http://www.url_of_the_source_code/sourcecode.tar.gz
tar xvfz sourcecode.tar.gz
cd sourcecode
python setup.py build # "sudo" is usually not necessary to build the package; in fact, it can break the build
sudo python setup.py install

And that’s it, almost any Python package can be installed this way. I would recommend creating a “src” directory first and doing everything in there, to avoid polluting your home directory.

mkdir src
cd src

Now, that was only the generic formula; here is what you need to do to install the current versions of NumPy and SciPy:

# NumPy first
wget https://pypi.python.org/packages/source/n/numpy/numpy-1.8.0.tar.gz
tar xvfz numpy-1.8.0.tar.gz
cd numpy-1.8.0
python setup.py build
sudo python setup.py install
cd ..

# then SciPy
wget https://pypi.python.org/packages/source/s/scipy/scipy-0.13.2.tar.gz
tar xvfz scipy-0.13.2.tar.gz
cd scipy-0.13.2
python setup.py build
sudo python setup.py install
cd ..

And voilà, you have NumPy and SciPy.
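One quick sanity check before moving on: numpy.show_config() (and its SciPy counterpart, scipy.show_config()) prints the build configuration, so you can see whether the BLAS/LAPACK/ATLAS libraries we installed earlier were actually picked up during the build. The exact entries you see will vary from machine to machine:

python -c "import numpy; numpy.show_config()"
python -c "import scipy; scipy.show_config()"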

Now, here is the thing: if you want to use Intel’s MKL then the procedure above won’t work. I mean, NumPy and SciPy will be installed, but they will not be linked to MKL. If you want to use MKL then you will need to make some changes to some of the files in the numpy-1.8.0 and scipy-0.13.2 directories before you do “python setup.py build”. What changes? Well, that’s the million-dollar question. The changes are intended to set what compiler to use, what optimization level, where your MKL/BLAS/LAPACK/ATLAS libraries are, etc. A lot depends on the details of your machine and system, so I can’t give you a step-by-step here. There are Intel directions and NumPy directions about how to do all that, but those directions just won’t work. As I said before, linking to MKL invariably involves much aggravation.

Ok, moving on to other packages.

Of all the Python packages we are going to install, only two are best installed by other methods. Let’s get them out of the way now:

# scikit-learn (Python's machine-learning package)
sudo easy_install -U scikit-learn

# pytz (handles time zones)
wget https://pypi.python.org/packages/3.3/p/pytz/pytz-2013.8-py3.3.egg
sudo easy_install pytz-2013.8-py3.3.egg

For all other packages we just follow the formula:

# numExpr (optimized array operations and support for multithreaded operations)
# these instructions don't apply if you want MKL; same as with NumPy and SciPy: must link numExpr explicitly
wget https://numexpr.googlecode.com/files/numexpr-2.2.2.tar.gz
tar xvfz numexpr-2.2.2.tar.gz
cd numexpr-2.2.2
python setup.py build
sudo python setup.py install
cd ..

# Bottleneck (optimized NumPy functions, written in Cython)
wget https://pypi.python.org/packages/source/B/Bottleneck/Bottleneck-0.7.0.tar.gz
tar xvfz Bottleneck-0.7.0.tar.gz
cd Bottleneck-0.7.0
python setup.py build
sudo python setup.py install
cd ..

# PyTables (reads and writes h5 files)
wget http://downloads.sourceforge.net/project/pytables/pytables/3.0.0/tables-3.0.0.tar.gz
tar xvfz tables-3.0.0.tar.gz
cd tables-3.0.0
python setup.py build
sudo python setup.py install
cd ..

# dateutil (extensions to Python's datetime module)
wget https://pypi.python.org/packages/source/p/python-dateutil/python-dateutil-2.2.tar.gz
tar xvfz python-dateutil-2.2.tar.gz
cd python-dateutil-2.2
python setup.py build
sudo python setup.py install
cd ..

# pandas (data management tools and matrix operations)
wget https://pypi.python.org/packages/source/p/pandas/pandas-0.12.0.tar.gz
tar xvfz pandas-0.12.0.tar.gz
cd pandas-0.12.0
python setup.py build
sudo python setup.py install
cd ..

# patsy (creates "formulas" in an R-like way)
wget https://pypi.python.org/packages/source/p/patsy/patsy-0.2.1.tar.gz
tar xvfz patsy-0.2.1.tar.gz
cd patsy-0.2.1
python setup.py build
sudo python setup.py install
cd ..

# statsmodels (exploratory data analysis and some statistical tools)
wget https://pypi.python.org/packages/source/s/statsmodels/statsmodels-0.5.0.tar.gz
tar xvfz statsmodels-0.5.0.tar.gz
cd statsmodels-0.5.0
python setup.py build
sudo python setup.py install
cd ..

These packages have dependency relations, so order matters here. For instance, pandas depends on NumPy, pytz, numExpr, dateutil, Bottleneck, and PyTables; PyTables depends on the HDF5 files we downloaded before; NumPy and SciPy depend on the BLAS/LAPACK/ATLAS packages we installed in the beginning; statsmodels depends on patsy; and so on. Stick to the order you see above.

It’s always a good idea to test each Python package after the installation. Most of the above packages have a “.test()” method. You can use it like this:

python -c "import package_name; package_name.test()"

(You need to leave the installation directory - the directory where the “setup.py” file is - first or you get an error.)
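For instance, to test NumPy and SciPy you would run:

python -c "import numpy; numpy.test()"
python -c "import scipy; scipy.test()"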

The “.test()” method runs hundreds (sometimes thousands) of automated tests. Invariably a few of them will fail or at the very least there will be warnings. Pay attention to the failures and warnings and try to figure out whether they are related to methods or classes that you might need in your work. If you are unsure, design and run a few tests of your own, to see if you should expect any problems later. If you do run into issues then Stackoverflow and Google are your friends. If you are impatient, try installing a slightly older version of the package (although of course that may create problems of its own).

That’s it, you are now ready to start working. If you are using Amazon EC2 you can save everything you just did as an Amazon Machine Image (AMI). That way you don’t have to do everything again when you launch a new EC2 instance.

If you found all this too much work, you may want to look into Anaconda or Enthought Canopy. These are bundles that include everything I discussed above in a self-contained environment, so you don’t need to install anything; the tools are already there, even the non-Python packages like MKL (and yes, everything is already linked to MKL), so you can just use them out-of-the-box.

Unfortunately Anaconda and Canopy have their own problems. I used Anaconda for a bit but after an OS upgrade MKL just stopped working altogether (and I couldn’t disable it, so every time NumPy or pandas called MKL my code broke). Canopy, on the other hand, is not well-suited for the command-line interface, despite what the documentation might lead you to believe. I tried to set up a user environment on two different remote machines (one Ubuntu and one Red Hat) and all I got was error messages about missing libraries that I couldn’t find anywhere. So in the end I got fed up and decided to do things the hard way.

Parallella

“Packing impressive supercomputing power inside a small credit card-sized board running Ubuntu, Adapteva’s $99 ARM-based Parallella system includes the unique Epiphany numerical accelerator that promises to unleash industrial strength parallel processing on the desktop at a rock-bottom price.”

From here.

Sounds exciting. And it supports Python.

ggplot for Python

Here.

Another step in the Pythonification of the world.

"I've been using Python for 3 years and I've never defined a class"

Here.

I guess that’s the case for many of us who use general-purpose languages (like Python) to do math. When I only knew R and Stata my typical script was a bunch of math expressions, one after the other, some of them put inside functions. And that seemed to work fine. Then I started learning Python and noticed that that’s not what textbook scripts look like. They don’t have functions hanging around by themselves; everything is organized around classes and subclasses.

Realizing the huge gap between the scripts I wrote and the ones in textbooks was a bit demoralizing. Edsger Dijkstra once famously said that “It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.” I thought that maybe R and Stata were my BASIC, that learning them had destroyed any chance that I would ever write decent code. (Though I had also learned some BASIC, as a kid, so perhaps I was already mentally mutilated by the time I encountered R and Stata.)

(This may have impaired my programming skills but I enjoyed tweaking Nibbles. And using goto)

Turns out I made too much of it. Yes, my scripts were ugly, and I shouldn’t let things happen outside functions, and I shouldn’t have global variables. I firmly intend to go back and refactor my Wordscores and “Fightin’ Words” implementations. But if you are doing math and your script is small and you won’t integrate it into a larger project, then not everything needs to be organized around classes.

As someone in the thread above noted, writing math code is fundamentally different from writing a desktop application:

In a program designed to calculate a result, an input gets heavily processed through several phases of computation. It might be useful to define classes here, and it might not: I could see cases where heterogenous input needed to be processed along a common axis, but with the heterogenous identity preserved. Named tuples or classes might serve this goal well. But the concept of the program is essentially a stream or blob of data that moves through a processing pipeline until an intermediate or final result is produced. In a program designed to provide a set of interactive workflows, you can envision the data instead as a set of related data objects – documents perhaps, or account credentials, or card catalog records, or all of the above – upon which the program performs operations at the request of the user. The objects are like a lattice of interrelated data, waiting to be tickled or touched in exactly the right way. Classes fit this paradigm perfectly.

We also don’t get many opportunities to define our own classes when we are doing math. Need a matrix? There is the NumPy array class. Need a time series? There is the pandas Series class. Need a matrix that does relational joins fast? There is the pandas DataFrame class. Math-wise there is a lot of work already done for us. It would be silly to reinvent the wheel. True, it would be great if someone created a sparse matrix class that allowed relational joins (neither scipy.sparse nor pandas.SparseDataFrame do). But defining classes cannot be an end in itself.

Maybe that applies to non-math code as well.

Granted, using a language like Python or Ruby without defining classes may feel weird. It feels like we are not doing actual object-oriented programming. And we are not quite doing functional programming either, since we mutate data, our functions have side-effects, and most of us don’t use lambda calculus. So we are in a sort of programming limbo. Perhaps we should just embrace the functional paradigm wholeheartedly and switch to Haskell or Scheme. Alas, that is hard to do after learning to love Python or Ruby, what with their intuitive syntaxes and extensive libraries, and so many people to answer our questions on StackOverflow. So maybe we should just learn to love the limbo.

webscraping with Selenium - part 5

This is the final part of our Selenium tutorial. Here we will cover headless browsing and parallel webscraping.

why switch to headless browsing

While you are building and testing your Selenium script you want to use a conventional browser like Chrome or Firefox. That way you can see whether the browser is doing what you want it to do. You can see it clicking the buttons, filling out the text fields, etc. But once you are ready to release your bot into the wild you should consider switching to a headless browser. Headless browsers are browsers that you don’t see. All the action happens in the background; there are no windows.

Why would anyone want that? Two reasons.

First, you decrease the probability of errors. Your bot will be doing only one thing: interacting with the HTML code behind the website. Opening windows and rendering content can always result in errors, so why do these things now that your script is functional? This is especially the case if you are parallelizing your webscraping. You don’t want five or six simultaneous Chrome windows. That would make it five or six times more likely that you get a graphics-related error.

Second, it makes it easier to use remote computers - like Amazon EC2, Google Compute Engine, or your university’s supercomputing center. Remote computers have no displays; the standard way to use them is via the command-line interface. To use software that has a graphical user interface, like Chrome or Firefox, you need workarounds like X11 forwarding and virtual displays. Since a headless browser doesn’t have a graphical user interface you don’t need to worry about any of that. So why complicate things?

I know, it sounds weird to not see the action. But really, you’ve built and tested your script, you know it works, so now it’s time to get rid of the visual. In the words of Master Kenobi, “stretch out with your feelings”.

using a headless browser

There are several headless browsers, like ZombieJS, EnvJS, and PhantomJS (they tend to be written in JavaScript - hence the JS in the names). PhantomJS is the most popular, so that’s my pick (the more popular the tool the more documented and tested it is and the more people can answer your questions on StackOverflow).

Download PhantomJS and save the binary to the same folder where you downloaded chromedriver before. That binary contains both the browser itself and the Selenium driver, so there is no need to download a “phantomjsdriver” (actually they call it ghostdriver; it used to be a separate thing but now it’s just embedded in the PhantomJS binary, for your convenience).

You can start PhantomJS the same way you start Chrome.

path_to_phantomjs = '/Users/yourname/Desktop/phantomjs' # change path as needed
browser = webdriver.PhantomJS(executable_path = path_to_phantomjs)

But you don’t want that. Websites know which browser you are using. And they know that humans use Chrome, Firefox, Safari, etc. But PhantomJS? ZombieJS? Only bots use these. Hence many websites refuse to deal with headless browsers. We want the website to think that we are using something else. To do that we need to change the user-agent string.

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
     "(KHTML, like Gecko) Chrome/15.0.87")

Now we start PhantomJS but using the modified user-agent string.

path_to_phantomjs = '/Users/yourname/Desktop/phantomjs' # change path as needed
browser = webdriver.PhantomJS(executable_path = path_to_phantomjs, desired_capabilities = dcap)

And that’s it. The rest of your code doesn’t change. The only difference is that when you run your script you won’t see any windows popping up.

parallel webscraping

If you have tons of searches to do you may want to launch multiple instances of your code and have them run in parallel. Webscraping is an embarrassingly parallel problem, since no communication is required between the tasks. You don’t even need to write a parallel script: you can simply copy and paste your script into n different files and have each one do 1/n of the searches. The math is simple: if you divide the work by two you get done in half the time, if you divide the work by ten you get done in a tenth of the time, and so on.
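If copying the script n times sounds too manual, here is a minimal sketch of the same idea using a command-line argument to tell each copy which share of the searches to do. The names searches.txt and do_search() are made up for illustration - replace them with whatever holds your search terms and whatever your own Selenium routine is called:

import sys
import time
import random

n_bots = 4                       # how many copies of this script you will launch
bot_id = int(sys.argv[1])        # 0, 1, 2, or 3 - passed when you launch each copy

searches = [line.strip() for line in open('searches.txt')]  # hypothetical input file
my_share = searches[bot_id::n_bots]   # every n-th search term, starting at bot_id

for term in my_share:
    do_search(term)                     # your existing Selenium routine (hypothetical name)
    time.sleep(random.uniform(10, 30))  # proper delays between searches (see part 4)

You would then launch, say, four copies: python scraper.py 0, python scraper.py 1, and so on.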

That is all very tempting and our first instinct is to parallelize as much as possible. We want to launch hundreds or thousands of simultaneous searches and be done in only a fraction of the time. We want to build and deploy a big, scary army of bots.

But you shouldn’t do that, both for your own sake and for the sake of others.

First of all, you don’t want to disrupt the website’s operation. Be especially considerate with websites that normally don’t attract a lot of visitors. They don’t have that many servers, so if you overparallelize you can cause the whole thing to shut down. This is really bad karma. Remember that other human beings also need that website - to make a living (if they are the ones running it) or for their own research needs (people like you). Be nice.

Second, if you overparallelize your bots may lose their human façade. Humans can only handle a couple of simultaneous searches, so if the webadmins see 300 simultaneous searches coming from the same IP address they will know that something is off. So even if you don’t bring the website down you can still get in trouble. You worked hard to put your Selenium script together, you don’t want your bots to lose their cover now.

So, proceed gently. How many parallel bots is safe? I have no idea. That depends entirely on the website, on how many IPs you are using, on whether you have proper delays between your searches (see part 4), etc. As a general rule I wouldn’t have more than three or four parallel bots per IP address (because that’s what a human being could manage).

Multiple IPs don’t mean you are safe though. If those IP addresses are all from the same place - say, your campus - then you are not fooling anyone. Suppose your university subscribes to LexisNexis and that, on average, about 500 people use it every day on campus. Then you deploy 500 parallel bots. The LexisNexis webadmins will see a sudden 100% increase in traffic coming from your campus. It doesn’t matter whether you are using 1 or 500 IP addresses.

If you get caught things can get ugly. LexisNexis, for instance, may cut off your entire university if they catch even one person webscraping. Imagine being the person who caused your entire university to lose access to LexisNexis. Or to Factiva. Or to any other major database on which thousands of people in your campus rely for their own research needs. Bad, bad karma.

By the way, LexisNexis’ terms of use explicitly prohibit webscraping; check item 2.2. In practice, though, three or four parallel bots with proper delays between searches (see part 4) are indistinguishable from a human user. Remember: Selenium doesn’t connect to the site directly but through the browser, so all that LexisNexis sees is Chrome or Firefox or Safari. Overparallelize, however, and you may well become that person who got the entire university cut off.

(A quick rant: item 2.2 is immoral. Universities pay millions of dollars to LexisNexis in subscriptions but then when we really need to use the content they say we can’t? That’s like selling 1000 gallons of water but delivering only one drop at a time. This is especially outrageous now that we have the hardware and software to text-mine millions of texts. The optimal answer here is buyers’ collusion but I don’t see that happening. So I see our webscraping LexisNexis as a legitimate act of civil disobedience.)

In sum, parallelizing can speed things up but you need to be careful and mindful of others when doing it.


This is it. I hope you have found this tutorial useful. There is a lot I haven’t covered, like interacting with hidden elements and executing arbitrary JavaScript code, but at this point you can figure out anything by yourself.

I wrote this tutorial because even though Selenium is perhaps the most powerful webscraping tool ever made it’s still somewhat unknown to webscrapers. If you found it useful as well then help spread the word.

Comments are always welcome.

Happy webscraping!