Building A Concurrent Web Scraper With Haskell

Written March 10, 2012

updated: April 16, 2012

Introduction

Let's make a concurrent web scraper! We will use Haskell, because it allows easy concurrency. We will use the HXT library to do the scraping. If you want to follow the HXT bits, you should be comfortable with Arrows in Haskell. If you're not, take a moment to read up on Arrows.

If you don't care about the scraping bits, jump straight to the concurrency section.

Installation

Make sure you have the hxt, url and http packages:

cabal install hxt
cabal install url
cabal install http
cabal install maybet

Basic Setup

First, let's write some basic functions to make life easier for ourselves:

openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)

I say basic because they will be our building blocks, not because they are easy :P Let's see how they work.

openUrl is a function that will download a web page for us. It returns a MaybeT Monad Transformer. We can use it like

contents <- runMaybeT $ openUrl "http://example.com"

and contents will be a Just if the operation was successful, or Nothing otherwise.

css will allow us to use css selectors on the downloaded page.

get is where things get interesting. First, we download a page using

contents <- runMaybeT $ openUrl url

like we talked about. Next, we parse the page using HXT:

readString [withParseHTML yes, withWarnings no] contents

readString takes some options as its first parameter:

withParseHTML: Parse as HTML, which makes sure the parser doesn't break on things like the doctype.
withWarnings: Prints out warnings about malformed html if it's switched on. Since so much of the web is malformed html, I switched it off :P

Now we are ready to start.

Image Downloader

Let's write something that downloads all the images from a given page.

First, let's get a parsed page:

main = do
    page <- get "http://www.reddit.com/r/pics"

page is now an Arrow. We can run this Arrow at any time to get its value by using runX. Let's try it now:

ghci>runX page
[NTree (XTag "/" [NTree (XAttr "transfer-Status") [NTree (XText "200") []],NTree (XAttr "transfer-Message") [NTree (XText "OK") []],NTree (XAttr "transfer-URI") [NTree (XText "string:") []],NTree (XAttr "source") [NTree (XText "\"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD XHTML 1.0 ...\"") []],NTree (XAttr "transfer-Encoding") [NTree (XText "UNICODE") []],NTree (XAttr "doctype-name")
(...many lines skipped...)

Wow, that looks confusing. Let's select only what we want. Get just the images:

ghci>runX $ page >>> css "img"
[NTree (XTag "img" [NTree (XAttr "id") [NTree (XText "header-img") []],NTree (XAttr "src") [NTree (XText "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png") []]
...

Aha! Much nicer. Now let's get just the src's:

ghci>runX $ page >>> css "img" >>> getAttrValue "src"
["http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png","http://pixel.redditmedia.com/pixel/of_defenestration.png"
...

Done! That was easy. Now all we need to do is download these images and save them to disk.

main = do
  url <- parseArgs
  doc <- get url
  imgs <- runX . images $ doc
  sequence_ $ map download imgs

The first three lines of our main function get a list of links. The images function is very simple:

images tree = tree >>> css "img" >>> getAttrValue "src"

It gets a list of all the image sources, just like we had talked about.

The fourth lines maps the download function over this list to create a list of IO actions. Then we feed that list into sequence_, which runs the actions one at a time and throws away the return values. We could have used sequence instead, which would have printed the return values.

Here's the download function:

download url = do
  content <- runMaybeT $ openUrl url
  case content of
       Nothing -> putStrLn $ "bad url: " ++ url
       Just _content -> do
          let name = tail . uriPath . fromJust . parseURI $ url
          B.writeFile name (B.pack _content)

We have to write out binary data, so we use the writeFile defined in Data.ByteString.Char8, which operates on ByteStrings. This is why we need to convert our String to a ByteString first using B.pack. We are also able to do error checking thanks to our openUrl function being a MaybeT. If we didn't get any content, we just print out "bad url: [url]". Otherwise we download the image.

Concurrency

After all that work, the concurrent bit seems almost anti-climactic.

First, install the parallel-io package:

cabal install parallel-io

Import it into the script:

import Control.Concurrent.ParallelIO

ParallelIO defines a new function called parallel_ which we can use anywhere we would have used sequence_. The IO actions will then get performed concurrently. Change the end of the script to this:

...
imgs <- runX . images $ doc
parallel_ $ map download imgs
stopGlobalPool

stopGlobalPool needs to be called after the last use of a global parallelism combinator. It cleans up the thread pool before shutdown.

Now build the concurrent version (enabling runtime system options):

$ ghc --make grabber_par.hs -threaded -rtsopts

And run it with +RTS -N[number of threads]:

$ ./grabber_par +RTS -N4

Results

Here's how the two versions performed on my machine:

Without parallelization:

$ time ./grabber "http://www.reddit.com/r/pics"

real  0m10.341s
user  0m0.203s
sys  0m0.048s

With parallelization (four threads):

$ time ./grabber_par "http://www.reddit.com/r/pics" +RTS -N4

real  0m3.490s
user  0m0.477s
sys  0m0.154s

Almost a third of the time!

The ParallelIO library uses MVars to keep things in sync. Read more about ParallelIO or MVars.

Next Steps

Next steps involve writing this as a crawler that visits links on the page up to a depth of N as well as some way to keep track of visited pages. We also want to keep track of name collisions. If you try to download two images, both named "test.jpg", the concurrent version will error out. The non-concurrent version would just overwrite one image with another, which isn't any good either. On the crawling side, we should watch out for robots.txt files and META tag directives to be polite. And ask for gzip'd data to reduce request time.

We could also parallelize more than just the download, but its a start!

Full Code Listing

import qualified Data.ByteString.Char8 as B
import Data.Tree.NTree.TypeDefs
import Data.Maybe
import Text.XML.HXT.Core
import Control.Monad
import Control.Monad.Trans
import Control.Monad.Maybe
import Network.HTTP
import Network.URI
import System.Environment
import Control.Concurrent.ParallelIO

-- helper function for getting page content
openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)

images tree = tree >>> css "img" >>> getAttrValue "src"

parseArgs = do
  args <- getArgs
  case args of
       (url:[]) -> return url
       otherwise -> error "usage: grabber [url]"

download url = do
  content <- runMaybeT $ openUrl url
  case content of
       Nothing -> putStrLn $ "bad url: " ++ url
       Just _content -> do
          let name = tail . uriPath . fromJust . parseURI $ url
          B.writeFile name (B.pack _content)

main = do
  url <- parseArgs
  doc <- get url
  imgs <- runX . images $ doc
  parallel_ $ map download imgs
  stopGlobalPool

Contents