How to create a simple parallelized web scraper using Ruby and Async
Web scraping is a fascinating corner of automation, and Ruby makes it even more fun thanks to its rich library ecosystem. Today we're going to explore how to build a simple web scraper using Async in Ruby.
Installing
First of all, you will need the async and async-http gems. You can install both from the command line:
gem install async async-http
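If your project uses Bundler, you can instead add both gems to the Gemfile and run bundle install:

gem 'async'
gem 'async-http'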
Getting Started with the Web Scraper
Let's get to the action. Here is the core code snippet:
require 'async'
require 'async/barrier'
require 'async/semaphore'
require 'async/http/internet'
require 'uri'

URLS = [
  "https://forum.rubyonrails.pl/",
  "https://forum.rubyonrails.pl/t/jak-zaczac-przygode-z-ruby-on-rails/18"
]

Async do |task|
  internet  = Async::HTTP::Internet.new
  barrier   = Async::Barrier.new
  # Allow at most 8 requests to be in flight at the same time.
  semaphore = Async::Semaphore.new(8, parent: barrier)

  titles = []
  urls   = []

  URLS.each do |url|
    semaphore.async do
      # Give each request at most 3 seconds to finish.
      task.with_timeout(3) do
        response = internet.get(URI.parse(url)).read
        # Pull the page title and every linked URL out of the raw HTML.
        titles << response.scan(/<title>([^<]+)<\/title>/).first.first
        urls.concat(response.scan(/<a href="([^"]+)"/).flatten)
      end
    end
  end

  # Wait until every task spawned through the semaphore has finished.
  barrier.wait

  puts "#### TITLES ####"
  puts titles
  puts "#### URLS ####"
  puts urls.uniq.sort
rescue Async::TimeoutError
  puts "Timeout"
ensure
  internet&.close
end
To scrape a website, we start by requiring async, followed by the specific modules we'll be using: Barrier, Semaphore and HTTP::Internet (plus uri from the standard library, used for URI.parse). Then we define the list of URLs we will be scraping. The Async block is the heart of the code. Inside it, we first open an HTTP session with Async::HTTP::Internet.new. We then create an Async::Barrier and an Async::Semaphore, which together let us cap the number of concurrent requests (eight at a time here) and wait for all of them to finish.
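If the barrier/semaphore pairing is new to you, here is a tiny standalone sketch, separate from the scraper and with made-up numbers, showing how the semaphore caps concurrency while the barrier lets us wait for everything spawned through it:

require 'async'
require 'async/barrier'
require 'async/semaphore'

Async do
  barrier   = Async::Barrier.new
  # At most 2 of the 6 tasks below run at the same time.
  semaphore = Async::Semaphore.new(2, parent: barrier)

  6.times do |i|
    semaphore.async do
      sleep(0.5) # non-blocking inside Async on Ruby 3 with async 2.x
      puts "task #{i} finished"
    end
  end

  # Blocks until every task spawned through the semaphore is done.
  barrier.wait
end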
We loop through the URLS array and create an asynchronous task for each URL. Within each task, we set a timeout of 3 seconds so the script doesn't hang if a page takes too long to respond.
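On its own, the timeout behaviour looks roughly like this (a minimal sketch, with the slow work simulated by sleep):

require 'async'

Async do |task|
  task.with_timeout(3) do
    sleep 5 # non-blocking inside Async on Ruby 3; slower than 3 seconds, so the timeout fires
    puts "you will never see this"
  end
rescue Async::TimeoutError
  puts "Timeout"
end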
We then make the HTTP request to each page using the internet.get method and call read on the response to get the raw HTML, which we parse for the page title and for all the URLs linked from that page.
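The parsing itself is just String#scan with two simple regular expressions. Here is what they return against a tiny hand-written HTML snippet (example markup, not fetched from anywhere); for anything beyond a demo you would probably reach for a proper HTML parser such as Nokogiri:

html = '<html><head><title>Ruby Forum</title></head>' \
       '<body><a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a></body></html>'

puts html.scan(/<title>([^<]+)<\/title>/).first.first
# => Ruby Forum
puts html.scan(/<a href="([^"]+)"/).flatten
# => https://example.com/a
#    https://example.com/b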
When all the requests are complete, we print the retrieved page titles and URLs to the console. If any request takes longer than 3 seconds, an Async::TimeoutError is raised: the rescue clause reports the Timeout, and the ensure clause makes sure the HTTP session is closed either way.
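One thing worth noting: an exception raised inside a task propagates up through barrier.wait, so with the snippet above a single timed-out page makes the whole script bail out to the outer rescue. If you would rather skip only the slow page and keep scraping the rest, a small variation (just a sketch, reusing the task, semaphore, internet and titles variables from the main snippet) is to rescue the timeout inside each task:

URLS.each do |url|
  semaphore.async do
    task.with_timeout(3) do
      response = internet.get(URI.parse(url)).read
      titles << response.scan(/<title>([^<]+)<\/title>/).first.first
    end
  rescue Async::TimeoutError
    # Only this page is skipped; the other tasks carry on.
    puts "Timeout while fetching #{url}"
  end
end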
Summary
As you can see, creating a web scraper with Ruby and Async is efficient and elegant thanks to the asynchronous nature of the tasks. The Async library gives Ruby strong concurrency support, making it a solid choice for I/O-heavy work such as web scraping.
Happy scraping!