How to create a simple parallelized web scraper using Ruby and Async
Web scraping is a fascinating corner of automation, and Ruby makes it even more fun thanks to its rich library ecosystem. Today we're going to explore how to build a simple web scraper using Async in Ruby.
Installing
First of all, you will need the async and async-http gems. You can install both from the command line:
gem install async async-http
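If your project uses Bundler, you can instead add both gems to the Gemfile and run bundle install:

gem 'async'
gem 'async-http'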
Getting Started with the Web Scraper
Let's get to the action. Here is the core code snippet:
require 'async'
require 'async/barrier'
require 'async/semaphore'
require 'async/http/internet'
require 'uri'

URLS = [
  "https://forum.rubyonrails.pl/",
  "https://forum.rubyonrails.pl/t/jak-zaczac-przygode-z-ruby-on-rails/18"
]

Async do |task|
  internet  = Async::HTTP::Internet.new
  barrier   = Async::Barrier.new
  # Allow at most 8 requests to be in flight at the same time.
  semaphore = Async::Semaphore.new(8, parent: barrier)

  titles = []
  urls   = []

  URLS.each do |url|
    semaphore.async do
      # Give each request at most 3 seconds to finish.
      task.with_timeout(3) do
        response = internet.get(URI.parse(url)).read
        # Pull the page title and every linked URL out of the raw HTML.
        titles << response.scan(/<title>([^<]+)<\/title>/).first.first
        urls.concat(response.scan(/<a href="([^"]+)"/).flatten)
      end
    end
  end

  # Wait until every task spawned through the semaphore has finished.
  barrier.wait

  puts "#### TITLES ####"
  puts titles
  puts "#### URLS ####"
  puts urls.uniq.sort
rescue Async::TimeoutError
  puts "Timeout"
ensure
  internet&.close
end
To scrape a website, we start by requiring async, followed by the specific modules we'll be using: Barrier, Semaphore and HTTP::Internet (plus uri from the standard library, used for URI.parse). Then we define the list of URLs we will be scraping. The Async block is the heart of the code. Inside it, we first open an HTTP session with Async::HTTP::Internet.new. We then create an Async::Barrier and an Async::Semaphore, which together let us cap the number of concurrent requests (eight at a time here) and wait for all of them to finish.
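If the barrier/semaphore pairing is new to you, here is a tiny standalone sketch, separate from the scraper and with made-up numbers, showing how the semaphore caps concurrency while the barrier lets us wait for everything spawned through it:

require 'async'
require 'async/barrier'
require 'async/semaphore'

Async do
  barrier   = Async::Barrier.new
  # At most 2 of the 6 tasks below run at the same time.
  semaphore = Async::Semaphore.new(2, parent: barrier)

  6.times do |i|
    semaphore.async do
      sleep(0.5) # non-blocking inside Async on Ruby 3 with async 2.x
      puts "task #{i} finished"
    end
  end

  # Blocks until every task spawned through the semaphore is done.
  barrier.wait
end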
We loop through the URLS array and create an asynchronous task for each URL. Within each task, we set a timeout of 3 seconds so the script doesn't hang if a page takes too long to respond.
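On its own, the timeout behaviour looks roughly like this (a minimal sketch, with the slow work simulated by sleep):

require 'async'

Async do |task|
  task.with_timeout(3) do
    sleep 5 # non-blocking inside Async on Ruby 3; slower than 3 seconds, so the timeout fires
    puts "you will never see this"
  end
rescue Async::TimeoutError
  puts "Timeout"
end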
We then make the HTTP request to each page using the internet.get method and call read on the response to get the raw HTML, which we parse for the page title and for all the URLs linked from that page.
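The parsing itself is just String#scan with two simple regular expressions. Here is what they return against a tiny hand-written HTML snippet (example markup, not fetched from anywhere); for anything beyond a demo you would probably reach for a proper HTML parser such as Nokogiri:

html = '<html><head><title>Ruby Forum</title></head>' \
       '<body><a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a></body></html>'

puts html.scan(/<title>([^<]+)<\/title>/).first.first
# => Ruby Forum
puts html.scan(/<a href="([^"]+)"/).flatten
# => https://example.com/a
#    https://example.com/b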
When all the requests are complete, we print the retrieved page titles and URLs to the console. If any request takes longer than 3 seconds, an Async::TimeoutError is raised: the rescue clause reports the Timeout, and the ensure clause makes sure the HTTP session is closed either way.
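One thing worth noting: an exception raised inside a task propagates up through barrier.wait, so with the snippet above a single timed-out page makes the whole script bail out to the outer rescue. If you would rather skip only the slow page and keep scraping the rest, a small variation (just a sketch, reusing the task, semaphore, internet and titles variables from the main snippet) is to rescue the timeout inside each task:

URLS.each do |url|
  semaphore.async do
    task.with_timeout(3) do
      response = internet.get(URI.parse(url)).read
      titles << response.scan(/<title>([^<]+)<\/title>/).first.first
    end
  rescue Async::TimeoutError
    # Only this page is skipped; the other tasks carry on.
    puts "Timeout while fetching #{url}"
  end
end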
Summary
As you can see, creating a web scraper with Ruby and Async is efficient and elegant thanks to the asynchronous nature of the tasks. The Async library gives Ruby strong concurrency support, making it a solid choice for I/O-heavy work such as web scraping.
Happy scraping!