How to Scrape Yellow Pages with the Web Scraper Chrome Extension

I recently started looking for free web scraping software. A Google search led me to several Google Chrome extensions for web scraping. The first one I landed on in the Chrome Web Store was Screen Scraper. I tested it with the Yellow Pages website as the target. After exploring it thoroughly, I found that it can scrape data from only one page. Disappointing! I searched again and found another extension called Scraper. Once again I was disappointed to learn that it cannot scrape data from multiple web pages at a time.

Finally, I found a nice web scraping extension in the Chrome Web Store named Web Scraper. Web Scraper is a Chrome browser extension built for extracting data from web pages, and it can extract data from multiple pages. With it you create a plan (a sitemap) that specifies how a website should be traversed and what should be extracted. Web Scraper then navigates the site according to the sitemap and extracts the required data. The scraped data can later be exported as CSV (comma-separated values).

Let’s explore the extension in detail with an example.

1. Install the Web Scraper Extension

You can download and install the Web Scraper extension from the official Chrome Web Store here. After installation, restart Chrome to make sure the extension is fully loaded. Web Scraper is integrated into Chrome Developer Tools. To open it, follow the screenshot…


2. Visit the source website

Suppose we want to extract data from Yellow Pages. Let’s open the Yellow Pages page from which we want to collect leads.


3. Create Sitemap

In Web Scraper, a sitemap is a plan, or traversal map, for collecting data from a website. Click the “Create new sitemap” option. The first thing to do when creating a sitemap is to specify the sitemap name and the start URL.

If the website uses numbering in its page URLs, it is easier to specify a number range in the start URL than to create a link selector that navigates to the next page.
Range syntax: [start-end:increment]
Here, the increment is optional.
Example: http://example.com/page/[1-5]
The above example scrapes the first five pages of the website.
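
To make the range syntax concrete, here is a minimal Python sketch of how such a start URL could be expanded into individual page URLs. It is not Web Scraper's own code: the function name expand_range_url is made up for illustration, the end of the range is assumed to be inclusive (which matches the five-page example above), and any zero padding in the numbers is not preserved.

```python
import re

# Matches one range placeholder such as [1-5] or [0-100:10].
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_range_url(url):
    """Expand the first [start-end:increment] placeholder into a list of URLs.

    Hypothetical helper: it only mimics the range syntax described above.
    """
    match = RANGE_PATTERN.search(url)
    if match is None:
        # No placeholder, so the URL stands for a single page.
        return [url]
    start = int(match.group(1))
    end = int(match.group(2))
    # The increment is optional and defaults to 1.
    step = int(match.group(3)) if match.group(3) else 1
    if step < 1:
        raise ValueError("increment must be at least 1")
    prefix = url[:match.start()]
    suffix = url[match.end():]
    # The end value is treated as inclusive.
    return [prefix + str(n) + suffix for n in range(start, end + 1, step)]


if __name__ == "__main__":
    for page_url in expand_range_url("http://example.com/page/[1-5]"):
        print(page_url)
```

Running it lists the five page URLs, from http://example.com/page/1 to http://example.com/page/5.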

After specifying the sitemap name and start URL, click the save sitemap button. You will then be shown the selectors screen.

4. Create Selectors

Selectors, as the name suggests, pick out the HTML elements that contain the data to extract and the links to navigate. They are based on CSS selectors, and you can also enter a selector manually while creating it. First, determine which HTML element will be the parent element. The parent element is the smallest HTML element that contains all the data items we want to collect (in our case, the element that holds all the details of one listing, i.e. one data row: business name, phone, email, website, etc.). With the help of Chrome Developer Tools (Ctrl+Shift+I), we can easily determine the parent element.

Select the top-left tool in the developer tools and click the parent element on the page. Inspecting the highlighted HTML code shows that the parent element is the div element with the class name “cell in-area-cell middle-cell”. The following screenshot shows the element I selected:
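
If you prefer to type the selector by hand, the class attribute has to be turned into a CSS selector first: each space-separated class name becomes a ".name" part appended to the tag name. The Python sketch below shows that conversion. The helper name classes_to_css_selector is made up for illustration, and it assumes the class names contain no characters that would need escaping in CSS.

```python
def classes_to_css_selector(tag, class_attribute):
    """Build a CSS selector from a tag name and the value of its class attribute.

    Hypothetical helper; assumes class names need no CSS escaping.
    """
    # The class attribute is a space-separated list of class names.
    class_names = class_attribute.split()
    # In CSS, tag.a.b matches a <tag> element that carries both classes a and b.
    return tag + "".join("." + name for name in class_names)


if __name__ == "__main__":
    print(classes_to_css_selector("div", "cell in-area-cell middle-cell"))
```

For the parent element above this prints div.cell.in-area-cell.middle-cell, which matches any div carrying all three classes.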

Now let’s create it in our sitemap, as depicted in the following screenshot:

After creating the parent selector, create the child selectors (the individual data items/fields). They are created the same way as the parent selector. We want to collect the data items shown in the following screenshot:

Business Name
To scrape this field, create a child selector of type Text and then select the element.
Address
To scrape this field, create a child selector of type Text and then select the element.
Contact Number
To scrape this field, create a child selector of type Text and then select the element.
Contact Email
Here, the email is not visible on the page, but it is present in the underlying HTML. Let us find it by inspecting the HTML source with Chrome Developer Tools. The email is in the data-email attribute of the anchor element with the classes “contact” and “contact-email” (CSS selector: “.contact.contact-email”).
Website
Here, the website URL is not visible on the page, but it is present in the underlying HTML. Let us find it by inspecting the HTML source with Chrome Developer Tools. The website URL is in the href attribute of the anchor element with the classes “contact” and “contact-url” (CSS selector: “.contact.contact-url”).
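
To illustrate what reading a value from an attribute rather than from visible text means, here is a small Python sketch that uses only the standard library. The markup in SAMPLE_LISTING is made up, loosely following the description above, and the real Yellow Pages HTML may differ. The class and function names are likewise only for illustration, and this is not how Web Scraper works internally.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the description above; the real page may differ.
SAMPLE_LISTING = """
<div class="cell in-area-cell middle-cell">
  <a class="contact contact-email" data-email="info@example.com">Email</a>
  <a class="contact contact-url" href="http://www.example.com">Website</a>
</div>
"""


class ContactAttributeParser(HTMLParser):
    """Collects the data-email and href values from the contact anchors."""

    def __init__(self):
        super().__init__()
        self.email = None
        self.website = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "contact" not in classes:
            return
        if "contact-email" in classes:
            # The email is stored in an attribute, not in the visible link text.
            self.email = attributes.get("data-email")
        elif "contact-url" in classes:
            self.website = attributes.get("href")


def extract_contact(html):
    """Return the hidden email and website values found in one listing."""
    parser = ContactAttributeParser()
    parser.feed(html)
    parser.close()
    return {"email": parser.email, "website": parser.website}


if __name__ == "__main__":
    print(extract_contact(SAMPLE_LISTING))
```

The visible link texts ("Email" and "Website") are ignored. Only the attribute values are returned, which is the same idea as extracting data-email and href in the sitemap.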

The following screenshot shows the child selectors that I created:

Finally, we can view a graph representation of the sitemap by going to Sitemap -> Selectors Graph. The following figure shows the selectors graph:

After creating the sitemap, start scraping the website by going to Sitemap -> Scrape. You can specify the request interval and page load delay for the scrape. Click Start Scraping. After a successful scrape, you can browse the extracted data using Sitemap -> Browse. The following screenshot shows a sample of the extracted data:

Using “Export data as CSV”, we can download the extracted data as a CSV file. We can also export the sitemap and import it again later. If the result meets your requirements, you can download the sitemap created above here. After downloading the sitemap, import it into the Web Scraper extension and start scraping.
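
Once the CSV file is downloaded, it can be processed with any tool that reads CSV. As a rough sketch, the Python snippet below loads the rows and keeps only the leads that have an email address. The column names (name, phone, email) are assumptions: in a real export they depend on the selector names in your sitemap, and the export may contain additional columns.

```python
import csv
import io


def load_leads(csv_file, required_column="email"):
    """Read exported rows as dicts and keep those with a non-empty required column.

    The default column name "email" is an assumption; use your own selector name.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if (row.get(required_column) or "").strip()]


# Made-up sample data standing in for a real export.
SAMPLE_EXPORT = (
    "name,phone,email\n"
    "Acme Plumbing,555-0100,info@example.com\n"
    "Beta Cafe,555-0101,\n"
)

if __name__ == "__main__":
    for lead in load_leads(io.StringIO(SAMPLE_EXPORT)):
        print(lead["name"], lead["email"])
```

To use it on a real export, pass an open file instead of the sample, for example open("yellowpages.csv", newline="", encoding="utf-8"), where the file name is again only a placeholder.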

Now let’s have a look at some pros and cons of Web Scraper:

Pros of Web Scraper:

  • Free of cost, with good documentation and video tutorials available
  • Ideal for simple and light data extraction jobs
  • It can extract data from multiple pages
  • It can extract data from dynamic websites built using AJAX and JavaScript
  • Data can be extracted to CSV

Cons of Web Scraper:

  • Not suited for complex web scraping jobs
  • Websites that require input for searching cannot be scraped
  • No support for CAPTCHAs or proxies
  • Data cannot be exported to other file formats such as XML or Excel
  • Less control over the web scraping job

I hope the article is clear and well documented. If you still need any clarification, don’t hesitate to ask in the comments. I will be happy to help.
