Project: Google Maps Crawler πŸ—ΊπŸͺ²

gui committed 3 years ago · 🐍 Python πŸ€– Automation πŸͺ² Web Scraping

Stack: Python, Selenium

GitHub repo: guilatrova/GMaps-Crawler (Google Maps crawler using Selenium)

This is the first project created for the Antifragile Dev series, and its purpose is to collect data from Google Maps and do pretty much whatever we want with it.


πŸ€– What is Selenium?

Let's keep it simple: Selenium is a tool that manipulates and interacts with the browser as a regular user would.

It can be used to automate tests by simulating user behavior: typing, clicking, scrolling, interacting with content, and checking that outputs are displayed correctly.
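For example, simulating a user typing and submitting a search could look roughly like this (a minimal sketch: the driver comes from the function we'll define below, and the search box name is a hypothetical illustration):

from selenium.webdriver.common.by import By

search_box = driver.find_element(By.NAME, "q")  # πŸ‘ˆ Hypothetical input field
search_box.send_keys("restaurants near me")     # πŸ‘ˆ Typing like a real user
search_box.submit()                             # πŸ‘ˆ Submitting the form
A taste of simulating user behavior with Selenium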

For the scope of this project, I didn't test anything. Instead, I used it to capture data that would be very boring to collect manually: we call it "web scraping".

Google Maps Crawler running - Example

Create a Selenium WebDriver

This is what it looks like to create a "Selenium web driver" that will interact with Google Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

IMPLICIT_WAIT = 5


def create_driver(headless=False):
    chrome_options = Options()
    if headless:  # πŸ‘ˆ Optional condition to "hide" the browser window
        chrome_options.add_argument("--headless")

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options,
    )
    # πŸ‘† Creation of the "driver" that we're using to interact with the browser

    driver.implicitly_wait(IMPLICIT_WAIT)
    # πŸ‘† How long Selenium waits for an element to show up before failing

    return driver
Creating a Chrome driver with Selenium 

πŸ‘† Note we're using ChromeDriverManager to install the required dependencies for Selenium to manipulate the Chrome browser. That makes the setup a lot easier!
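By the way, both dependencies are published on PyPI as selenium and webdriver-manager, in case you want to try this yourself.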

Here's the minimum knowledge you need to get started. To be able to do any of this, it's important that you understand a thing or two about HTML. Check some basic commands:

from selenium.webdriver.common.by import By  # πŸ‘ˆ Required for the By.* locators

driver = create_driver()  # Function defined in the previous example

driver.get(url)  # πŸ‘ˆ Visits a page

# πŸ‘‡ Finding elements

driver.find_elements(By.XPATH, "*")          # πŸ‘ˆ Get all direct child elements
driver.find_element(By.CSS_SELECTOR, "#btn") # πŸ‘ˆ Get one element with id "btn"
driver.find_elements(By.TAG_NAME, "h1")      # πŸ‘ˆ Get all 'h1' elements
driver.find_elements(By.CLASS_NAME, "cls")   # πŸ‘ˆ Get all elements with classname "cls"
Examples of visiting a page and finding elements with Selenium

Deciding on "the best" way to find an element is harder, though...

πŸ› How to debug Selenium with VSCode

My debug process for such applications is always the same. These sites don't want to help you scrape their content, so they make it really hard with random ids and class names.

Say you want to get the business hours for a restaurant. It's not as straightforward as it looks, because nothing in the markup makes much sense:

Random ids hard to understand

Scraping data from such sites is quite painful, and keep in mind they might change the markup at any time; of course, they won't notify you.
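One trick that tends to survive redesigns better than obfuscated class names is anchoring on attributes that carry meaning, like aria-label. Here's a minimal sketch (the selector is illustrative, not Google Maps' actual markup):

from selenium.webdriver.common.by import By

# πŸ‘‡ Hypothetical selector: match a human-readable attribute instead of a random class name
hours_button = driver.find_element(By.XPATH, "//button[contains(@aria-label, 'hours')]")
hours_button.click()
Anchoring on a meaningful attribute instead of a random class name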

This is hard to get right on the first shot, so I'm sharing some tricks I use to make my life less painful. You can set breakpoints in VSCode at specific moments, and then manipulate the driver right from the debug window.

It's great to minimize guesswork.
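Here's a rough sketch of that workflow (the selectors are placeholders):

# πŸ‘‡ Pause right before the tricky part: set a VSCode breakpoint here,
# or use Python's built-in breakpoint()
breakpoint()

# While paused, try selectors live from the VSCode Debug Console, e.g.:
#   driver.find_elements(By.CLASS_NAME, "some-random-cls")
#   driver.find_element(By.XPATH, "//button").get_attribute("aria-label")
Pausing execution to experiment with the live driver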


Finally, make sure your code is readable. Selenium scripts get messy very quickly, so you always want meaningful methods and functions.

Check out a small piece of this project's code:

    def get_place_details(self):
        self.wait_restaurant_title_show()

        # DATA
        restaurant_name = self.get_restaurant_name()
        address = self.get_address()
        place = Place(restaurant_name, address)

        if self.expand_hours():
            place.business_hours = self.get_business_hours()

        # TRAITS
        place.extra_attrs = self.get_place_extra_attrs()
        traits_handler = self.get_region(PlaceDetailRegion.TRAITS)
        traits_handler.click()
        place.traits = self.get_traits()

        # REVIEWS
        place.rate, place.reviews = self.get_review()

        # PHOTOS
        place.photo_link = self.get_image_link()

        self.storage.save(place)
        self.hit_back()

The goal is for the code to be self-explanatory and simple to read.

πŸ€” Why didn't you use the Google Maps API?

Mostly due to some feature limitations and rate-limiting.

Also, I'm still hacking on this project and I don't even know whether it will work, so I just wanted to get something simple working quickly and move on.

🟒 What's next?

Since we want to build a microservice architecture, we've taken our initial step:

Now we must publish the scraped data to SQS as an event. Unfortunately, we don't have any infra yet... Well, I guess it's time for Terraform and CDK. If you don't know those yet, this will be your chance to learn something fun, implemented in a real project.
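To give you an idea of where this is heading, publishing an event with boto3 could look roughly like this (a sketch only: the queue URL and payload shape are placeholders, not the project's actual code):

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scraped-places"  # πŸ‘ˆ Hypothetical queue


def publish_place_scraped_event(place: dict) -> None:
    # Serialize the scraped place and publish it as an event
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(place))
A rough sketch of publishing a scraped place to SQS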

Watch out for the next blog posts!

🟑 What's pending?

I made a few decisions that are worth sharing:

The application is not scaling yet

The application doesn't crawl until the last page

I'm running it from my own computer

πŸ‘† I don't want to worry about collecting more cities, handling other schedules, etc. I'll get back to that later. We must make progress and deliver something simpler but working, and having ~10 restaurants is enough for now.


Follow me on Twitter to keep watching as the project evolves!
