Twitter scraper tutorial with Python: Requests, BeautifulSoup, and Selenium - Part 2
Sat 09 April 2016
Inspired by my friend Bruce's take-home interview question, I started this bite-size project to play around with some of the most popular PyPI packages: Requests, BeautifulSoup, and Selenium. In this tutorial, I'll show you step by step how I built a Twitter Search web scraper without using Twitter's REST API.
This is part 2 of my Twitter scraper tutorial. If you haven't checked out part 1, the link is right here. I ended the last part with an unsolved problem: how do you scrape a web page that uses an infinite-scrolling design? Two solutions came to mind, one more sophisticated and the other more naive:
The more sophisticated approach
The more naive approach
Okay! This is the approach I want to show you. Think about the problem this way: no matter how sophisticated the website's design is, the end result is still a list of tweets loaded in your browser. So my so-called more naive approach is to focus on the end result only. If we can make the browser load those tweets for us, just as we normally see them, we can use the same knowledge as before to parse the HTML and get the tweets. To automate the browser, I'll show you how I used Selenium.
“Selenium automates browsers.” That's what the official website says. The Selenium Python bindings let us drive Selenium from Python. Follow the installation page to install them. The code below tells Selenium to open the Twitter search page in Chrome and then page down 5 times. Since the browser object provides a handy API for locating the tweets, we don't need BeautifulSoup to parse the HTML again. You may run the script now:
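import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Open the Twitter search page for the query in Chrome.
browser = webdriver.Chrome()
base_url = u'https://twitter.com/search?q='
query = u'%40dawranliou'  # URL-encoded "@dawranliou"
url = base_url + query
browser.get(url)
time.sleep(1)  # give the page a moment to load

# Press Page Down 5 times so that more tweets get loaded.
body = browser.find_element(By.TAG_NAME, 'body')
for _ in range(5):
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)

# Print the text of every tweet currently on the page.
tweets = browser.find_elements(By.CLASS_NAME, 'tweet-text')
for tweet in tweets:
    print(tweet.text)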
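The script pages down a fixed number of times, so it only sees as many tweets as those few scrolls happen to load. If you want to keep going until nothing new shows up, one possible variation is sketched below. This is not part of my original script: the names scroll_until_stable and load_tweets are made up for illustration, pressing the End key to reach the bottom of the page is an assumption about how the page reacts, and the fixed pause is a crude stand-in for a proper wait, so a slow connection could make it stop too early.

```python
import time


def scroll_until_stable(count_items, scroll_once, max_rounds=20, pause=0.5):
    """Scroll until the item count stops growing or max_rounds is reached.

    count_items and scroll_once are plain callables, so this helper does not
    depend on Selenium itself. Returns the last item count seen.
    """
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # crude wait for new items to load
        current = count_items()
        if current == previous:
            break  # nothing new appeared, assume we reached the end
        previous = current
    return previous


def load_tweets(browser, max_rounds=20):
    """Hypothetical wiring for a Selenium browser already on the search page."""
    # Imported here so that the helper above stays usable without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    body = browser.find_element(By.TAG_NAME, 'body')

    def count_items():
        return len(browser.find_elements(By.CLASS_NAME, 'tweet-text'))

    def scroll_once():
        body.send_keys(Keys.END)  # assumption: End jumps to the page bottom

    scroll_until_stable(count_items, scroll_once, max_rounds=max_rounds)
    return [t.text for t in browser.find_elements(By.CLASS_NAME, 'tweet-text')]
```

With the browser from the script already on the search page, calling load_tweets(browser) would return the tweet texts as a list. The max_rounds cap is there so the loop always ends, even on a page that never runs out of content.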
If running the script gives you the following error:
selenium.common.exceptions.WebDriverException: Message: 'chromedriver'
executable needs to be in PATH. Please see
Don't panic. Read the error message (don't just google it blindly): the problem is that you are missing the 'chromedriver' executable. The message also suggests a website to visit. How nice! Download the executable from that website and put it in one of the directories on your PATH. I put mine under /usr/local/bin/. You should now be able to run the script.
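If you are not sure whether the executable ended up somewhere on your PATH, a quick check like the one below can tell you before you launch the browser. It is only a small sketch using the standard library's shutil.which; the helper name find_driver is my own and not something Selenium provides.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found in any PATH directory")
else:
    print("chromedriver found at", path)
```

If it reports that nothing was found, double-check where you saved the file and that the file is marked as executable.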
This is the end of this tutorial! Hope you enjoy working with these amazing PyPI packages. Feel free to comment or contact me if you want to learn more. Happy learning! Cheers!