Data Scraping 101

October 13, 2025
Written By Kevin Hemminger

Why Learn Data Scraping?

Airbnb originally built its empire by scraping Craigslist. Every time someone listed their home as a rental or as a bed and breakfast, that person got an email from Airbnb to the effect of “Hey, would you also like to list your rental on our website?” This was the process that launched a company valued at roughly $74.71 billion as of October 2025.

Imagine for a moment what was necessary for Airbnb to accomplish this.

They needed continual, automated scraping of Craigslist across all cities, and then a way to email everyone who listed a property for rent.

The problem is that Craigslist managed contact between the person listing and Airbnb. They wouldn’t have allowed a company such as Airbnb to email so many people through their system. To get around this, Airbnb allegedly created thousands of Gmail accounts in an effort to disguise their operation.

Another issue is that Gmail does not allow someone to use thousands of accounts to send unsolicited emails either. Google has systems in place to detect spam. The system that Airbnb put together would most likely have needed proxies, so that each of the thousands of accounts appeared to be operated by a different person in a different location.

Let’s recap the challenges above:

  • Scrape data from all cities on Craigslist
  • Analyze listings for keywords indicating the poster was temporarily renting a property or offering a bed and breakfast
  • Create thousands of Gmail accounts to receive emails from interested renters
  • Send emails through Craigslist, using the thousands of Gmail accounts as return addresses, to the people who were posting their properties for rent
  • Monitor the thousands of Gmail accounts for responses, in order to begin communicating directly with each renter and pitch also listing their rental on Airbnb
  • Use proxies to get past Craigslist’s and Google’s restrictions and safeguards against mass spam emails

This was essentially how Airbnb got off the ground and built a $74 billion business. It is but one of a nearly unlimited number of business applications for data scraping and business automation.

How to Scrape Data in 2025

There are many ways to scrape data. The process typically involves using your favorite programming language to grab a webpage from the internet, extract data from the raw text, then put the data into a database. Which software you use will depend partly on what you’re comfortable programming with, and partly on how difficult it is to get the data from your target.
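The three steps above (grab a page, extract data, store it) can be sketched with nothing but Python's standard library. This is a minimal illustration, not a production scraper: the choice to extract links, the "links" table layout, and the "example-scraper/0.1" user agent are all assumptions made for the example.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url, timeout=10):
    """Step 1: download the raw HTML of a page."""
    # The User-Agent string is a made-up example value.
    req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_links(html):
    """Step 2: pull structured data (here, links) out of the raw text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def store_links(conn, links):
    """Step 3: save the extracted rows; re-running replaces existing rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT OR REPLACE INTO links (href, title) VALUES (?, ?)", links)
    conn.commit()
```

To run it end to end, you would call something like store_links(sqlite3.connect("scrape.db"), extract_links(fetch(url))) with a URL of your own. For messier real-world HTML, a dedicated parsing library is usually more comfortable than html.parser.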

You may find that the site you’re scraping doesn’t want its data scraped and has put measures in place to block scrapers. In that case you may need proxies or CAPTCHA-solving services to get the data you’re after.
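As a rough idea of what using proxies looks like in code, here is a small sketch with Python's standard library. The proxy addresses are placeholders you would replace with proxies you actually have access to, and the round-robin rotation is just one simple strategy chosen for the example.

```python
import itertools
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def rotating_openers(proxy_urls):
    """Yield (proxy_url, opener) pairs forever, cycling through the list in order."""
    for proxy_url in itertools.cycle(proxy_urls):
        yield proxy_url, build_proxy_opener(proxy_url)
```

Each request would then take the next opener from rotating_openers(...) and call opener.open(url, timeout=10), so consecutive requests leave through different proxies.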

You may also find that the data you want is complex to retrieve, because the website loads it with JavaScript (often through Ajax requests) after the initial page has arrived. When you click “view source” in a browser, the data you want isn’t there. In cases like this, you will need a solution that uses a browser to execute the JavaScript before you grab the data, such as Selenium, Playwright, or a Node.js library like Puppeteer.
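Here is a minimal sketch of that idea using Playwright's Python API. It assumes the playwright package is installed and that its Chromium build has been downloaded (playwright install chromium). The helper names needs_browser and fetch_rendered_html, and the idea of checking for a known marker string first, are inventions for this example.

```python
from typing import Optional


def needs_browser(static_html: str, marker: str) -> bool:
    """Return True when the text we expect is missing from the raw page source."""
    return marker not in static_html


def fetch_rendered_html(url: str, wait_selector: Optional[str] = None,
                        timeout_ms: int = 15000) -> str:
    """Load the page in headless Chromium and return the HTML after scripts ran."""
    # Imported lazily so the static-only path works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            if wait_selector:
                # Wait until the element holding the data has been added to the page.
                page.wait_for_selector(wait_selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

A typical flow would be to download the page the cheap way first, and only call fetch_rendered_html when needs_browser says the data is missing, since starting a browser is much slower than a plain HTTP request.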

Other factors include how you want to deploy your solution. If your scraping is for an internal company project that only you will run, Selenium could be a great solution. However, Selenium has issues with deployability. It executes JavaScript by actually running a browser, driven through a separate driver program such as ChromeDriver, and the driver has to match the browser. With older Selenium releases, when the browser updates itself your solution breaks until you download the latest ChromeDriver. Selenium 4.6 and later include Selenium Manager, which tries to fetch a matching driver automatically, but that needs network access on every machine. If you use Selenium’s Java bindings or run Selenium Grid, you will also need to install Java and have it in your path. After troubleshooting all these issues, congratulations, you have a working Selenium solution (until the next Chrome browser update). If you wanted to distribute this solution to many servers or many customers, it would quickly become overwhelming and unmanageable.

What then should you use if you want to distribute your solution to many servers or many clients? Playwright could be a good choice, because you can package your program and its browser together as an installer. Each Playwright release downloads its own specific browser builds, so the underlying Chromium browser can be locked to a known version, preventing the solution from failing every time Chrome wants to update itself.