- AUTHOR Arti Marane
- PUBLISHED ON July 14, 2020
Let’s get started!
What is Web Scraping?
Web scraping is a fully or partially automated way of extracting large amounts of data from websites. The extracted data can then be downloaded to a computer or accessed through spreadsheets (XLS) or databases. Web scraping automates the viewing, downloading, and extraction of accurate and reliable data from web pages, which can then be used for intelligence, analysis, strategy decisions, and so on.
We can scrape diverse data from different sources according to business needs: text, images, email IDs, phone numbers, videos, and so on. Specific projects may need domain-specific data, such as financial data for analytics, or reviews, prices, and product information for an e-commerce business. From a programming perspective, many languages and frameworks can be used, such as Java, Python, Node.js, and JavaScript. These help to scrape data and read through HTML or web pages. We will dive into the details of those in my next article.
How does Web Scraping Work?
Web scraping usually works in three steps.
1. Step I: Configure request-response
The first step in developing a web scraping application (a “scraper”) is to request content from the website. In return, the scraper receives the requested information in HTML, plain text, XML, or JSON format. To make the request, the application needs to know the request type, request parameters, headers, login details if required, security policies, cookies, tokens, and so on. An ideal scraper keeps this information configurable per website. Configuration lets a single application handle multiple websites and makes it easy to adapt to future changes in a website.
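As a rough illustration of keeping request details configurable per website, here is a minimal Python sketch using only the standard library. The `SiteConfig` class, its field names, the URL, and the header and cookie values are all assumptions made for this example; nothing is sent over the network.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class SiteConfig:
    """Per-website request settings (hypothetical field names)."""
    base_url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def build_request(config: SiteConfig) -> Request:
    """Turn a SiteConfig into a urllib Request object without sending it."""
    url = config.base_url
    if config.params:
        url = f"{url}?{urlencode(config.params)}"
    headers = dict(config.headers)
    if config.cookies:
        # Cookies travel as a single "Cookie" header of name=value pairs.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in config.cookies.items())
    return Request(url, headers=headers, method=config.method)


# Hypothetical configuration for one target site.
shop = SiteConfig(
    base_url="https://example.com/products",
    params={"page": 1},
    headers={"User-Agent": "example-scraper/0.1"},
    cookies={"session": "abc123"},
)
request = build_request(shop)
print(request.get_method(), request.full_url)
```

Supporting another website then means adding another `SiteConfig` entry rather than changing the request code.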
2. Step II: Parse data & extract information
This step involves reading the HTML of a specific page and extracting elements such as labels, input controls, tables, the page title, paragraphs, the body, links, and headings.
Here, regular expressions (or groups of regular expressions) are written and passed to the engine as configuration. The engine parses HTML pages using these regular expressions and extracts the associated text, images, formatted data, and PDFs from the pages.
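The idea of passing regular expressions to the engine as configuration could look roughly like the sketch below. The rule names, the patterns, and the sample HTML are assumptions for illustration. Regular expressions are brittle on real-world HTML, so a dedicated HTML parser is often more robust.

```python
import re

# Hypothetical extraction rules; in practice these might be loaded from a config file.
RULES = {
    "title": r"<title>(.*?)</title>",
    "links": r'href="([^"]+)"',
    "emails": r"[\w.+-]+@[\w-]+\.[\w.]+",
}


def extract(html: str, rules: dict) -> dict:
    """Apply every configured pattern to the page and collect all matches per rule."""
    flags = re.IGNORECASE | re.DOTALL
    return {name: re.findall(pattern, html, flags) for name, pattern in rules.items()}


SAMPLE_HTML = """
<html><head><title>Demo Shop</title></head>
<body>
  <a href="/products">Products</a>
  <a href="/contact">Contact</a>
  <p>Write to sales@example.com for a quote.</p>
</body></html>
"""

result = extract(SAMPLE_HTML, RULES)
print(result)
```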
3. Step III: Save data
This is the final step: the extracted data is written to a CSV file, database, JSON file, or text file, and later retrieved either manually or programmatically for further processing.
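A minimal sketch of the save step, writing the same extracted rows to both CSV and JSON with the Python standard library. The row contents and file names are made up for the example, and the files go to a temporary directory.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_csv(rows, path):
    """Write a list of dicts to CSV, using the first row's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write the same rows as a JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2)


# Hypothetical scraped rows.
rows = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "24.50"},
]

output_dir = Path(tempfile.mkdtemp())
save_csv(rows, output_dir / "products.csv")
save_json(rows, output_dir / "products.json")
print(sorted(p.name for p in output_dir.iterdir()))
```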
Scraping may be repeated at intervals based on business needs. For example, some data needs to be crawled daily, some weekly, and for some websites scraping may need to repeat hourly. A scheduler or crawler is usually used to achieve this.
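The hourly, daily, and weekly intervals described above can be sketched as a small "which jobs are due" check. The job names, the schedule, and the timestamps are assumptions for illustration. A real deployment would more likely rely on an existing scheduler such as cron.

```python
from datetime import datetime, timedelta

# Hypothetical schedule: how often each scraping job should run.
SCHEDULE = {
    "prices": timedelta(hours=1),
    "reviews": timedelta(days=1),
    "catalog": timedelta(weeks=1),
}


def due_jobs(last_run: dict, now: datetime, schedule: dict) -> list:
    """Return the jobs that have never run or whose interval has elapsed."""
    due = []
    for job, interval in schedule.items():
        previous = last_run.get(job)
        if previous is None or now - previous >= interval:
            due.append(job)
    return due


now = datetime(2020, 7, 14, 12, 0)
last_run = {
    "prices": datetime(2020, 7, 14, 10, 30),  # 1.5 hours ago, so due
    "reviews": datetime(2020, 7, 14, 0, 0),   # 12 hours ago, so not due yet
    # "catalog" has never run, so it is due
}
jobs = due_jobs(last_run, now, SCHEDULE)
print(jobs)
```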
How does web scraping help businesses grow?
Web scraping is used in numerous applications across many domains. I am listing a couple of major areas below for reference.
1. E-commerce: data analysis, competitive pricing, and content creation
Product details and prices are of paramount importance in deciding the strategy of an e-commerce business. This is where web scraping gives a great edge. Data can be scraped from multiple sources, and prices and descriptions can be scraped on a regular basis and used for analysis and strategy decisions. It also helps small e-commerce platforms bring in thousands of product details with a single click.
2. Lead generation
Marketing is important for businesses of every size, from small to giant. Targeted marketing requires knowing whom to contact, and gathering those details is what we call the lead generation process. For marketing we need hundreds or thousands of leads, such as websites, email addresses, and phone numbers. A scraper helps to extract this information from selected targets and provides a consolidated view in a database, CSV, or XLS file with a single click.
Is web scraping legal?
In today’s world a lot of data is exposed over the internet, some of it copyrighted and some of it public. As a web scraper, you need to consider a few important points regarding the legality of using that data.
- Scraping public data and non-copyrighted images is generally safe.
- Scraping copyrighted data for commercial purposes is unethical and illegal. As long as you do not violate a website’s terms you are generally safe, so read the terms and conditions carefully before scraping data.
- If the target website exposes an API, it is better to use that, but again make sure you do not violate its terms and conditions.
- Define a reasonable scraping rate, such as 1 request every 15-30 seconds.
- Avoid aggressive and relentless scraping. The more frequently you hit the target website, the higher the chance of the scraper getting blocked.
- Always follow the rule: “if in doubt, ask and then proceed”.
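The reasonable scraping rate mentioned above can be enforced with a small throttle, sketched here in Python. The `Throttle` class is a made-up helper, not part of any library. The demo injects a fake clock and a fake sleep function so that it runs instantly; by default the class uses `time.monotonic` and `time.sleep`.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        # Record when this request is (approximately) allowed to go out.
        self._last = now + delay
        return delay


# Demo with a fake clock and a fake sleep, so nothing actually blocks.
fake_now = [0.0]
waits = []
throttle = Throttle(15, clock=lambda: fake_now[0], sleep=waits.append)

throttle.wait()      # first request: no wait
fake_now[0] = 5.0    # pretend 5 seconds have passed
throttle.wait()      # second request: must wait the remaining 10 seconds
print(waits)
```

In a real scraper you would create `Throttle(15)` once and call `wait()` before every request to the same website.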
Web Scraping challenges
During scraping you will face multiple challenges. Some are time-consuming but not blocking, while others are blockers and may even break your running application:
- Frequent changes in the structure or UI of web pages
- Honeypot traps
- Anti-scraping mechanisms such as CAPTCHA
- Quality of data
Let’s briefly look at each of these:
1. Frequent UI changes
UI and structure changes can add challenges to your working environment.
- Most websites keep updating their UI and features to enhance the user experience and retain existing clients. This leads to multiple changes in the UI structure.
- A scraper or crawler is usually written and configured around the website’s user interface. This means the slightest change in the target website can crash the scraper or lead to inaccurate data, defeating its purpose.
- Hence, the scraper must be updated whenever the target website’s UI changes, and dealing with these changes frequently is a major challenge.
2. Honeypot traps
Many websites have mechanisms to protect their data. A honeypot is one of them.
- The target site places hidden controls or links inside its UI.
- A scraper usually treats these links as regular links and starts scraping or downloading data.
- These links track and record the IP addresses of scraper applications. Hits to the website beyond a certain threshold result in the IP being blocked.
- Some traps are infinitely deep directories: the scraper keeps following links recursively, and after a few attempts the site blocks its IP.
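One hedged way to reduce exposure to the traps listed above is to skip links that are obviously hidden and to cap how deep the crawler follows paths. The sketch below uses Python's standard `html.parser`. The class name, sample HTML, and depth helper are assumptions for illustration. It only detects the `hidden` attribute and inline styles, not links hidden through CSS classes or parent elements.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        href = attributes.get("href")
        if href and not hidden:
            self.links.append(href)


def path_depth(url: str) -> int:
    """Count non-empty path segments, to cap recursion into deep directories."""
    return len([part for part in urlparse(url).path.split("/") if part])


SAMPLE_HTML = """
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">Special</a>
<a href="/trap-2" hidden>More</a>
<a href="/about">About</a>
"""

parser = VisibleLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
print(path_depth("https://example.com/a/b/c"))
```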
3. Anti-scraping Technologies
Some social networking sites and websites holding sensitive data use anti-scraping technologies such as CAPTCHA, application firewalls, and anti-bot services.
- Websites such as LinkedIn and financial websites have aggressive anti-scraping technologies and can defeat most scraping attempts.
- These websites have bot or scraper detection mechanisms and may block your IP, blacklist your company, or send a legal notice.
- Sometimes robots.txt is used to indicate explicitly that scraping is not allowed.
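Checking robots.txt before fetching a page can be done with Python's standard `urllib.robotparser`. In the sketch below the robots.txt content, the user agent name, and the URLs are made up, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

blocked = robots.can_fetch("example-scraper", "https://example.com/private/data")
allowed = robots.can_fetch("example-scraper", "https://example.com/products")
print(blocked, allowed)
```

For a live site, `set_url()` followed by `read()` fetches and parses the file instead of `parse()`.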
4. Data Quality
Data is the most important asset for some businesses and is used for analysis and for producing statistics. Ensuring the accuracy of data is very important, and scrapers need to focus on it. Two important aspects of scraping are:
- Quality of data
- Data scraping consistency
With the latest security improvements and frequent structure changes, it is a big challenge for a scraper to achieve consistency and data quality on a regular basis.
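One simple way to keep an eye on data quality is to validate every scraped record before saving it, as in the Python sketch below. The required field names and the checks are assumptions for illustration. A sudden rise in failing records can also hint that the target website's structure has changed.

```python
def validate_record(record: dict, required=("name", "price")) -> list:
    """Return a list of problems found in one scraped record (empty if it looks fine)."""
    problems = []
    for field_name in required:
        value = record.get(field_name)
        if value is None or not str(value).strip():
            problems.append(f"missing {field_name}")

    price = record.get("price")
    if price is not None and str(price).strip():
        try:
            if float(price) < 0:
                problems.append("negative price")
        except ValueError:
            problems.append("price is not a number")
    return problems


# Hypothetical scraped records.
records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "", "price": "abc"},
    {"name": "Gadget"},
]

report = [validate_record(record) for record in records]
print(report)
```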