DATA SCRAPING - TYPES, USES, & WHY IT MATTERS
Data scraping is a hotly debated topic, and a technique used by businesses and criminals alike to make money and inform decisions. According to the 2016 Economics of Web Scraping Report by Distil Networks, 38% of companies use web scraping for content and market research, with real estate being the number one target of web scraping.
Data scraping gives professionals a range of tools for working with data, whether extracting, analyzing, or integrating it. Because it can efficiently extract data from multiple websites, or from a legacy system when no API is available, data scraping can replace cumbersome and often ineffective programs or manual tasks.
Employed around the world by nearly every industry, from sports and government to corporations and criminals, web and content scraping tools are the competitive advantage that makes - or the nuisance that costs - businesses and individuals millions of dollars each year.
So what is scraping? What is the difference between the types? How could you use it for your business and what tools are available?
WHAT IS SCRAPING
Scraping refers to extracting data or content from a website or series of websites, a database, enterprise application, or legacy system. This data is exported into a file or program that is then used for a specific purpose or to be integrated/migrated into a new system.
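As a rough illustration of "extract, then export", here is a minimal Python sketch using only the standard library. The HTML snippet, the column names, and the names TableScraper and scrape_to_csv are assumptions made up for this example; a real scraper would download the page first and deal with much messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a site.
SAMPLE_HTML = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape_to_csv(html):
    """Extract the table rows and return them as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["product", "price"])  # assumed column names
    writer.writerows(scraper.rows)
    return out.getvalue()


csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

The CSV text could just as easily be written to a file or loaded into another system, which is the "export" half of the definition above.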
WEB VS SCREEN SCRAPING
Although these terms may at times be used interchangeably, web and screen scraping are two separate scraping techniques. The lines become blurred as screen scraping can be completed on the web, or web scraping is sometimes used during migrations, but it’s easiest to view web scraping primarily as a tool for “Data Analysis, Acquisition, & Research” and screen scraping as a tool for “Integration & Migration”.
The difference between web and screen scraping is:
- Web Scraping - Extracts data or content from the web. Content scraping is a component of web scraping. Primarily used for research, analysis, comparison, strategy, and extracting specific information from one massive source or multiple sources.
- Screen Scraping - Extracts screen and other data from an application, desktop, web, or legacy system. Primarily used for scraping ERP or CRM data to integrate into a new system, mirroring the display of a legacy system, migrating content, business process automation, etc.
Web or content scraping can take place manually by a human or automatically through a program. Screen scraping is accomplished through a program. Screen scraping is a versatile tool for data migration, as it enables accurate extraction and integration of legacy systems data into a newer, more cost-efficient and effective platform.
Web scraping opened the door to the internet for people around the world in the 1990s and 2000s. Search engines used web scrapers called “Web Crawlers” to inspect the content and data of millions of websites. The keywords and data extracted were then indexed and used to power the search engines people use to navigate the web. Without web crawlers, we would not have Google, Yahoo!, or Bing.
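To make the "extract keywords, then index them" idea concrete, here is a toy Python sketch of a keyword index. The page URLs and text are invented for illustration, and build_index and search are hypothetical helper names; real search engines are far more sophisticated than this.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "https://example.com/a": "Cheap flights and hotel deals",
    "https://example.com/b": "Hotel reviews and travel tips",
}


def build_index(pages):
    """Map each lowercase keyword to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, keyword):
    """Return the URLs containing the keyword, in sorted order."""
    return sorted(index.get(keyword.lower(), set()))


index = build_index(PAGES)
print(search(index, "hotel"))
```

Looking up a keyword in the index is a single dictionary access, which is why indexing ahead of time makes searching fast.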
Web scraping or content scraping is used for a number of reasons across industries:
- Price Comparison
- Market & Competitor Research
- Contact Scraping (Email and Contact Info)
- Weather or Currency Data Monitoring
- Marketing - Content Creation, SEO, Metadata, etc.
- Decision Making & Planning
Some applications of web scraping include:
- Search Engines - Extract relevant information from websites to display in relation to search criteria
- Sports - Tracking sports for stats, fantasy, bets, etc.
- Government - Tracking inflation, currency, or news for a specific country
- Real Estate - Tracking the prices for housing markets, property or rentals, competitor comparison, and more
- Marketing - Tracking social media sentiment around consumer confidence, SEO, metadata, content scraping, keywords, AdWords copy, potential influencers, and more
- Pricing - Compare the prices of tickets, airlines, hotels, festivals, products or any number of items or services to source the best deal or price accordingly
- Unethical/Hacking - Denial of Service (DoS) attack, price scraping and beating, stealing content, contact or account information
Web scraping programs are called bots, crawlers, spiders, harvesters, etc. Popular sites such as Facebook, Twitter, and YouTube provide public APIs so developers can access their data in a structured way. But when APIs are not available, or different data needs to be extracted, a web scraping program is created using Python, Ruby, PHP, or one of many other popular languages.
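A crawler or spider typically works by reading a page, collecting its links, and visiting those next. Below is a minimal Python sketch of the link-collecting step, using only the standard library. The HTML snippet, the URLs, and the names LinkExtractor and extract_links are assumptions for this example; no page is actually downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and address.
SAMPLE = '<p><a href="/listings">Listings</a> <a href="https://example.org/x">X</a></p>'
links = extract_links(SAMPLE, "https://example.com/home")
print(links)
```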
Some examples of online web scraping tools available (to name a few) include:
- FlightStats for real-time airline transport data
- Wikibuy for product pricing comparison
- Web Scraper Chrome extension for site maps
- The SEO Spider tool Screaming Frog
- Content scraper tool Ahrefs Site Explorer
Some examples of screen scraping tools (to name a few) include:
- UiPath - Comprehensive screen scraper to pull data from any application in minutes
- Jacada - Jacada Integration and Automation (JIA) is a reliable tool for data integration, desktop automation, and Windows/web app screen scraping
- Macro Scheduler - Powerful screen text capture, OCR functions, and multiple tools
The fast-paced evolution of technology means legacy systems, software, and applications become obsolete and costly to maintain. These large investments hold a wealth of sensitive and important information. A 2017 study by SnapLogic and the independent research firm Vanson Bourne, which surveyed 500 IT decision makers in the US and UK, found that critical data trapped in legacy systems and disconnected data roadmaps added up to nearly $140 billion in missed opportunities and additional costs.
Screen scraping a system in its entirety is crucial for certain companies, especially when data needs to be kept intact for regulatory or record keeping purposes.
Screen scraping supports accurate system integration/migration for:
- Crucial Legacy Systems - Highly accurate and complete migration of all system data
- Governments - public and government records
- Health Care Providers - health records for patients
- Banks - legal documents, account information, and transaction records
- Energy & Mining - crucial legacy systems data, records, approvals, etc.
- Corporations & Multi-Nationals - Enterprise data from ERP, CRM, SCM, and other systems
Screen scraping is a technique that scrapes data straight from the screen intended for the user, extracting it through generic APIs and without accessing the source code. Many older CRM systems do not have a built-in API, which makes screen scraping a powerful tool for migrations, thanks to its ability to access and export legacy data with a high level of accuracy.
Screen scraping techniques include:
- Using standard APIs to analyze screen contents
- System API interception to monitor (catch) how data reaches the screen
- Custom mirror driver or accessibility driver
- Using optical character recognition (OCR)
Whether it is from a CRM, ERP, SAP, Oracle, MS Office, or other desktop application/system, a screen scraper gathers all the information needed to ensure the logic and data from the legacy system remain intact for successful integration into the new system.
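One simple form of screen scraping is reading fields from a text screen at known row and column positions, as with many terminal-based legacy systems. The Python sketch below assumes the screen text has already been captured; the screen layout, the field positions, and the names SCREEN, FIELDS, and scrape_screen are all hypothetical and do not reflect any particular product. The screen is simulated with format strings so the column positions are exact.

```python
# Simulated capture of a legacy terminal screen, one string per screen row.
# Layout and values are invented for this example.
SCREEN = [
    "CUSTOMER INQUIRY",
    f"NAME: {'JANE DOE':<18}ID: {'004217':<6}",
    f"BALANCE: {'1,250.00':>12} STATUS: {'ACTIVE':<8}",
]

# Assumed field map: field name -> (row, start column, length).
FIELDS = {
    "name": (1, 6, 18),
    "customer_id": (1, 28, 6),
    "balance": (2, 9, 12),
    "status": (2, 30, 8),
}


def scrape_screen(screen, fields):
    """Cut each field out of the screen text and trim the padding."""
    record = {}
    for name, (row, col, length) in fields.items():
        record[name] = screen[row][col:col + length].strip()
    return record


record = scrape_screen(SCREEN, FIELDS)
print(record)
```

Because the values are read exactly as the user would see them, the extracted record can be checked against the original screen before it is loaded into the new system.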
DATA SCRAPING - THE GOOD, THE BAD, AND THE UGLY
Legally, web scraping is used for marketing and research to price, monitor, analyze, and aggregate information that supports decision making and content creation. Unethically, and in some views illegally, it is used to steal and re-share copyrighted content or automate the matching and beating of competitors’ pricing.
Imagine a company puts on a promotion to generate sales but doesn’t know a competitor is a step ahead using bots and a web scraper. The web scraper identifies the new price soon after it goes online, and a bot updates the competitor’s product price to beat it. More bots visit and overload the company’s site with traffic, creating what is called a Denial of Service (DoS) attack, slowing down or crashing the site and redirecting customers to the competitor’s site instead.
eBay experienced the dark side of web scraping when its servers were nearly taken down in the early 2000s by Bidder’s Edge. Bidder’s Edge was a data aggregator for auction sites that scraped the prices of items from eBay, eventually pulling so much data that it disrupted the servers (read more: court case eBay v. Bidder’s Edge).
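Request volume is something a well-behaved scraper can control. As a minimal sketch, the Python class below enforces a pause between consecutive requests. The Throttle name, the interval, and the URLs are assumptions for illustration, and no request is actually sent; the appropriate delay depends on the site being scraped and its terms of use.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage: pause before each (simulated) page request.
throttle = Throttle(min_interval=0.01)
for url in ["https://example.com/item/1", "https://example.com/item/2"]:
    throttle.wait()
    print("would fetch", url)
```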
Illegally, spammers or scammers use it to harvest email addresses to send malicious mail or scams. It is used to hack websites or business intranets, and extract (steal) information to commit other types of crime, blackmail, or fraud.
LEVERAGE THE POWER OF DATA SCRAPING
Whether you are upgrading your legacy system or want to further explore how to leverage the power of web or content scraping for your business, contact us today here at The SilverLogic with any questions.
Our award-winning team of software engineers and techsperts are customer-focused solution architects ready to build a custom solution for your e-commerce/online business or enterprise. Together we simplify the process of upgrading your system or building a custom scraping tool for web, data migration, marketing, or any other application.
Since 2012 our team has helped clients navigate the maze of investing vs spending on tech solutions, providing a number of services and solutions to help them collaboratively create their own custom-made competitive advantage.