Is it legal to scrape websites?

Web scraping, also known as web data extraction or web harvesting, refers to the automated process of extracting data from websites. It involves writing computer programs or using bots to extract information from the web, which can include text, images, videos, and other types of data. The legality of web scraping depends on several factors.

What is web scraping?

Web scraping involves accessing web pages programmatically through code rather than manually browsing. Scrapers extract the specific pieces of data they need and store it in a structured format like a spreadsheet or database for further analysis. Scrapers can be used to collect all sorts of data from the web, such as product descriptions and prices, news articles, social media posts, research data, and more.

Some common techniques used in web scraping include:

Parsing HTML, XML or JSON code from web pages to extract data
Using regular expressions (regex) to find and extract matching text and data
Analyzing page structures and scraping data from certain DOM elements
Submitting forms or using APIs to extract data not viewable from static pages
Mimicking human browsing behaviors like mouse movements and clicks
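
As a rough illustration of the first three techniques, the Python sketch below parses a small HTML fragment with the standard library's html.parser module and, separately, pulls prices out of the same text with a regular expression. The sample markup, the class names "product", "name" and "price", and the function names are invented for this example and are not taken from any real site.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would first fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$5.00</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # "name" or "price" while inside a matching span
        self.rows = []             # one dict per <li class="product">

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

    def handle_data(self, data):
        # Only keep text that appears inside one of the spans we care about.
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()


def extract_products(html):
    """DOM-style extraction: walk the markup and pick out specific elements."""
    parser = ProductParser()
    parser.feed(html)
    return parser.rows


# Regex-style extraction: match every dollar amount in the raw text.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(text):
    return [float(amount) for amount in PRICE_RE.findall(text)]


products = extract_products(SAMPLE_HTML)
prices = extract_prices(SAMPLE_HTML)
```

The parser-based version keeps each name paired with its price, while the regex version is shorter but loses that structure, which is one reason scrapers often combine the two approaches.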

Web scrapers range in complexity from simple scripts anyone can run to robust bots and frameworks that can scrape at scale. Some scrapers also route traffic through rotating proxies to avoid being blocked when they send many requests to a site.

Why do people web scrape?

There are many legitimate reasons why individuals and companies scrape websites, such as:

Price monitoring – Track prices and price history of products across ecommerce sites.
Market research – Collect data on competitors, products, reviews, trends and more.
News monitoring – Scrape articles and information from news sites to curate content.
Academic research – Gather data from the web for statistical analysis, text mining and big data.
Search engine indexing – Search engines scrape web pages to catalog content and serve relevant results.
Online directories – Populate listings by scraping business information such as names, addresses and phone numbers.
Private investigation – Investigators may scrape social media and public records for information related to cases.

The key point is that web scraping makes it possible to collect large amounts of web data quickly, where gathering the same data manually would be impractically slow. This data can provide important business insights, power research, and assist with online services.

Is web scraping legal?

The legality of web scraping depends on several factors:

Copyright law

Most website content is protected by copyright law. Scraping and reusing content without permission could infringe copyright. However, copyright protects creative expression rather than bare facts, and the fair use doctrine can permit uses such as research, criticism, news reporting, and other transformative uses.

Terms of Service

Many websites prohibit scraping in their Terms of Service (ToS). Violating the ToS could constitute a breach of contract. However, the enforceability of ToS varies and blanket scraping bans have faced challenges in court.

The Computer Fraud and Abuse Act

The CFAA is a U.S. federal law that prohibits “unauthorized access” to computers. Scraping data could potentially run afoul of the CFAA if done in a way that circumvents access controls like CAPTCHAs, blocks, or IP bans.

State laws

Some states, such as California and Virginia, have computer crime laws that may impose civil and criminal penalties for unauthorized access to data or sites.

Other factors

Things like exceeding reasonable server load, using deception to scrape, violating privacy laws, stealing trade secrets, and violating licensing agreements could also make scraping illegal in some cases.

So in summary – scraping public data on websites in a well-behaved manner is likely fine, but scraping private/restricted data, circumventing access controls, violating ToS, or causing server issues could cross legal lines depending on the circumstances.

Examples of illegal web scraping

Here are some examples of web scraping practices that have raised legal concerns or run afoul of the law:

Scraping pricing data from a competitor’s website after they’ve attempted to block scraping through IP bans and CAPTCHAs.
Copying and republishing articles and content from news sites without permission.
Scraping personal user data like email addresses and bios from a social media site that prohibits scraping in its ToS.
Scraping classifieds ads and reposting them on another site without consent.
Scraping real estate listings without a data licensing agreement in place.
Running scraping bots that overload and crash websites due to excessive requests.
Scraping content that is not publicly accessible or requires a login without permission.

These examples may violate copyright law, the CFAA, state computer crime laws, the DMCA, Terms of Service, or other anti-hacking laws. But each case depends on the specific facts and jurisdiction.

Ways to scrape legally

Here are some tips to help mitigate legal risks when web scraping:

Review websites’ Terms of Service and only scrape sites that don’t expressly prohibit it or where you have permission.
Avoid circumventing technical access controls like CAPTCHAs and IP blocks.
Check whether your use of scraped content would plausibly qualify as fair use rather than infringe copyright.
Don’t access private user accounts or non-public information protected by passwords.
Use throttling and delays to avoid overloading sites, and identify your scraper with an honest user agent.
Scrape responsibly during off-peak hours and respect robots.txt rules.
Do not redistribute copyrighted scraped content verbatim. Focus on the underlying data instead.
Store scraped data securely and don’t violate user privacy.
Be transparent in stating if any of your services use web scraping.
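
The throttling and robots.txt tips above can be sketched in Python using the standard library's urllib.robotparser module. This is a minimal sketch under stated assumptions, not a complete crawler: the bot name, the sample robots.txt rules, the example.com URLs, and the fetch callback are all hypothetical, and the robots.txt text is parsed from a string so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical bot name; use one that identifies you

# Hypothetical robots.txt content; a real scraper would download the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch_plan(rules, urls, default_delay=1.0):
    """Return the URLs robots.txt allows, plus the delay to wait between requests."""
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    allowed = [url for url in urls if rules.can_fetch(USER_AGENT, url)]
    return allowed, delay


def crawl(urls, fetch, rules, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content;
    it is passed in so the throttling logic can be tested without a network.
    """
    allowed, delay = polite_fetch_plan(rules, urls)
    results = []
    for index, url in enumerate(allowed):
        if index > 0:
            sleep(delay)  # throttle: wait before every request after the first
        results.append(fetch(url))
    return results
```

In practice the fetch callable would wrap an HTTP client that sends the same user agent string, and the delay could be raised further during a site's peak hours.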

It’s also good to consult an attorney if you are unsure of the legalities of your specific web scraping project.

Recent legal cases

There have been a few notable legal cases in recent years regarding web scraping:

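hiQ Labs vs. LinkedIn (2017)

LinkedIn sent hiQ Labs a cease-and-desist letter for scraping public LinkedIn member profiles. hiQ sued and sought an injunction, arguing that its scraping did not violate the CFAA. The district court granted a preliminary injunction in 2017, and the Ninth Circuit affirmed, reasoning that scraping publicly accessible profiles likely does not amount to access “without authorization” under the CFAA. The CFAA ruling did not resolve LinkedIn’s separate breach-of-contract claim based on its user agreement.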

Facebook vs. Power Ventures (2009-2018)

Power Ventures continued to scrape Facebook data after receiving a cease-and-desist letter. Facebook sued, alleging violations of the CFAA and the CAN-SPAM Act. Courts ultimately ruled that Power Ventures had illegally accessed Facebook’s computers.

StubHub vs. Golden Tickets (2015)

Golden Tickets scraped StubHub event data and used it for ticket resales. StubHub sued citing trespass to chattels. The court enjoined Golden Tickets from scraping, crawling, or using StubHub’s data.

These cases help highlight the nuances in interpreting web scraping laws. Outcomes often depend on specifics like scraping methods, circumvention, data sensitivity, ToS violations, and causing server issues.

Best practices for websites

For websites looking to control scraping, here are some best practices:

Implement a clear Terms of Service prohibiting scraping without consent.
Use technical protections like CAPTCHAs, API keys, blocking and rate limiting.
Detect and monitor for scraping activity based on traffic patterns.
Send cease-and-desist letters to suspected scrapers.
Offer data licensing options or official APIs for preferred access.
Identify any non-public data that should not be scraped.
Be cautious in outright blocking parties without first contacting them.
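
Rate limiting, one of the technical protections listed above, can be sketched as a sliding-window counter per client. The Python example below is a simplified illustration, assuming clients are identified by a string such as an IP address and that the caller supplies the current time in seconds. The class name and the thresholds are hypothetical, and a production system would typically rely on its web server or gateway for this.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `max_requests` per client within any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id, now):
        """Return True if the request should be served, False if it should be throttled."""
        window = self.hits[client_id]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True


# Hypothetical policy: at most 3 requests per client in any 10-second window.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
burst = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)]
```

Here the fourth request in the burst is rejected, while the same client is served again once earlier requests fall outside the window. The rejected counts could also feed the traffic-pattern monitoring mentioned in the list.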

However, there are limits on the enforceability of anti-scraping measures, especially for public data. The goal should be deterring abusive scraping, not preventing all copying of public info. Excessive restrictions could stifle innovation and invite legal challenges, for example on fair use or antitrust grounds.

Conclusion

In summary, whether web scraping is legal depends on the specific circumstances:

Scraping public data reasonably and non-intrusively is likely permissible.
Anything violating copyright law, the CFAA, state computer laws, privacy laws or express contracts/ToS may be unlawful.
Liability hinges on factors like methods, data sensitivity, circumvention, server impact and redistribution.
Scrapers should scrape ethically, while websites should balance deterrence and public access.
Open questions about scraping rights remain, and the law continues to evolve.

Parties on both sides should be thoughtful about their web scraping practices and claims. Contacting an attorney for guidance around specific scraping projects or disputes is recommended.

Factor	Favors Legality	Raises Legal Risk
Copyright	Non-copyrightable facts, public domain info	Verbatim copying of creative expression
Terms of Service	No policy against scraping	Express scraping prohibition
Circumvention	Accessing publicly available data	Bypassing passwords, blocks, CAPTCHAs
Server impact	Light load, scraping responsibly	Overloading, crashing servers
Privacy	Only public data, respecting opt-outs	Private user info, violating privacy laws

This table summarizes factors that may sway the legality question in either direction.
