Scrapeless
Scrapeless offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities empower LangChain to integrate and leverage external data more effectively. The core functional modules include:
DeepSerp
- Google Search: Enables comprehensive extraction of Google SERP data across all result types.
- Supports selection of localized Google domains (e.g., `google.com`, `google.ad`) to retrieve region-specific search results.
- Pagination supported for retrieving results beyond the first page.
- Supports a search result filtering toggle to control whether to exclude duplicate or similar content.
- Google Trends: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
- Supports multi-keyword comparison.
- Supports multiple data types: `interest_over_time`, `interest_by_region`, `related_queries`, and `related_topics`.
- Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.
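As a rough sketch of how these DeepSerp capabilities could be driven from LangChain, the snippet below assumes the package exposes `ScrapelessDeepSerpGoogleSearchTool` and `ScrapelessDeepSerpGoogleTrendsTool` with `q`, `hl`, `gl`, and `data_type` parameters; those names are not documented on this page, so verify them against the `langchain-scrapeless` reference.

```python
from langchain_scrapeless import (  # class names are assumptions; verify in the package reference
    ScrapelessDeepSerpGoogleSearchTool,
    ScrapelessDeepSerpGoogleTrendsTool,
)

# Requires SCRAPELESS_API_KEY in the environment (see Credentials below).
search_tool = ScrapelessDeepSerpGoogleSearchTool()
trends_tool = ScrapelessDeepSerpGoogleTrendsTool()

# Region-specific SERP lookup (assumed parameter names: q, hl, gl).
serp_results = search_tool.invoke({"q": "web scraping", "hl": "en", "gl": "us"})
print(serp_results)

# Multi-keyword trend comparison (assumed parameter names: q, data_type).
trend_results = trends_tool.invoke(
    {"q": "langchain,llamaindex", "data_type": "interest_over_time"}
)
print(trend_results)
```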
Universal Scraping
- Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.
- Global premium proxy support for bypassing geo-restrictions and improving reliability.
Crawler
- Crawl: Recursively crawl a website and its linked pages to extract site-wide content.
- Supports configurable crawl depth and scoped URL targeting.
- Scrape: Extract content from a single webpage with high precision.
- Supports "main content only" extraction to exclude ads, footers, and other non-essential elements.
- Allows batch scraping of multiple standalone URLs.
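Similarly, a hedged sketch of the Crawler module: the class names `ScrapelessCrawlerCrawlTool` and `ScrapelessCrawlerScrapeTool` and their `url`, `limit`, and `urls` parameters are assumptions rather than something documented on this page, so check the package reference before relying on them.

```python
from langchain_scrapeless import (  # class names are assumptions; verify in the package reference
    ScrapelessCrawlerCrawlTool,
    ScrapelessCrawlerScrapeTool,
)

# Recursively crawl a site; the assumed `limit` parameter bounds how many pages are visited.
crawl_tool = ScrapelessCrawlerCrawlTool()
site_content = crawl_tool.invoke({"url": "https://example.com", "limit": 5})
print(site_content)

# Scrape a batch of standalone URLs (assumed `urls` parameter).
scrape_tool = ScrapelessCrawlerScrapeTool()
pages = scrape_tool.invoke({"urls": ["https://example.com", "https://example.org"]})
print(pages)
```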
Overview
Integration details
| Class | Package | Serializable | JS support | Package latest |
| --- | --- | --- | --- | --- |
| ScrapelessUniversalScrapingTool | langchain-scrapeless | | | |
Tool features
| Native async | Returns artifact | Return data |
| --- | --- | --- |
| | | html, markdown, links, metadata, structured content |
Setup
The integration lives in the `langchain-scrapeless` package.
!pip install langchain-scrapeless
Credentials
You'll need a Scrapeless API key to use this tool. You can set it as an environment variable:
import os
os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
Instantiation
Here we show how to instantiate an instance of the Scrapeless Universal Scraping Tool. This tool allows you to scrape any website using a headless browser with JavaScript rendering capabilities, customizable output types, and geo-specific proxy support.
The tool accepts the following parameters during instantiation:
- `url` (required, str): The URL of the website to scrape.
- `headless` (optional, bool): Whether to use a headless browser. Default is `True`.
- `js_render` (optional, bool): Whether to enable JavaScript rendering. Default is `True`.
- `js_wait_until` (optional, str): Defines when to consider the JavaScript-rendered page ready. Default is `'domcontentloaded'`. Options include:
  - `load`: Wait until the page is fully loaded.
  - `domcontentloaded`: Wait until the DOM is fully loaded.
  - `networkidle0`: Wait until the network is idle.
  - `networkidle2`: Wait until the network is idle for 2 seconds.
- `outputs` (optional, str): The specific type of data to extract from the page. Options include: `phone_numbers`, `headings`, `images`, `audios`, `videos`, `links`, `menus`, `hashtags`, `emails`, `metadata`, `tables`, `favicon`.
- `response_type` (optional, str): Defines the format of the response. Default is `'html'`. Options include:
  - `html`: Return the raw HTML of the page.
  - `plaintext`: Return the plain text content.
  - `markdown`: Return a Markdown version of the page.
  - `png`: Return a PNG screenshot.
  - `jpeg`: Return a JPEG screenshot.
- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using screenshot output (png or jpeg). Default is `False`.
- `selector` (optional, str): A specific CSS selector to scope scraping within a part of the page. Default is `None`.
- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., `'us'`, `'gb'`, `'de'`, `'jp'`). Default is `'ANY'`.
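For example, combining several of the parameters above, a configured instance might look like the sketch below (passing them as constructor keyword arguments follows the description above; double-check against the tool's signature):

```python
from langchain_scrapeless import ScrapelessUniversalScrapingTool

# A configured instance; parameter names follow the list above.
tool = ScrapelessUniversalScrapingTool(
    js_render=True,                # render JavaScript before extracting content
    js_wait_until="networkidle0",  # treat the page as ready once the network is idle
    response_type="markdown",      # return a Markdown version of the page
    proxy_country="de",            # route the request through a German proxy
)
```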
Invocation
Basic Usage
from langchain_scrapeless import ScrapelessUniversalScrapingTool
tool = ScrapelessUniversalScrapingTool()
# Basic usage
result = tool.invoke("https://example.com")
print(result)
<!DOCTYPE html><html><head>
<title>Example Domain</title>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body></html>
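Beyond passing a bare URL, the documented parameters can also be combined with the target URL. The sketch below assumes the tool accepts them as a dictionary at invoke time; if your version only takes them at instantiation, configure them there instead.

```python
from langchain_scrapeless import ScrapelessUniversalScrapingTool

tool = ScrapelessUniversalScrapingTool()

# Assumption: the documented parameters can also be supplied per call as a dict.
result = tool.invoke(
    {
        "url": "https://example.com",
        "response_type": "markdown",      # return Markdown instead of raw HTML
        "js_wait_until": "networkidle0",  # wait for the network to go idle
    }
)
print(result)
```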