💄 style: add blockAds & stealth params for Browserless (#8255)

*  feat: add blockAds & stealth params for Browserless

* Apply `sourcery-ai` suggestion

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* 📝 docs: add docs for `BROWSERLESS_BLOCK_ADS` & `BROWSERLESS_STEALTH_MODE`

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
This commit is contained in:
Zhijie He
2025-06-23 14:55:17 +08:00
committed by GitHub
parent f63b137428
commit 2ff3efa630
3 changed files with 79 additions and 1 deletions
@@ -88,6 +88,40 @@ BROWSERLESS_URL=https://chrome.browserless.io
---
## `BROWSERLESS_BLOCK_ADS`
Enables ad blocking. When [Browserless](https://www.browserless.io/) is used for web scraping, it automatically blocks common ad resources (such as scripts, images, and trackers), which speeds up scraping and yields cleaner page content.
```env
BROWSERLESS_BLOCK_ADS=1
```
> 📌 Supported values:
>
> * `1`: Enable ad blocking (recommended);
> * `0`: Disable ad blocking (default).
> ✅ Recommended for use together with `BROWSERLESS_STEALTH_MODE=1` to improve stealth and the scraping success rate.
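
As a rough sketch of how such a flag is read, the snippet below mirrors the strict comparison with `'1'` used in the crawler code of this commit. The helper name `isFlagEnabled` is hypothetical and not part of the codebase; it only illustrates that values such as `true` or `yes` would leave the feature disabled.

```typescript
// Hypothetical helper (not in the codebase): mirrors the `=== '1'` check
// applied to BROWSERLESS_BLOCK_ADS and BROWSERLESS_STEALTH_MODE.
function isFlagEnabled(value: string | undefined): boolean {
  // Only the exact string '1' enables the feature; anything else,
  // including an unset variable, is treated as disabled.
  return value === '1';
}

const blockAdsEnabled = isFlagEnabled('1'); // enabled
const blockAdsFromTrue = isFlagEnabled('true'); // disabled: not the literal '1'
const blockAdsUnset = isFlagEnabled(undefined); // disabled: variable not set
```

Under this reading, `BROWSERLESS_BLOCK_ADS=true` would silently keep ad blocking off, so the value should be written as `1`.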
---
## `BROWSERLESS_STEALTH_MODE`
Enables stealth mode. When using [Browserless](https://www.browserless.io/) for web scraping, it applies various anti-detection techniques (such as modifying the user agent, removing webdriver traits, simulating user interactions) to bypass anti-bot mechanisms.
```env
BROWSERLESS_STEALTH_MODE=1
```
> 📌 Supported values:
>
> * `1`: Enable stealth mode (recommended);
> * `0`: Disable stealth mode (default).
> ⚠️ Some websites use advanced anti-scraping techniques. Enabling stealth mode can significantly improve the scraping success rate on such sites.
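
The following is a minimal sketch, under stated assumptions, of how the two flags might end up in the Browserless `/content` request URL. It follows the query shape used in this commit (`blockAds`, `launch` with a JSON-encoded `stealth` field, and `token`), but uses the built-in `URL` API instead of the project's `qs.stringifyUrl` and `urlJoin` helpers; `buildContentUrl` and `BrowserlessFlags` are hypothetical names.

```typescript
// Hypothetical types and helper for illustration only.
interface BrowserlessFlags {
  blockAds: boolean;
  stealth: boolean;
}

function buildContentUrl(baseUrl: string, token: string, flags: BrowserlessFlags): string {
  // Note: an absolute path here replaces any existing path on baseUrl,
  // which differs from a path-joining helper such as urlJoin.
  const url = new URL('/content', baseUrl);

  // Booleans are serialized as the strings 'true' / 'false'.
  url.searchParams.set('blockAds', String(flags.blockAds));
  // Launch options travel as a JSON string in a single query parameter.
  url.searchParams.set('launch', JSON.stringify({ stealth: flags.stealth }));
  url.searchParams.set('token', token);

  return url.toString();
}

// Example usage with a placeholder token.
const exampleUrl = buildContentUrl('https://chrome.browserless.io', 'YOUR_TOKEN', {
  blockAds: true,
  stealth: true,
});
```

Keeping `stealth` inside the JSON-encoded `launch` parameter, rather than as a top-level query key, matches how the commit passes it.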
---
## `GOOGLE_PSE_ENGINE_ID`
Configure the Search Engine ID for Google Programmable Search Engine (Google PSE), used to restrict the search scope. Must be used alongside `GOOGLE_PSE_API_KEY`.
@@ -84,6 +84,40 @@ BROWSERLESS_URL=https://chrome.browserless.io
---
## `BROWSERLESS_BLOCK_ADS`
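Enables ad blocking. When [Browserless](https://www.browserless.io/) is used for web scraping, common ad resources (such as scripts, images, and trackers) are blocked automatically, improving scraping speed and page clarity.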
```env
BROWSERLESS_BLOCK_ADS=1
```
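> 📌 Supported values:
>
> * `1`: Enable ad blocking (recommended);
> * `0`: Disable ad blocking (default).
> ✅ Recommended for use together with `BROWSERLESS_STEALTH_MODE=1` to improve the crawler's stealth and success rate.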
---
## `BROWSERLESS_STEALTH_MODE`
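Enables stealth mode. When [Browserless](https://www.browserless.io/) is used to scrape web pages, a set of anti-detection measures (such as modifying the UA, removing webdriver traits, and simulating user actions) is applied to evade anti-crawler mechanisms.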
```env
BROWSERLESS_STEALTH_MODE=1
```
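> 📌 Supported values:
>
> * `1`: Enable stealth mode (recommended);
> * `0`: Disable stealth mode (default).
> ⚠️ Some websites have advanced anti-crawler mechanisms. Enabling stealth mode can significantly improve the scraping success rate.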
---
## `GOOGLE_PSE_ENGINE_ID`
Configures the search engine ID for Google Programmable Search Engine (Google PSE), used to limit the search scope. Must be used together with `GOOGLE_PSE_API_KEY`.
@@ -10,6 +10,9 @@ const REJECT_REQUEST_PATTERN =
'.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$';
const BROWSERLESS_TOKEN = process.env.BROWSERLESS_TOKEN;
const BROWSERLESS_BLOCK_ADS = process.env.BROWSERLESS_BLOCK_ADS === '1';
const BROWSERLESS_STEALTH_MODE = process.env.BROWSERLESS_STEALTH_MODE === '1';
class BrowserlessInitError extends Error {
constructor() {
super('`BROWSERLESS_URL` or `BROWSERLESS_TOKEN` are required');
@@ -30,7 +33,14 @@ export const browserless: CrawlImpl = async (url, { filterOptions }) => {
try {
const res = await fetch(
- qs.stringifyUrl({ query: { token: BROWSERLESS_TOKEN }, url: urlJoin(BASE_URL, '/content') }),
+ qs.stringifyUrl({
+ query: {
+ blockAds: BROWSERLESS_BLOCK_ADS,
+ launch: JSON.stringify({ stealth: BROWSERLESS_STEALTH_MODE }),
+ token: BROWSERLESS_TOKEN,
+ },
+ url: urlJoin(BASE_URL, '/content'),
+ }),
{
body: JSON.stringify(input),
headers: {