mirror of
https://github.com/lobehub/lobe-chat.git
synced 2026-06-14 03:30:19 +00:00
💄 style: add blockAds & stealth params for Browserless (#8255)
* ✨ feat: add blockAds & stealth params for Browserless
* Apply `sourcery-ai` suggestion

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* 📝 docs: add docs for `BROWSERLESS_BLOCK_ADS` & `BROWSERLESS_STEALTH_MODE`

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@@ -88,6 +88,40 @@ BROWSERLESS_URL=https://chrome.browserless.io

 ---

+## `BROWSERLESS_BLOCK_ADS`
+
+Enables ad blocking functionality. When using [Browserless](https://www.browserless.io/) for web scraping, it automatically blocks common ad resources (such as scripts, images, trackers, etc.), improving scraping speed and page clarity.
+
+```env
+BROWSERLESS_BLOCK_ADS=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable ad blocking (recommended);
+> * `0`: Disable ad blocking (default).
+
+> ✅ It is recommended to use with `BROWSERLESS_STEALTH_MODE=1` to enhance stealth and scraping success rate.
+
+---
+
+## `BROWSERLESS_STEALTH_MODE`
+
+Enables stealth mode. When using [Browserless](https://www.browserless.io/) for web scraping, it applies various anti-detection techniques (such as modifying the user agent, removing webdriver traits, simulating user interactions) to bypass anti-bot mechanisms.
+
+```env
+BROWSERLESS_STEALTH_MODE=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable stealth mode (recommended);
+> * `0`: Disable stealth mode (default).
+
+> ⚠️ Some websites use advanced anti-scraping techniques. Enabling stealth mode can significantly improve scraping success rate.
+
+---
+
 ## `GOOGLE_PSE_ENGINE_ID`

 Configure the Search Engine ID for Google Programmable Search Engine (Google PSE), used to restrict the search scope. Must be used alongside `GOOGLE_PSE_API_KEY`.
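The "Supported values" lists above can be read alongside the crawler change in this commit, which compares each variable against the string `'1'`. The sketch below illustrates what that strict comparison implies: any value other than `'1'` (including `'true'` or an unset variable) leaves the feature disabled. The `Env` type, `isFlagEnabled` helper, and `sampleEnv` object are hypothetical names introduced for illustration and are not part of the commit.

```typescript
// Hypothetical helper, not from the commit: mirrors a strict `=== '1'` check.
type Env = Record<string, string | undefined>;

const isFlagEnabled = (env: Env, name: string): boolean => env[name] === '1';

// Sample environment for illustration only.
const sampleEnv: Env = {
  BROWSERLESS_BLOCK_ADS: '1', // exactly '1': enabled
  BROWSERLESS_STEALTH_MODE: 'true', // not '1': treated as disabled
};

const blockAds = isFlagEnabled(sampleEnv, 'BROWSERLESS_BLOCK_ADS');
const stealth = isFlagEnabled(sampleEnv, 'BROWSERLESS_STEALTH_MODE');
const unset = isFlagEnabled(sampleEnv, 'SOME_UNSET_FLAG');

console.log({ blockAds, stealth, unset });
```

Under this reading, `0`, an empty string, and a missing variable all behave the same way, which matches the documented default of disabled.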
@@ -84,6 +84,40 @@ BROWSERLESS_URL=https://chrome.browserless.io

 ---

+## `BROWSERLESS_BLOCK_ADS`
+
+Enables ad blocking. When [Browserless](https://www.browserless.io/) is used for web scraping, common ad resources (such as scripts, images, and trackers) are blocked automatically, which improves scraping speed and page clarity.
+
+```env
+BROWSERLESS_BLOCK_ADS=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable ad blocking (recommended);
+> * `0`: Disable ad blocking (default).
+
+> ✅ Using it together with `BROWSERLESS_STEALTH_MODE=1` is recommended to improve the crawler's stealth and success rate.
+
+---
+
+## `BROWSERLESS_STEALTH_MODE`
+
+Enables stealth mode. When [Browserless](https://www.browserless.io/) is used to scrape web pages, a set of anti-detection measures (such as modifying the UA, removing webdriver traits, and simulating user actions) is applied to evade anti-crawler mechanisms.
+
+```env
+BROWSERLESS_STEALTH_MODE=1
+```
+
+> 📌 Supported values:
+>
+> * `1`: Enable stealth mode (recommended);
+> * `0`: Disable stealth mode (default).
+
+> ⚠️ Some websites have advanced anti-scraping mechanisms. Enabling stealth mode can significantly improve the scraping success rate.
+
+---
+
 ## `GOOGLE_PSE_ENGINE_ID`

 Configures the search engine ID for Google Programmable Search Engine (Google PSE), used to restrict the search scope. Must be used together with `GOOGLE_PSE_API_KEY`.
@@ -10,6 +10,9 @@ const REJECT_REQUEST_PATTERN =
   '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$';
 const BROWSERLESS_TOKEN = process.env.BROWSERLESS_TOKEN;

+const BROWSERLESS_BLOCK_ADS = process.env.BROWSERLESS_BLOCK_ADS === '1';
+const BROWSERLESS_STEALTH_MODE = process.env.BROWSERLESS_STEALTH_MODE === '1';
+
 class BrowserlessInitError extends Error {
   constructor() {
     super('`BROWSERLESS_URL` or `BROWSERLESS_TOKEN` are required');
@@ -30,7 +33,14 @@ export const browserless: CrawlImpl = async (url, { filterOptions }) => {

   try {
     const res = await fetch(
-      qs.stringifyUrl({ query: { token: BROWSERLESS_TOKEN }, url: urlJoin(BASE_URL, '/content') }),
+      qs.stringifyUrl({
+        query: {
+          blockAds: BROWSERLESS_BLOCK_ADS,
+          launch: JSON.stringify({ stealth: BROWSERLESS_STEALTH_MODE }),
+          token: BROWSERLESS_TOKEN,
+        },
+        url: urlJoin(BASE_URL, '/content'),
+      }),
       {
         body: JSON.stringify(input),
         headers: {
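To make the shape of the resulting request URL easier to see, here is a minimal, self-contained sketch of the same query construction. It deliberately swaps the commit's `qs.stringifyUrl` and `urlJoin` calls for the standard `URL` and `URLSearchParams` APIs so it runs without dependencies. The `buildContentUrl` function and the `demo-token` value are hypothetical, and the exact serialization produced by the real helpers may differ (for example, `new URL('/content', base)` discards any path already present on the base URL, which `urlJoin` would likely keep).

```typescript
// Hypothetical stand-in for the commit's qs.stringifyUrl/urlJoin call,
// written with the standard URL API for illustration only.
const buildContentUrl = (
  baseUrl: string,
  token: string,
  blockAds: boolean,
  stealth: boolean,
): string => {
  const url = new URL('/content', baseUrl);
  // Booleans are sent as the strings "true" / "false".
  url.searchParams.set('blockAds', String(blockAds));
  // `launch` carries a JSON-encoded object, as in the diff above.
  url.searchParams.set('launch', JSON.stringify({ stealth }));
  url.searchParams.set('token', token);
  return url.toString();
};

// Example values; the token is a placeholder, not a real credential.
const exampleUrl = buildContentUrl('https://chrome.browserless.io', 'demo-token', true, true);
console.log(exampleUrl);

// Reading the parameters back shows what the server would receive.
const parsed = new URL(exampleUrl).searchParams;
console.log(parsed.get('blockAds'), parsed.get('launch'));
```

Sending `stealth` inside a JSON-encoded `launch` parameter rather than as a top-level flag follows the structure used in the diff; the sketch keeps that nesting so the two parameters can be compared side by side.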