mirror of
https://github.com/lobehub/lobe-chat.git
synced 2026-06-13 19:20:04 +00:00
@lobechat/web-crawler
LobeHub's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
📝 Introduction
@lobechat/web-crawler is a core component of LobeHub responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
🛠️ Core Features
- Intelligent Content Extraction: Identifies the main content of a page using Mozilla's Readability algorithm
- Multi-level Crawling Strategy: Supports multiple crawling implementations including basic crawling, Jina, Search1API, and Browserless rendering
- Custom URL Rules: Handles specific website crawling logic through a flexible rule system
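To make the multi-level strategy concrete, here is a minimal sketch of one way several crawling implementations could be tried in order until one yields content. The `CrawlImpl` type, the `crawlWithFallback` function, and the first-success-wins behavior are assumptions made for illustration; they are not taken from the module's source, which may order or combine implementations differently.

```typescript
// Hypothetical type for illustration: an implementation takes a URL and
// resolves to Markdown content, or undefined when it could not extract any.
type CrawlImpl = (url: string) => Promise<string | undefined>;

// Try each named implementation in the given order and return the first
// non-empty result. Errors from one implementation are swallowed so the
// next one gets a chance (an assumed policy, not the module's documented one).
async function crawlWithFallback(
  url: string,
  order: string[],
  impls: Record<string, CrawlImpl>,
): Promise<{ content: string; impl: string } | undefined> {
  for (const name of order) {
    const impl = impls[name];
    if (!impl) continue; // unknown implementation name: skip it

    try {
      const content = await impl(url);
      if (content && content.trim().length > 0) {
        return { content, impl: name };
      }
    } catch {
      // This implementation failed; fall through to the next one.
    }
  }

  // Every implementation failed or returned empty content.
  return undefined;
}
```

With stub implementations registered under names such as `naive` and `jina`, a call like `crawlWithFallback(url, ['naive', 'jina'], impls)` would only reach the second entry if the first one throws or returns nothing.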
🤝 Contribution
Web structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can contribute as follows:
How to Contribute URL Rules
- Add new rules to the urlRules.ts file
- Rule example:
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',
    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',
    // Optional: specify crawling implementation, supports 'naive', 'jina', 'search1api', and 'browserless'
    impls: ['naive', 'jina', 'search1api', 'browserless'],
    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable Readability algorithm for filtering distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
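As a rough illustration of how a rule like this might be applied, the sketch below matches a URL against each rule's `urlPattern` and, when `urlTransform` is present, substitutes capture groups such as `$1` into it. The `CrawlUrlRule` and `ResolvedRule` interfaces and the `resolveUrlRule` function are hypothetical names introduced here, and the first-match-wins behavior is an assumption; the module's actual matching logic may differ.

```typescript
// Hypothetical rule shape, mirroring the fields shown in the rule example.
interface CrawlUrlRule {
  urlPattern: string;
  urlTransform?: string;
  impls?: string[];
  filterOptions?: { enableReadability?: boolean; pureText?: boolean };
}

// Hypothetical result: the URL to crawl plus the options of the matched rule.
interface ResolvedRule {
  transformedUrl: string;
  impls?: string[];
  filterOptions?: CrawlUrlRule['filterOptions'];
}

// Assumed behavior: the first rule whose pattern matches wins; if no rule
// matches, the URL is returned unchanged with no extra options.
function resolveUrlRule(url: string, rules: CrawlUrlRule[]): ResolvedRule {
  for (const rule of rules) {
    const regex = new RegExp(rule.urlPattern);
    if (!regex.test(url)) continue;

    // String.prototype.replace expands $1, $2, ... from the capture groups.
    const transformedUrl = rule.urlTransform ? url.replace(regex, rule.urlTransform) : url;

    return { transformedUrl, impls: rule.impls, filterOptions: rule.filterOptions };
  }

  return { transformedUrl: url };
}
```

Under these assumptions, the example rule would rewrite `https://example.com/articles/hello` to `https://example.com/print/hello` before crawling. Note that the pattern is a regular expression, so an unescaped `.` matches any character; escaping it (`example\\.com`) makes a rule stricter.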
Rule Submission Process
- Fork the LobeHub repository
- Add or modify URL rules
- Submit a Pull Request describing:
  - Target website characteristics
  - Problems solved by the rule
  - Test cases (example URLs)
📌 Note
This is an internal module of LobeHub (marked "private": true in its package.json). It is designed specifically for LobeHub and is not published as a standalone package.