Sync Data from Website

This document explains how to scrape data from a web page, parse it into Markdown, and import it into a Dify knowledge base.

The Dify knowledge base can crawl content from public web pages through third-party tools such as Jina Reader and Firecrawl, which parse the content into Markdown before it is imported.

Firecrawl and Jina Reader are both open-source web parsing tools. They convert web pages into clean Markdown text that LLMs can read easily, and both provide easy-to-use API services.

The following sections describe how to use Firecrawl and Jina Reader, respectively.

Firecrawl

1. Configure Firecrawl API credentials

Click your avatar in the upper-right corner, go to the DataSource page, and click the Configure button next to Firecrawl.

Register on the Firecrawl website, get your API key, then enter it in Dify and save it.

2. Scrape the target web page

On the knowledge base creation page, select Sync from website, choose Firecrawl as the provider, and enter the target URL to be crawled.

The configuration options are: whether to crawl sub-pages, the page crawling limit, the maximum scraping depth, excluded paths, include-only paths, and the content extraction scope. After completing the configuration, click Run to preview the parsed pages.
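
As a rough illustration of how these options could map onto a crawl request, the sketch below builds (without sending) a request for Firecrawl's hosted crawl API. The endpoint URL, the JSON field names (limit, maxDepth, excludePaths, includePaths, scrapeOptions.onlyMainContent), the helper name build_crawl_request, and the mapping from Dify's options to those fields are all assumptions made for illustration. They are not taken from Dify, so check the Firecrawl API reference for the version you use.

```python
import json
import urllib.request

# Assumed endpoint of Firecrawl's hosted v1 crawl API; verify against the Firecrawl docs.
FIRECRAWL_CRAWL_URL = "https://api.firecrawl.dev/v1/crawl"


def build_crawl_request(api_key, url, crawl_sub_pages=True, limit=10,
                        max_depth=2, exclude_paths=None, include_paths=None,
                        only_main_content=True):
    """Build, but do not send, a crawl request mirroring the options above.

    Hypothetical helper: the field names are assumed, not confirmed by Dify.
    """
    payload = {
        "url": url,
        # With sub-page crawling off, restrict the crawl to the target page only.
        "limit": limit if crawl_sub_pages else 1,
        "scrapeOptions": {
            "formats": ["markdown"],
            # Assumed counterpart of the "content extraction scope" option.
            "onlyMainContent": only_main_content,
        },
    }
    if crawl_sub_pages:
        payload["maxDepth"] = max_depth
    if exclude_paths:
        payload["excludePaths"] = list(exclude_paths)
    if include_paths:
        payload["includePaths"] = list(include_paths)

    return urllib.request.Request(
        FIRECRAWL_CRAWL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_crawl_request(
    api_key="fc-YOUR-API-KEY",        # placeholder, not a real key
    url="https://example.com/docs",   # placeholder target URL
    limit=5,
    exclude_paths=["blog/*"],
)
# Inspect the JSON body that would be sent.
print(json.dumps(json.loads(request.data), indent=2))
```

Passing the request to urllib.request.urlopen would actually submit the crawl. The sketch stops short of that so the payload can be inspected first.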

3. Review import results

After import, the parsed text from the web page is stored in the knowledge base documents. Review the import results, and click Add URL to import more web pages.


Jina Reader

1. Configure Jina Reader API credentials

Click your avatar in the upper-right corner, go to the DataSource page, and click the Configure button next to Jina Reader.

Register on the Jina Reader website, get your API key, then enter it in Dify and save it.

2. Crawl web content with Jina Reader

On the knowledge base creation page, select Sync from website, choose Jina Reader as the provider, and enter the target URL to be crawled.

The configuration options are: whether to crawl sub-pages, the maximum number of pages to crawl, and whether to use the sitemap for crawling. After completing the configuration, click the Run button to preview the page links to be crawled.
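
For orientation, the sketch below shows how a single page could be requested from Jina Reader directly, outside Dify. It assumes the public Reader endpoint, which takes the target URL appended to https://r.jina.ai/, and Bearer-token authentication with the API key. The helper name build_reader_request and the placeholder values are hypothetical. The sketch covers one page only, because how Dify drives the sub-page and sitemap options is not described here.

```python
import urllib.request

# Assumed public Reader endpoint: the target URL is appended to this prefix.
READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(api_key, target_url):
    """Build, but do not send, a GET request asking Reader for one page as Markdown.

    Hypothetical helper for illustration; it is not part of Dify or Jina Reader.
    """
    return urllib.request.Request(
        READER_PREFIX + target_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )


request = build_reader_request(
    api_key="jina_YOUR_API_KEY",      # placeholder, not a real key
    target_url="https://example.com/docs",
)
print(request.full_url)
```

Sending the request with urllib.request.urlopen and decoding the response body as UTF-8 should yield the page content as Markdown text, under the assumptions above.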

3. Review import results

After import, the parsed text from the web pages is stored in the knowledge base documents. Review the import results, and click the Add URL button on the right to import more web pages.

Once crawling is complete, the content from the web pages is added to the knowledge base.
