{"id":1321,"date":"2022-01-07T05:30:12","date_gmt":"2022-01-07T05:30:12","guid":{"rendered":"https:\/\/multisite.korebots.com\/SearchAssist\/?p=1321"},"modified":"2022-02-24T11:47:38","modified_gmt":"2022-02-24T11:47:38","slug":"how-to-add-content-by-web-crawling","status":"publish","type":"post","link":"https:\/\/multisite.korebots.com\/SearchAssist\/concepts\/content-sources\/crawling-webpages\/how-to-add-content-by-web-crawling\/","title":{"rendered":"Adding Content by Web Crawling"},"content":{"rendered":"<section class=\"l-section wpb_row height_auto\"><div class=\"l-section-h i-cf\"><div class=\"g-cols vc_row via_grid cols_1 laptops-cols_inherit tablets-cols_inherit mobiles-cols_1 valign_top type_default stacking_default\"><div class=\"wpb_column vc_column_container\"><div class=\"vc_column-inner\"><div class=\"wpb_text_column\"><div class=\"wpb_wrapper\"><h3><span class=\"ez-toc-section\" id=\"Introduction\"><\/span><span style=\"font-weight: 400;\">Introduction<\/span><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Organizations usually have web pages that describe their offerings, such as product information or process knowledge, which users can query. As a business user, you can leverage this content by mapping your Search Assistant to it and processing it so that it is available for user queries.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">SearchAssist enables you to ingest content into your Search Assistant through web crawling. For example, consider a banking website that already contains the information needed to answer most search user queries. 
In this scenario, the Search Assistant is configured to crawl the bank\u2019s website and index all of its web pages so that the indexed pages can be retrieved to answer search users\u2019 queries.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">SearchAssist allows you to schedule automated web crawling sessions of target URLs at a required frequency or within a desired time window.<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Adding_Content_by_Web_Crawling\"><\/span><span style=\"font-weight: 400;\">Adding Content by Web Crawling<\/span><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To crawl web domains, take the following steps:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Log in to the application with valid credentials<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Click the <\/span><b>Indices <\/b><span style=\"font-weight: 400;\">tab at the top<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">In the left pane, under the <\/span><b>Sources <\/b><span style=\"font-weight: 400;\">section, click <\/span><b>Content<\/b><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">In the <strong>+Add Content<\/strong> dropdown, click <\/span><b>Crawl Web Domain<\/b><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">In the Crawl Web Domain dialog box, enter the domain URL in the <\/span><b>Source URL<\/b><span style=\"font-weight: 400;\"> field<\/span><\/li>\n<li style=\"font-weight: 400;\">Enter a name in the <b>Source Title<\/b> field and a description in the <b>Description <\/b>field<\/li>\n<\/ul>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1447 size-large\" 
src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling2-1024x565.png\" alt=\"\" width=\"640\" height=\"353\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling2-1024x565.png 1024w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling2-300x165.png 300w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling2.png 1523w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">To schedule <\/span><span style=\"font-weight: 400;\">recurring web crawls<\/span><span style=\"font-weight: 400;\">, under the <\/span><b>Schedule <\/b><span style=\"font-weight: 400;\">section, turn the toggle on<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Set the start <\/span><b>Date<\/b><span style=\"font-weight: 400;\"> and <\/span><b>Time<\/b><span style=\"font-weight: 400;\">, and the <\/span><b>Frequency<\/b><span style=\"font-weight: 400;\"> at which the crawl is to run <\/span>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Select the time zone from the dropdown. <\/span>SearchAssist currently supports only three time zones: IST, EST, and UTC<a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling_time-zones-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1484 size-medium\" src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling_time-zones-1-300x79.png\" alt=\"\" width=\"300\" height=\"79\" 
srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling_time-zones-1-300x79.png 300w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling_time-zones-1.png 980w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">From the <\/span><b>Frequency <\/b><span style=\"font-weight: 400;\">dropdown, select <\/span><b>Does not repeat<\/b><span style=\"font-weight: 400;\"> for a one-time crawl, or select <\/span><b>Daily<\/b><span style=\"font-weight: 400;\">, <\/span><b>Weekly<\/b><span style=\"font-weight: 400;\">, <\/span><b>Monthly<\/b><span style=\"font-weight: 400;\">, <\/span><b>Annually<\/b><span style=\"font-weight: 400;\">, or <\/span><b>Custom Recurrence<\/b><span style=\"font-weight: 400;\"> to trigger the crawl repeatedly at the time selected <\/span><span style=\"font-weight: 400;\">in the previous steps<a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_frequency.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1449 size-medium aligncenter\" src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_frequency-300x237.png\" alt=\"\" width=\"300\" height=\"237\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_frequency-300x237.png 300w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_frequency.png 495w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li 
style=\"list-style-type: none;\">\n<ul>\n<li><b><span style=\"font-weight: 400;\">To fine-tune the schedule, use <\/span>Custom Recurrence<span style=\"font-weight: 400;\">, which lets you set the crawl to <\/span>Repeat Every<span style=\"font-weight: 400;\"> given number of units (<\/span>Day, Week, Month, or Year<span style=\"font-weight: 400;\">) and to <\/span>Repeat On<span style=\"font-weight: 400;\"> specific days of the week<\/span><span style=\"font-weight: 400;\"><a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1450 size-medium aligncenter\" src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling2-284x300.png\" alt=\"\" width=\"284\" height=\"300\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling2-284x300.png 284w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Manage_Sources_webcrawling_sceduling2.png 568w\" sizes=\"(max-width: 284px) 100vw, 284px\" \/><\/a><\/span><\/b><b><\/b>\n<ul>\n<li><span style=\"font-weight: 400;\">Choose to apply the custom recurrence either for an <\/span><span style=\"font-weight: 400;\">\u00a0<\/span>\n<ul>\n<li><b><\/b><b><\/b><span style=\"font-weight: 400;\">indefinite period<\/span> <span style=\"font-weight: 400;\">to<\/span><b> never end <\/b><span style=\"font-weight: 400;\">or for a time span to: <\/span>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">end <\/span><b>On <\/b><span style=\"font-weight: 400;\">a date that you select from the date selector, or<\/span><\/li>\n<li><span 
style=\"font-weight: 400;\">end <\/span><b>At <\/b><span style=\"font-weight: 400;\">(after) the number of occurrences that you enter in the field next to it<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">From the <\/span><b>Crawl Options<\/b><span style=\"font-weight: 400;\"> dropdown list, select one of the following options:<\/span>\n<ul>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Crawl Everything<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 Crawls all the URLs that belong to the web domain<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Crawl Everything Except Specific URLs<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 Lets you list the URLs within the web domain that you want to exclude from crawling<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Crawl Only Specific URLs<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 Lets you list the only URLs within the web domain that you want to crawl<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Apply <\/span><b>Crawl Settings<\/b><span style=\"font-weight: 400;\"> as required:<\/span>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Select <\/span><i><span style=\"font-weight: 400;\">Javascript-rendered<\/span><\/i><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"> to allow crawling of websites whose content is rendered through JavaScript<a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2022\/01\/crawl-options-javascript_conditions.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1901 size-full\" 
src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2022\/01\/crawl-options-javascript_conditions.png\" alt=\"\" width=\"789\" height=\"45\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2022\/01\/crawl-options-javascript_conditions.png 789w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2022\/01\/crawl-options-javascript_conditions-300x17.png 300w\" sizes=\"(max-width: 789px) 100vw, 789px\" \/><\/a><\/span><\/span>Note: When this option is selected, JavaScript-rendered pages on the target website are included in the crawl. If it is unchecked, those pages are ignored.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Crawl Beyond Sitemap<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 <\/span><span style=\"font-weight: 400;\">Within the given domain,<\/span><span style=\"font-weight: 400;\"> selecting this option allows crawling web pages beyond the URLs listed in the sitemap file of the target website. <\/span><span style=\"font-weight: 400;\">To restrict crawling to the sitemap, unselect <\/span><i><span style=\"font-weight: 400;\">Crawl Beyond Sitemap<\/span><\/i><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Use Cookies<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 Allows crawling web pages that require cookie acceptance. 
<\/span><span style=\"font-weight: 400;\">Unselect it to ignore web pages that require cookie acceptance<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Respect robots.txt<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 Honors any directives in the robots.txt file for the web domain<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Depth of Crawling: <\/span><\/i><span style=\"font-weight: 400;\">With the homepage at the top of the site page hierarchy and the inner pages linked from it at the lower levels, the crawl depth specifies how deep into those nested levels the crawler will reach<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Scope of Crawling: <\/span><\/i><span style=\"font-weight: 400;\">From 1 to the maximum number of URLs allowed to be crawled<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Crawl Depth<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 The maximum depth to which any site is crawled <\/span><span style=\"font-weight: 400;\">can be specified<\/span><span style=\"font-weight: 400;\">; the value 0 indicates no limit<\/span><\/li>\n<li style=\"font-weight: 400;\"><i><span style=\"font-weight: 400;\">Max URL Limit<\/span><\/i><span style=\"font-weight: 400;\"> \u2013 The maximum number of URLs to be crawled <\/span><span style=\"font-weight: 400;\">can be specified<\/span><span style=\"font-weight: 400;\">; the value 0 indicates no limit<\/span><\/li>\n<\/ul>\n<\/li>\n<li><span style=\"font-weight: 400;\">Click <\/span><b>Proceed<\/b><\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Errors_in_Adding_Content_by_Web-Crawling\"><\/span>Errors in Adding Content by Web-Crawling<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul>\n<li style=\"list-style-type: none;\"><span style=\"font-weight: 400;\">The web crawling feature can fail in two 
stages:<\/span>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The crawl fails to start because URL validation fails, <\/span>due to either connectivity issues or a misspelled URL. Click<b> Retry<\/b> or <b>Edit Configuration <\/b>to edit the URL <a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1455 aligncenter\" src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling1.png\" alt=\"\" width=\"252\" height=\"179\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling1.png 425w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling1-300x213.png 300w\" sizes=\"(max-width: 252px) 100vw, 252px\" \/><\/a><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The crawl fails during web crawling, after URL validation for the given website has succeeded. 
<a ref=\"magnificPopup\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1456 aligncenter\" src=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling2.png\" alt=\"\" width=\"293\" height=\"181\" srcset=\"https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling2.png 535w, https:\/\/multisite.korebots.com\/SearchAssist\/wp-content\/uploads\/sites\/18\/2021\/12\/Errors_web-crawling2-300x186.png 300w\" sizes=\"(max-width: 293px) 100vw, 293px\" \/><\/a><\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Note: If you attempt to crawl the same site again without enabling the scheduling toggle, a duplication warning appears, reading \u201cWeb crawling cannot be duplicated\u201d. Use a scheduled crawl instead.<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/li>\n<\/ul>\n<\/div><\/div><\/div><\/div><\/div><\/div><\/section><section class=\"l-section wpb_row height_auto\"><div class=\"l-section-h i-cf\"><div class=\"g-cols vc_row via_grid cols_1 laptops-cols_inherit tablets-cols_inherit mobiles-cols_1 valign_top type_default stacking_default\"><div class=\"wpb_column vc_column_container\"><div class=\"vc_column-inner\"><div class=\"w-post-elm post_navigation layout_simple inv_false\"><a class=\"post_navigation-item order_first to_prev\" href=\"https:\/\/multisite.korebots.com\/SearchAssist\/concepts\/content-sources\/manage-sources\/\" title=\"Overview\"><div class=\"post_navigation-item-arrow\"><\/div><div class=\"post_navigation-item-meta\">Previous Post<\/div><div class=\"post_navigation-item-title\"><span>Overview<\/span><\/div><\/a><a class=\"post_navigation-item order_second to_next\" 
href=\"https:\/\/multisite.korebots.com\/SearchAssist\/concepts\/content-sources\/unstructured-data\/how-to-add-content-by-file-upload\/\" title=\"Adding Content by File Upload\"><div class=\"post_navigation-item-arrow\"><\/div><div class=\"post_navigation-item-meta\">Next Post<\/div><div class=\"post_navigation-item-title\"><span>Adding Content by File Upload<\/span><\/div><\/a><\/div><\/div><\/div><\/div><\/div><\/section>\n","protected":false},"excerpt":{"rendered":"Introduction Organizations usually have web pages with content that lists their offerings such as product information or process knowledge which a user can query.\u00a0 As a business user you can\u00a0 leverage this content by mapping Search Assistant to it, and process it\u00a0 to be available for user\u00a0 queries. SearchAssist enables you to ingest content into...","protected":false},"author":18,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[74],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/posts\/1321"}],"collection":[{"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/comments?post=1321"}],"version-history":[{"count":25,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/posts\/1321\/revisions"}],"predecessor-version":[{"id":3641,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/posts\/1321\/revisions\/3641"}],"wp:attachment":[{"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/media?parent=1321"}],"wp
:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/categories?post=1321"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multisite.korebots.com\/SearchAssist\/wp-json\/wp\/v2\/tags?post=1321"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}