Extracting multiple documents from one page in Sitecore Search

In many cases, Sitecore Search is used to extract one document per page, which works perfectly for standard content structures — blog posts, product pages, articles, and so on. But what happens when you have a single web page that includes multiple pieces of structured content you want to index individually?

We encountered this situation with product family pages, where multiple products are marketed on one page but should be discoverable as individual products in website search.

Covering the Basics

In this blog post, we create an advanced web crawler. If this is your first Sitecore Search crawler, you may want to read the following post first.

Your first Sitecore Search Crawler

At this point, you should have your source in Sitecore Search set up and configured. The focus for getting multiple documents from a single page will be in the JavaScript document extractor.

Writing the Document Extractor

The key to achieving our goal is that the function in the tagger of a JavaScript document extractor returns an array. The function is called once for every crawled page, but each call can return as many documents to the search index as we want.

💡
If you return multiple objects from a content tagger, they must have unique IDs.
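To catch ID collisions before they reach the index, a small check can be run on the array while testing the extractor. The following is a minimal sketch; `findDuplicateIds` is a hypothetical helper of our own and not part of Sitecore Search, and the sample documents are made up for illustration.

```javascript
// Hypothetical helper, not part of the Sitecore Search API.
// Returns every ID that occurs more than once in an extractor result array.
function findDuplicateIds(documents) {
  var seen = new Set();
  var duplicates = new Set();
  documents.forEach(function (doc) {
    if (seen.has(doc.id)) {
      duplicates.add(doc.id);
    }
    seen.add(doc.id);
  });
  return Array.from(duplicates);
}

// Sample data for illustration only.
var sample = [
  { id: 'family_tabpane_alpha', name: 'Alpha' },
  { id: 'family_tabpane_beta', name: 'Beta' },
  { id: 'family_tabpane_alpha', name: 'Alpha (duplicate)' }
];

console.log(findDuplicateIds(sample)); // [ 'family_tabpane_alpha' ]
```

An empty array means every document in the result has its own ID.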

We start off with the function signature and define the results array, which we’ll fill during execution. Additionally, we extract a content type ID from the page metadata to identify Product & Service pages.

function extract(request, response) {
  $ = response.body;

  var result = [];
  
  var pageUrl = $('meta[property="og:url"]').attr('content')?.toLowerCase();
  var pageContentTypeId = $('meta[name="content_type_id"]').attr('content') || 'content';
  
  [...]
}

The code includes a switch statement to handle different content types. In general, a Product & Service page represents a standalone product or service. Only when the page contains a Tab Navigation component can we assume it’s a Product Family page, where the individual tabs represent sub-products.

switch (pageContentTypeId) {

  [...]
    
  case 'product_services':
    var $productTabs = $('#content')
      .find('div.tab-navigation')
      .find('div.tab-content')
      .children('div');
    if ($productTabs.length > 0) {
      // When there are tabs, each tab is a sub-product or service.
      // We set a constant product family name
      // so that we can use it to prefilter the search results
      productFamilyName = 'Product Family';
      $productTabs.each((i, elem) => {
        var productTabTitle = $(elem).attr('title');
        var productTabImageUrl = $(elem).find('img').attr('src');

        // For sub products and services, we want to use
        // the first header in the tab as the description
        var productTabDescription = $(elem).find('h2, h3, h4').first().text();

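        var tabid = $(elem).attr('id');
        // Skip tabs without an id: the id is the basis of the unique document ID
        if (!tabid) return;
        let regex = /.*?-tabpane-/g;
        var linkid = tabid.replace(regex, '');
        // Prefer an explicit data-href, otherwise deep link to the tab anchor
        var linkhref = $(elem).data('href') || '#' + linkid;
        var productTabUrl = (pageUrl + linkhref).toLowerCase();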

        // Add the sub product to the results
        result.push({
          id: tabid.replaceAll(/[:\.?\-/#]+/g, '_'),
          url: productTabUrl,
          page_url: pageUrl,
          type_id: pageContentTypeId,
          name: productTabTitle,
          description: productTabDescription,
          image_url: productTabImageUrl
        });
      });
    }
    break;

  [...]
}

return result;

The additional products are pushed into the result array and will be indexed as individual documents. Be sure each item has a unique ID, or the indexing will fail.
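To make the ID and deep-link logic easier to follow, here is the same derivation isolated as a plain function. This is only a sketch: `buildTabIdentity` is a hypothetical helper, and the tab ID format `family-tabpane-Alpha` and the example.com URL are assumptions for illustration. Your markup may produce different IDs.

```javascript
// Hypothetical helper that mirrors the ID and URL logic of the extractor.
function buildTabIdentity(pageUrl, tabId, dataHref) {
  // Strip everything up to and including "-tabpane-" to get the anchor name
  var linkId = tabId.replace(/.*?-tabpane-/g, '');
  // Prefer an explicit href, otherwise fall back to the tab anchor
  var linkHref = dataHref || '#' + linkId;
  return {
    // Replace runs of : . ? - / # with a single underscore
    id: tabId.replaceAll(/[:\.?\-/#]+/g, '_'),
    url: (pageUrl + linkHref).toLowerCase()
  };
}

// Assumed inputs for illustration only.
var identity = buildTabIdentity(
  'https://www.example.com/products/family',
  'family-tabpane-Alpha',
  undefined
);

console.log(identity.id);  // family_tabpane_Alpha
console.log(identity.url); // https://www.example.com/products/family#alpha
```

Because the document ID is derived from the tab's element ID, it is only as unique as those element IDs are across the whole site, not just within one page.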

Checking the End Result

When we now search for rocorr on our website, the search results list the product family page first, followed by the individual products from that page. Each result links directly to the relevant tab within the page via a deep link.

Navigating to a search result will show the product listed in the tab component.

Bonus - Search-powered components

Now that all products are indexed, we can also create components powered by Sitecore Search. On our website, we built a product finder application that uses Sitecore Search as its data source. This allows us to leverage the full power of search — personalization, facets, filtering, and more.

Summary

Extracting multiple documents from one page is not much more complex than extracting a single document. The JavaScript-based advanced web crawler provides full control over what gets indexed and supports multiple languages.

If you want to learn more about multilingual support in Sitecore Search, consider reading this blog post next.

Dealing with multiple languages in Sitecore Search with XM Cloud