Extracting multiple documents from one page in Sitecore Search

In many cases, Sitecore Search is used to extract one document per page, which works perfectly for standard content structures — blog posts, product pages, articles, and so on. But what happens when you have a single web page that includes multiple pieces of structured content you want to index individually?
We encountered this situation with product family pages, where multiple products are marketed on one page but should be discoverable as individual products in website search.
Covering the Basics
In this blog post, we create an advanced web crawler. If this is your first Sitecore Search crawler, consider reading this blog post first.

At this point, you should have your source in Sitecore Search set up and configured. The focus for getting multiple documents from a single page will be in the JavaScript document extractor.
Writing the Document Extractor
The key to achieving our goal is that the function in the tagger of a JavaScript document extractor returns an array. The function is called once for every crawled page, but each call can return as many documents to the search index as we want.
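Before walking through the real extractor, the idea can be sketched without the crawler at all. The snippet below is a minimal illustration, not the extractor used on our site: `extractDocuments` and the `page` object with its `sections` list are hypothetical stand-ins for the parsed page, chosen so the example runs in plain Node.js.

```javascript
// Hypothetical stand-in for a crawled page: one URL, several content sections.
// In a real extractor this data would come from the parsed HTML in response.body.
function extractDocuments(page) {
  var result = [];

  // One document for the page itself.
  result.push({ id: page.id, url: page.url, name: page.title });

  // One additional document per section on the same page.
  page.sections.forEach(function (section) {
    result.push({
      id: page.id + '_' + section.id,
      url: page.url + '#' + section.id,
      name: section.title
    });
  });

  // A single call returns all documents at once.
  return result;
}

var docs = extractDocuments({
  id: 'family',
  url: 'https://example.com/family',
  title: 'Product Family',
  sections: [
    { id: 'alpha', title: 'Product Alpha' },
    { id: 'beta', title: 'Product Beta' }
  ]
});
console.log(docs.length + ' documents from one page');
```

One call for one page yields three documents here: the page itself plus one per section.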
We start off with the function signature and define the result array, which we’ll fill during execution. Additionally, we read the page URL and extract a content type ID from the page metadata to identify Product & Service pages.
function extract(request, response) {
  $ = response.body;
  var result = [];
  var pageUrl = $('meta[property="og:url"]').attr('content')?.toLowerCase();
  var pageContentTypeId = $('meta[name="content_type_id"]').attr('content') || 'content';
  [...]
}
The code includes a switch statement to handle different content types. In general, a Product & Service page represents a standalone product or service. Only when the page contains a Tab Navigation component can we assume it’s a Product Family page, where the individual tabs represent sub-products.
  switch (pageContentTypeId) {
    [...]
    case 'product_services':
      $productTabs = $('#content')
        .find('div.tab-navigation')
        .find('div.tab-content')
        .children('div');
      if ($productTabs.length > 0) {
        // When there are tabs, each tab is a sub-product or sub-service.
        // We set a constant product family name
        // so that we can use it to prefilter the search results
        productFamilyName = 'Product Family';
        $productTabs.each((i, elem) => {
          var productTabTitle = $(elem).attr('title');
          var productTabImageUrl = $(elem).find('img').attr('src');
          // For sub-products and sub-services, we want to use
          // the first header in the tab as the description
          var productTabDescription = $(elem).find('h2, h3, h4').first().text();
          var tabid = $(elem).attr('id');
          // Strip everything up to and including "-tabpane-" to get the link ID
          let regex = /.*?-tabpane-/g;
          var linkid = tabid.replace(regex, '');
          var linkhref = $(elem).data('href') || '#' + linkid;
          var productTabUrl = pageUrl + linkhref;
          productTabUrl = productTabUrl?.toLowerCase();
          // Add the sub-product to the results
          result.push({
            id: tabid.replaceAll(/[:\.?\-/#]+/g, '_'),
            url: productTabUrl,
            page_url: pageUrl,
            type_id: pageContentTypeId,
            name: productTabTitle,
            description: productTabDescription,
            image_url: productTabImageUrl
          });
        });
      }
      break;
    [...]
  return result;
}
The additional products are pushed into the result array and will be indexed as individual documents. Be sure each item has a unique ID, or the indexing will fail.
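Because a duplicate ID breaks indexing, it can help to check the array before returning it. The following is a rough sketch under our own assumptions: `toDocumentId` and `findDuplicateIds` are hypothetical helpers, not part of Sitecore Search, and prefixing the tab ID with a page-level value is only one possible way to keep IDs distinct if the same tab ID could occur on more than one page.

```javascript
// Hypothetical helper: build an index-safe ID from a page-level prefix and a tab ID,
// replacing the same characters as the extractor's ID logic.
function toDocumentId(prefix, tabId) {
  return (prefix + '_' + tabId).replace(/[:.?\-/#]+/g, '_');
}

// Hypothetical helper: return every ID that occurs more than once in a result array.
function findDuplicateIds(documents) {
  var seen = new Set();
  var duplicates = new Set();
  documents.forEach(function (doc) {
    if (seen.has(doc.id)) {
      duplicates.add(doc.id);
    }
    seen.add(doc.id);
  });
  return Array.from(duplicates);
}

var sample = [
  { id: toDocumentId('family-page', 'tab-1') },
  { id: toDocumentId('family-page', 'tab-2') },
  { id: toDocumentId('family-page', 'tab-1') }
];
console.log(findDuplicateIds(sample));
```

An empty array from `findDuplicateIds` means every document in the result has its own ID.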
Checking the End Result
When we now search for "rocorr" on our website, the search results list the product family page first, followed by the individual products from that page. Each result links directly to the relevant tab within the page via a deep link.
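How such a deep link is derived from a tab's ID can be shown in isolation. This is a small sketch mirroring the URL logic of the extractor; `buildTabUrl` is a hypothetical helper name, and the example URL and tab ID are made up for illustration.

```javascript
// Hypothetical helper mirroring the extractor's URL logic:
// strip everything up to and including "-tabpane-" from the tab ID and use
// the remainder as the fragment, unless an explicit href is provided.
function buildTabUrl(pageUrl, tabId, dataHref) {
  var linkId = tabId.replace(/.*?-tabpane-/, '');
  var linkHref = dataHref || '#' + linkId;
  return (pageUrl + linkHref).toLowerCase();
}

// Fragment derived from the tab ID
console.log(buildTabUrl('https://example.com/products/family', 'content-tabpane-Alpha'));
// Explicit href taking precedence over the derived fragment
console.log(buildTabUrl('https://example.com/products/family', 'content-tabpane-Beta', '?tab=beta'));
```

The first call produces a URL ending in `#alpha`; the second uses the explicit `?tab=beta` instead of a fragment.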

Navigating to a search result will show the product listed in the tab component.

Bonus: Search-Powered Components
Now that all products are indexed, we can also create components powered by Sitecore Search. On our website, we built a product finder application that uses Sitecore Search as its data source. This allows us to leverage the full power of search — personalization, facets, filtering, and more.

Summary
Extracting multiple documents from one page is not much more complex than extracting a single document. The JavaScript-based advanced web crawler provides full control over what gets indexed and supports multiple languages.
If you want to learn more about multilingual support in Sitecore Search, consider reading this blog post next.
