Botmation Documentation

Scrapers

Scrapers

These BotActions interact with the page through evaluated serialized code. They can scrape or manually run code within the page's context.

The scraper BotActions use a HTML parser to convert the outerHTML property of an HTML Element(s) into an interactive object.

The default HTML parser is cheerio's load() function.

Override the default HTML parser by providing your own function to parse the outerHTML via higher-order param higherOrderHTMLParser or the first inject. When both are present, the higher-order function will be used instead of the injected one.

HTML Parser

This higher-order BotAction will inject the provided HTML parser function of the first htmlParser() call, into the assembled BotAction's of the second call htmlParser()().

A method for BotAction's implementing the ScraperBotAction interface to override the default HTML parser.

Assembled BotAction's can override the injected HTML parser by providing their own via their own optional higher-order param.

const htmlParser = (htmlParser: Function) =>
(...actions: BotAction[]): BotAction =>
pipe()(
inject(htmlParser)(
errors('htmlParser()()')(...actions)
)
)

Example:

await pipe()(
htmlParser(customHTMLParser)(
$('html body .user')
)
)(page)

The assembled $() scraper will use the customHTMLParser function instead of the default.

Scrape Single

This ScraperBotAction will parse the HTML of the first Element in the document that matches the provided selector.

const $ = <R = CheerioStatic>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> =>
async(page, injectedHTMLParser) => {
let parser: Function
if (!higherOrderHTMLParser) {
if (injectedHTMLParser) {
parser = injectedHTMLParser
} else {
parser = cheerio.load
}
} else {
parser = higherOrderHTMLParser
}
const scrapedHTML = await page.evaluate(getElementOuterHTML, htmlSelector)
return parser(scrapedHTML)
}

Example:

await pipe()(
goToDashboard, // fake -> load site with "dashboard"
// look at the header for the badge UI representing # of notifications
$('header .user .notifications .badge'),
// by default, it returns a CheerioStatic instance
map(notificationsCount => notificationsCount.text()), // grab the text from the "element"
log('# of Notifications') // Pipe: 2
)(page)

Scrape Multiple

This ScraperBotAction will parse the HTML of all Elements in the document that match the provided selector.

const $$ = <R = CheerioStatic[]>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> =>
async(page, ...injects) => {
let parser: Function
// Future support piping the HTML selector with higher-order overriding
const [,injectedHTMLParser] = unpipeInjects(injects, 1)
if (higherOrderHTMLParser) {
parser = higherOrderHTMLParser
} else {
if (injectedHTMLParser) {
parser = injectedHTMLParser
} else {
parser = cheerio.load
}
}
const scrapedHTMLs = await page.evaluate(getElementsOuterHTML, htmlSelector)
const cheerioEls: CheerioStatic[] = scrapedHTMLs.map(scrapedHTML => parser(scrapedHTML))
return cheerioEls as any as R
}

Example:

await pipe()(
goToNewsFeed, // fake -> load site with "news feed"
// selector to grab each "element" from the Feed container
// Each element represents a Post in the Feed with children
// elements representing data like Author, Date, Actual Text, etc
$$('section.app .feed [data-id]'),
// by default, it returns a CheerioStatic[]
forAll()( // loop the CheerioStatic[]
feedPostEl => ([
// scroll and "like" the post IF the post belongs to a friend
// in the `friends` array (ie by name matching post authors)
givenThat(postBelongsTo(feedPostEl, ...friends))(
scrollTo(feedPostEl.attr('id')),
like(feedPostEl)
)
])
)
)(page)

Evaluate

This BotAction isn't a ScraperBotAction, but can be used to scrape and more. It's a simple BotAction that wraps Puppeteer Page's evaluate() method to execute serialized JavaScript in the Page.

const evaluate = (functionToEvaluate: EvaluateFn<any>, ...functionParams: any[]): BotAction<any> =>
async(page) => await page.evaluate(functionToEvaluate, ...functionParams)

The functionToEvaluate is serialized by Puppeteer then injected into the context of the Page to be evaluated there.

Do not reference 3rd party libraries inside evaluated functions since they won't be available in the Page's context.

For an example, see the Navigation BotAction scrollTo()

Text Exists

This Conditional BotAction checks the Puppeteer page for the text provided. If found, it returns true otherwise false.

Checks the document.documentElement for either textContent and innerText matching text provided

const textExists = (text: string): ConditionalBotAction =>
evaluate(textExistsInDocument, text)

For example:

givenThat(textExists('Do you want to save the password?'))(
clickNo
)

Element Exists

This Conditional BotAction checks the Puppeteer page DOM for a HTML Node element that matches the selector provided. If found, it returns true otherwise false.

const elementExists = (elementSelector: string): ConditionalBotAction =>
evaluate(elementExistsInDocument, elementSelector)

For example:

givenThat(elementExists('body .user .notifications'))(
$('body .user .notifications'),
log('Number of User Notifications')
)

This will only attempt to scrape and log the User's notifications, if that element is found within the DOM.

Helpers

In order to "scrape" the page's document's HTML, these Helper functions are evaluated in the page's context. To avoid race conditions with the document's nodes, the outerHTML property (string representation of the HTML element and children) is returned to the NodeJS context than parsed with a HTML parser.

getElementOuterHTML()

const getElementOuterHTML = (htmlSelector: string): string|undefined =>
document.querySelector(htmlSelector)?.outerHTML

getElementsOuterHTML()

const getElementsOuterHTML = (htmlSelector: string): string[] =>
Array.from(document.querySelectorAll(htmlSelector)).map(el => el.outerHTML)

textExistsInDocument()

const textExistsInDocument = (text: string): boolean =>
( document.documentElement.textContent || document.documentElement.innerText )
.indexOf(text) > -1

elementExistsInDocument()

const elementExistsInDocument = (htmlSelector: string): boolean =>
document.querySelector(htmlSelector) !== null
Edit this page on GitHub
Baby Bot