Botmation Documentation
Scrapers
Scrapers
These BotActions interact with the page through evaluated serialized code. They can scrape or manually run code within the page's context.
The scraper BotActions use a HTML parser to convert the outerHTML
property of an HTML Element(s) into an interactive object.
The default HTML parser is cheerio's load() function.
Override the default HTML parser by providing your own function to parse the
outerHTML
via higher-order paramhigherOrderHTMLParser
or the first inject. When both are present, the higher-order function will be used instead of the injected one.
HTML Parser
This higher-order BotAction will inject the provided HTML parser function of the first htmlParser()
call, into the assembled BotAction's of the second call htmlParser()()
.
A method for BotAction's implementing the ScraperBotAction
interface to override the default HTML parser.
Assembled BotAction's can override the injected HTML parser by providing their own via their own optional higher-order param.
const htmlParser = (htmlParser: Function) => (...actions: BotAction[]): BotAction => pipe()( inject(htmlParser)( errors('htmlParser()()')(...actions) ) )
Example:
await pipe()( htmlParser(customHTMLParser)( $('html body .user') ))(page)
The assembled $()
scraper will use the customHTMLParser
function instead of the default.
Scrape Single
This ScraperBotAction will parse the HTML of the first Element in the document that matches the provided selector.
const $ = <R = CheerioStatic>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> => async(page, injectedHTMLParser) => { let parser: Function
if (!higherOrderHTMLParser) { if (injectedHTMLParser) { parser = injectedHTMLParser } else { parser = cheerio.load } } else { parser = higherOrderHTMLParser }
const scrapedHTML = await page.evaluate(getElementOuterHTML, htmlSelector) return parser(scrapedHTML) }
Example:
await pipe()( goToDashboard, // fake -> load site with "dashboard" // look at the header for the badge UI representing # of notifications $('header .user .notifications .badge'), // by default, it returns a CheerioStatic instance map(notificationsCount => notificationsCount.text()), // grab the text from the "element" log('# of Notifications') // Pipe: 2)(page)
Scrape Multiple
This ScraperBotAction will parse the HTML of all Elements in the document that match the provided selector.
const $$ = <R = CheerioStatic[]>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> => async(page, ...injects) => { let parser: Function
// Future support piping the HTML selector with higher-order overriding const [,injectedHTMLParser] = unpipeInjects(injects, 1)
if (higherOrderHTMLParser) { parser = higherOrderHTMLParser } else { if (injectedHTMLParser) { parser = injectedHTMLParser } else { parser = cheerio.load } }
const scrapedHTMLs = await page.evaluate(getElementsOuterHTML, htmlSelector) const cheerioEls: CheerioStatic[] = scrapedHTMLs.map(scrapedHTML => parser(scrapedHTML)) return cheerioEls as any as R }
Example:
await pipe()( goToNewsFeed, // fake -> load site with "news feed" // selector to grab each "element" from the Feed container // Each element represents a Post in the Feed with children // elements representing data like Author, Date, Actual Text, etc $$('section.app .feed [data-id]'), // by default, it returns a CheerioStatic[] forAll()( // loop the CheerioStatic[] feedPostEl => ([ // scroll and "like" the post IF the post belongs to a friend // in the `friends` array (ie by name matching post authors) givenThat(postBelongsTo(feedPostEl, ...friends))( scrollTo(feedPostEl.attr('id')), like(feedPostEl) ) ]) ))(page)
Evaluate
This BotAction isn't a ScraperBotAction, but can be used to scrape and more. It's a simple BotAction that wraps Puppeteer Page's evaluate() method to execute serialized JavaScript in the Page.
const evaluate = (functionToEvaluate: EvaluateFn<any>, ...functionParams: any[]): BotAction<any> => async(page) => await page.evaluate(functionToEvaluate, ...functionParams)
The functionToEvaluate
is serialized by Puppeteer then injected into the context of the Page to be evaluated there.
Do not reference 3rd party libraries inside evaluated functions since they won't be available in the Page's context.
For an example, see the Navigation BotAction scrollTo()
Text Exists
This Conditional BotAction checks the Puppeteer page for the text provided. If found, it returns true
otherwise false
.
Checks the
document.documentElement
for eithertextContent
andinnerText
matching text provided
const textExists = (text: string): ConditionalBotAction => evaluate(textExistsInDocument, text)
For example:
givenThat(textExists('Do you want to save the password?'))( clickNo)
Element Exists
This Conditional BotAction checks the Puppeteer page DOM for a HTML Node element that matches the selector provided. If found, it returns true
otherwise false
.
const elementExists = (elementSelector: string): ConditionalBotAction => evaluate(elementExistsInDocument, elementSelector)
For example:
givenThat(elementExists('body .user .notifications'))( $('body .user .notifications'), log('Number of User Notifications'))
This will only attempt to scrape and log the User's notifications, if that element is found within the DOM.
Helpers
In order to "scrape" the page's document's HTML, these Helper functions are evaluated in the page's context. To avoid race conditions with the document's nodes, the outerHTML
property (string representation of the HTML element and children) is returned to the NodeJS context than parsed with a HTML parser.
getElementOuterHTML()
const getElementOuterHTML = (htmlSelector: string): string|undefined => document.querySelector(htmlSelector)?.outerHTML
getElementsOuterHTML()
const getElementsOuterHTML = (htmlSelector: string): string[] => Array.from(document.querySelectorAll(htmlSelector)).map(el => el.outerHTML)
textExistsInDocument()
const textExistsInDocument = (text: string): boolean => ( document.documentElement.textContent || document.documentElement.innerText ) .indexOf(text) > -1
elementExistsInDocument()
const elementExistsInDocument = (htmlSelector: string): boolean => document.querySelector(htmlSelector) !== null