XPath: The perfect DOM selector for scrapers, bots, and tests

Over the past couple of years I've run into many websites that offer no programmatic access to their information, so I've frequently found myself building bots to extract, organize, and analyze it. Along the way I've found one fairly old-school tool that works wonders for picking out exactly the data I need: XPath.

What is XPath?

XPath (XML Path Language) is a language that is used to navigate through the elements and attributes of an XML document. It allows you to select specific elements or attributes based on their element name, attribute value, or other information about the element.

It uses path notation (which is similar to a file system path) to identify the elements and attributes in an XML document. For example, the path //element/subelement selects all subelement elements that are children of element elements, regardless of where they appear in the document.

XPath is not only a W3C recommendation but also a major component of the XSLT standard, which basically means it's here to stay.

More importantly, it has a much more robust set of functions than CSS that make it easy to select elements even by text content.

How can I use it in my browser?

Real simple: open the developer console and call $x(xpathSelector), which returns an array of the matching nodes. For example, if you need a div that contains the word hello, you can just run $x('//div[contains(.,"hello")]'). I know, it's very powerful.
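Keep in mind that $x is a console utility, so it is not available to ordinary page scripts or injected bot code. The standard document.evaluate API covers the same ground there. Here is a minimal sketch, assuming a browser environment; xpathAll is a hypothetical helper name I made up for illustration, not a built-in:

```javascript
// Hypothetical helper that mimics the console's $x in ordinary page scripts.
// Assumes a browser, or any object exposing the DOM `evaluate` method.
function xpathAll(expression, root = document) {
  // `evaluate` lives on the Document, so look it up when root is an element.
  const doc = root.ownerDocument || root;
  // 7 is XPathResult.ORDERED_NODE_SNAPSHOT_TYPE: a static list in document order.
  const snapshot = doc.evaluate(expression, root, null, 7, null);
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

With that in place, xpathAll('//div[contains(., "hello")]') should behave much like the $x call above. When you pass an element as root, start the expression with .// to stay inside that element, because a leading // always searches the whole document.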

If you want to see more examples of how to use it, look at w3schools.com and this useful cheatsheet.

Where should I use this?

Web Scraping. XPath can extract information from web pages served as HTML or XML, and it is arguably easier to use than CSS selectors when you need to match on text content.
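One practical wrinkle when scraping: the text you match on often comes from data, and XPath 1.0 string literals have no escape character, so a value containing both quote styles has to be assembled with concat(). The sketch below shows one way to handle that. Both function names are hypothetical helpers of mine, not part of any library:

```javascript
// Hypothetical helpers; the names are illustrative only.

// Turn arbitrary text into an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so text containing both quote
// kinds is stitched together with concat().
function xpathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Wrap each chunk in double quotes and rejoin with a single-quoted ".
  const separator = ", '\"', ";
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(separator)})`;
}

// Build an expression matching <tag> elements whose text contains `text`.
function containsText(tag, text) {
  return `//${tag}[contains(., ${xpathLiteral(text)})]`;
}
```

For instance, containsText('div', 'hello') produces //div[contains(., "hello")], which you can paste straight into $x.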

End-to-end testing. XPath is also great for inspecting DOM state without adding any noise, such as test-only ids or attributes, to your rendered markup.
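As a rough illustration of what such a check could look like, the sketch below counts matches with XPath's count() function and fails when the number is not what the test expects. It assumes the code runs inside the page under test. xpathCount and expectCount are hypothetical names, not functions from any test framework:

```javascript
// Hypothetical test helpers; assumes a browser-like `document`.

// Ask the XPath engine how many nodes match `expression`.
function xpathCount(expression, root = document) {
  const doc = root.ownerDocument || root;
  // 1 is XPathResult.NUMBER_TYPE, which suits the number count() returns.
  const result = doc.evaluate(`count(${expression})`, root, null, 1, null);
  return result.numberValue;
}

// Throw when the page does not contain the expected number of matches.
function expectCount(expression, expected, root = document) {
  const actual = xpathCount(expression, root);
  if (actual !== expected) {
    throw new Error(`${expression}: expected ${expected} match(es), found ${actual}`);
  }
}
```

A test could then call something like expectCount('//li[contains(@class, "error")]', 0) after submitting a form, without the markup needing any extra hooks.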

Not for web development. According to caniuse.com, XPath is available in pretty much every browser except IE, but you should use it sparingly for web development because it is slower. Nowadays most, if not all, of the tracking of DOM elements is taken care of for us by the virtual DOM that tools like React provide.