There has been a video making the rounds on Twitter and other social platforms showcasing the use of the “import website data” feature inside Microsoft Excel.
But, as it happens, this feature is also available inside Google Sheets, so I thought I’d make a write-up to explain a little bit more about how it works and how to use it.
The feature to scrape (import) website data in Google Sheets is called ImportHTML.
An alternative function you can use is ImportXML, which supports importing data into Google Sheets from external URLs in XML, HTML, CSV, TSV, RSS, and ATOM XML formats.
The two functions take different arguments: ImportHTML takes a URL, a query type ("table" or "list"), and a numeric index, while ImportXML takes a URL and an XPath query:
IMPORTHTML("URL", "query", index)
IMPORTXML("URL", "XPath query")
If you want to follow along you can start a new spreadsheet by going to sheets.new.
How to access ImportXML/HTML in Google Sheets
In order to start a new ImportXML/HTML query you need to go to:
Insert -> Function -> Web -> ImportXML/HTML.
And now you can write your scraping function: an XPath query for ImportXML, or a table/list query plus index for ImportHTML.
What is XPath?
XPath is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium and can be used to compute values from the content of an XML document. (Wikipedia)
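To get a feel for the syntax, here are a few common XPath expressions and what they select (these are illustrative only, not tied to any particular site):
/* //h2          every <h2> heading on the page             */
/* //a/@href     the href attribute of every link           */
/* (//table)[3]  the third <table> element in the document  */
=IMPORTXML("https://example.com", "//h2")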
So, because the Twitter video was using Wikipedia as an example, let’s do the same for this demo.
Importing Wikipedia table data into Google Sheets
I’m going to use this F1 driver records page to import table data about the drivers with the most wins in Formula 1. I’ll use the ImportHTML function because it saves me from having to tinker with formatting.
Step 1: Finding out the table number
Wikipedia does not assign an ID to any of its tables. So, you have two options: count through the tables on the page to work out the position of the one you want (boring), or use DevTools to quickly inspect the full XPath to a specific element on the page; in our case, the table of drivers with the most wins.
Go to the Wikipedia page I linked and find the Wins table. Next, right-click on the table and choose Inspect to highlight the table’s element in DevTools. From there, right-click the highlighted element and go to Copy -> Copy full XPath.
Once you do this, your clipboard will have the following string inside it:
/html/body/div[3]/div[3]/div[5]/div[1]/table[12]
/* this string is useful when you're doing full XML imports */
The string we care about for this demo is table[12]. This tells us that the table listing the drivers with the most wins is number 12 on the page. So, we can go back to Google Sheets and write our function to pull that table’s data into our spreadsheet.
Step 2: Writing our query function to fetch the table data
We can now use ImportHTML to fetch the data we want. Here is what the full function looks like:
/* remember to use quotation marks ("") */
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Formula_One_driver_records", "table", 12)
And here is the result after executing it:
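If you only need part of that table, one approach (a sketch, and only one of several ways to do it) is to wrap the import in QUERY and select the columns you care about; the column numbers and row limit below are placeholders to adjust:
/* keep only the first two columns and the first 10 data rows of table 12 */
=QUERY(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Formula_One_driver_records", "table", 12), "select Col1, Col2 limit 10", 1)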
The possibilities for using these functions are pretty much endless. For example, let’s say you wanted to import all external URLs on a particular page into your spreadsheet.
In that case, you could use the ImportXML function and this XPath query:
/* this will import all the links on a given page except links
that point back to the domain you're querying */
=IMPORTXML("https://google.com", "//a[not(contains(@href, 'google.com'))]/@href")