Web data import to Tableau
- Rosablanche
- Mar 8, 2021
- 3 min read
Updated: Mar 9, 2021
So I needed to reuse data from a table in a web page to support a visualisation in Tableau and I thought surely there's a way of achieving already available in standard.
Well, no...
The web data connector available in the list of standard connection doesn't do that: a simple data extraction from a webpage.
I did a bit of research and found on article on import.io that seemed to do just that. However, it didn't work with Tableau Public or Tableau 2020.1. I tried to create a new account with import.io but was informed that:
As of August 19th, 2020 we have suspended the ability to create new Import.io accounts.
So, back to square one.
Now, Tableau can connect to Google Sheets or upload a file in Excel or csv format; it should also be possible to extract the data from a web table to the formats above and then process it in Tableau.
So let's try to do just that.
I found this blog post:
It explains how to extract data from a web page with the importxml function in Google sheets.
I'm trying to extract the list of most popular websites and some metadata from Wikipedia (link here).
I'm interested in columns: site, domain, type, country.

Most useful info from the blog post above:
a tag with [1] means "only give me the first instance of <tag> inside <parent tag>." So, td/span/a[1] gives you the first link inside the <span> inside each <td>. In the same way, td/b[1] gives you the very first bold text inside each <td>
So I try to replicate this logic with my table, but I need to understand the xml structure first.
In Chrome, right-mouse click in a table cell and select "Inspect" to view the code behind.
I click on the first cell of the table that contains "Google Search and inspect it.

The syntax of the function is: IMPORTXML(url; xpath query). Both arguments are written between quotation marks " ".
The xpath query is written: "//<parent tag>/<child tag>[column index]".
Tags are written without the <>.
In this example, you see that the site name is the the first link <a> of the <td> within a <tr>. It's also the first <td> within a <tr>. So you could write the the xpath query two ways: "//tr/td/a[1]" or "//tr/td/a[1]".
To extract site, I type the formula below in cell A1:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td/a[1]")The domain is the <td> number 2 within a <tr>. To extract domain, I type the formula below in cell A2:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[2]")The type is the <td> number 4 within a <tr>. To extract type, I type the formula below in cell A3:
=importxml("https://en.wikipedia.org/wiki/List_of_most_popular_websites";"//tr/td[4]")and here is the result:

It is time now to import the data into Tableau.
Open up any version of Tableau and go to the Connect panel.

Click on Google sheets.
This will open a new tab in your browser where you'll be asked to authenticate (you need to have a google account btw).
Once authenticated, a popup appears with the list of your g-sheets.
Select the one you're interested in and click "Connect".
Tableau displays the list of worksheet available in your g-sheet.
Double-click on the one your interested in and start exploring your data !
I'm not really satisfied with this process, it's way too complicated but that's all I could find after a quick search. I will do more digging in the near future.
For the record, things are so much easier with PowerBI. 3 clicks and you're done!
Here's how:

In the "Get data" panel, select "Web"


Enter the website URL
PowerBI finds the table and identifies the columns
So much easier, no ?


link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link link