Tuesday, May 13, 2014

Get data from a Web site with Excel, Power Query and Kimono

You surely aware about it, Power Query is an Excel add-in published by Microsoft via its Power BI new technology, dedicated to the importation of data from multiple sources.

If we are numerous to see in Power Query a tool with an exceptional potential, there is a field where it is still rather weak: the importation of data displayed on a Web page. Finally it is not really Power Query which is weak, it is rather than after 2500 lines of Javascript, the final HTML of the sites is often completely not exploitable…

To surmount the problem there is a fantastic tool called kimono. With kimono, you will generate in a completely graphic way, an API from a Web site directly from of your browser.
To understand the subject we can take an example, data since MetaCritic Web Site will be used, the Web site which incorporates the notes given in the press (paper or Web) about the video games.

We will start with trying to connect directly to MetaCritic via Power Query, to note the inutilisability of the thing:

1 – I launch Excel, Power Query Menu, option Import from Web, then I write the address of the site source:

2 – Oups, the browser of Power Query does not leave himself there all alone, we will Edit it: 

3 – And boom, good luck! If somebody found a method, I am taking, me I give up there in general…

So let’s connect to kimono (after the subscribtion, we add the bookmarklet to its bar of shortcut) and it will create the API with a graphic way, let’s start!

1 – I open my browser, I go on the site source and I use kimonify (the bookmarklet). The tool bar Kimono appears at the top of the page, and I start by importing the names of the games right while clicking on them in the page of MetaCritic. The engine of Kimono recognizes attribute HTML then and identifies the 88 names, I rename the field “Name of the game”:

2 – I press on + in the bar Kimono, I select the first note (92 for DS2), the second (88 for Hearthstone), and again Kimono identifies the 88 following values. I rename the field “Notes”:

3 – I repeat the manipulation with the Score User, this time the engine of Kimono hesitates a little, it proposes several series to me and by select other scores and to refuse other attributes (as well as the label “To use Score”), it finds me 85 (indeed 3 games do not have this score, but that does not break the recognition):

4 – I have enough information for the moment, I validate (Done in the bar Kimono), I give a name to my API, and a period of refreshing (real time for the demonstration, but one can reduce the load on the source in refreshing the data set only periodically):

5 – And Kimono returns me to the dashboard of my API:

6 – Via the menu “How To Use” I find the necessary elements to reach the API, including the URL (endpoints) which I will be able to transmit to Power Query in JSON, CS or RSS:

7 – We can test EndPoint CSV (yeah I am oldschool) immediately:

8 – But it’s better to call it directly from Power Query (From Web always):

9 – And after some small handling (we delete the first line, we separate the columns by delimiter, we use the first line like a name of column, we clean User Score, we filter the lines of waste), we obtain the good dataset. Note: if you choose the JSON and It’s not working, it is that the API is badly formed, best is to break it and start again (Kimono is still in beta).

10 – to obtain the expected data in Excel:

Personally I find that just enormous! And it’s all in live, Query in Excel and the API side Kimono, therefore it is enough to refresh to get the updated data from the source.

No comments:

Post a Comment