Web scraping in Power BI | Community Webinar

Web scraping in Power BI | Community Webinar


>>Hello. Good morning
or good evening, depending on where
you’re all located. Thank you for joining
our Power BI Webinar, and my name is Deepak Shankar. I’m a Community Manager within the Microsoft business
application group. Today, we have Indira Bandari with us who will walk you through a new
Power BI features and steps. Indira, it’s all yours now.>>Thanks Deepak. Today’s Webinar is about Web Scraping in Power BI. Web scraping can be
done in a lot of ways. But before I proceed, I just want to share with you
just a small snippet of who I am. I’m a Microsoft Data Platform MVP, and I have a Masters
degree in Statistics, and I’ve been working on
SQL Server since 2002. I have gone through all the changes that SQL
Server has gone through, and I’m a speaker and a co-organizer, mainly speak in New
Zealand and Australia. In my pastime, I love teaching kids. I started off teaching kids
with game development in 2016, and every year, I do some
technology teaching for kids. For the past two years, I’ve been
teaching Power BI to the kids, and it has been an
awesome experience. You can see my details if
you want to connect with me on the left-hand
side. So moving on. Today’s agenda is, I want
to cover it a little bit of what Web Scraping is, and what use cases will be common
scenarios of Web Scraping. Also, the different types of Web Scraping that can
be done using Power BI. So what is Web Scraping? Web Scraping is a process used
to extract data from websites. If we look at the common scenarios, it can be used to build a
price comparison Website. We can use Web Scraping
to collect prices, product description, and images. This data can be used
to gain insights, as well as for comparison purposes. Also, the next scenario, you can build a product
catalog automatically. The next scenario is you can use Web Scraping to
do competitor analysis, so to extract the competitor data, and also customer sentiment
as to how people are feeling about the product in a
structured and usable format. You can use Web Scraping for financial data analysis to get financial statements
in a useful format, and also to analyze the same
data to gain some insights. The next scenario you can
use is LEAD GENERATION. So together contact
details like e-mail ID, phone numbers for business purposes, or even individual purposes, some of the non-profit
organizations can also use Web Scraping to gather this
data and then generate leads. It can also be used for brand
monitoring and reputation. So you can use Web Scraping to
actively build brand intelligence, and monitor the brand preparation to understand how customers feel
about a product or a service. Some other scenarios are like
gathering data for machine learning. Machine learning
requests lots of data. So Web Scraping can be used to gather huge amounts of data across
millions of websites. Without Web Scraping, it is humanly impossible to
do machine learning. Some of the job boards that
you see, the job listings, they also use the Web Scrapings
to extract the data and display the job listings in a way that is more understandable
to the users. Also, I think this can be
used for MAP compliance. MAP compliance, you might
wonder what MAP is. Map is Minimum Advertised Price. So it needs some compliance, and it needs to be tracked. That publicly advertised
price across online stores, it can be monitored. So monitoring retail channels
is important for manufacturers, and this can be easily
done by Web Scraping. Otherwise, it is very
hard to do this manually. Also, the social media analysis. So for example, if you want to
know what people are thinking, you offer your product in Facebook, or Instagram, or Twitter, or other social media. You can use Web Scraping, and then you can analyze, gain insights from the data you
data gathered using Web Scraping, and get the customer sentiments, and also text analysis. So similarly, the news monitoring, if you want to monitor specific
content from all news channels, or even hashtags, you
can do that using the Web Scraping technique. Also, search engine optimization. If you want to monitor the
search results for your product, you can use the Web Scraping to do the search engine
optimization as well. You can gather the
search engine results and then see whether the
trend is going up or down, based on that, you can tweak
your tags and things like that. These are the common scenarios that I have shared with you
now for Web Scraping. Moving on, let’s start with the
simple Web Scraping scenario, I’ll pick up this link
that you are seeing. This link is the population of
the United States by country, so I’m just taking the United States. In this scenario, we are
going to use Web Connector. Let’s get into the demo part. I’m just going to copy this URL. I think, I have already, yeah. This is the URL that I have pasted, and this is the Website. From this, if you
look at this Website, so we’re going to scrape
this table in to Power BI. How do I do that? I just copy this link here, the URL, and then go into Power BI. When you open Power BI,
you get “Get Data”, so just go “Get Data”, and then you get the
Get Data dialog box. Here, you can’t see because
there are a lot of connectors, you can’t see the Web Connector. In order to search for
the Web Connector, just go here, and then search. Then once you search, you
get the Web Connector, just double-click on that, and paste your URL here. If you click on “Okay”. It’s taking a little while. It will display the
divisions on the HTML page. Hopefully, it’s not
taking too much time. You have Table 0, just you can preview. Table 0 is not what we want. Document is just CSS, and if you go to Table 1, so this is what we want. So once you preview this one, see how this is not highlighted. If you tick this one, only then it gets highlighted
and then you can just load the data or if you click
on “Transform Data.” If you want to do some
manipulation you can click on “Transform Data” and
then have a look here. If you want to remove some data, you can remove “Change the
datatype” you can do that, and the table will
be in your Power BI. So I can just say, this is ws simple, so you will get the data. So I will stop here for
the first scenario, which is using the Web Connector. So now, I’ll just go
back to the Power BI, so we have done this one. Next scenario is Get data by example. So again, I am going
to copy this URL, and let’s see what the URL contains. So what this contains is, this has a proper way of displaying
three hourly observations. This is the Met service
data for New Zealand, for Northland and Auckland. We have some towns here; Cape Karikari, Cape Reinga, and things like that. So I want to get this into Power BI. So we’ll use the same method. So I’ll just go into Power BI. Again, go into get data and then
use the Web Connector again, and then paste the URL here. Once you do that, you
would expect that to display the
information like a table, like how we have seen in
the previous scenario. But you are not getting that, because the Web Connector is not
that polished from Microsoft, so we need to do some modifications. So what Microsoft has provided is, you can add a table using examples. So that’s what I will click. See here there is a small
button on the navigator. So I’m going to just click that, and what it does is
it’s going to give me a web view and a column view. So this is the web view, and you can scroll through
and see what you are seeing. So I want to get these ones. So Cape Karikari into one column, and temperature into one column, wind speed into one
column, and so on. So how do we do that? So when I first started, I tried clicking here
but nothing worked, so I started typing here. So when I go “Cape Karikari” so this is the one,
see that is the one. So I’m just going to add that. After adding, I want to
add the temperature. It is not really intuitive to
have gray colored column here. It took me a while to figure
out how to add the next column, so you just go and click here, and then it adds a new column. So now, I want to
add the temperature. Before I add, I just want to
see what the temperature is. It’s n/a, so I need to add that. Add the temperature. I need to look for n/a. Hopefully, I can find it. I think, it went up. See, this is the n/a, and I want to add the wind
speed, 17 kilometers. I will type “wind speed” and
“17 kilometers per hour.” Let’s add these three for now. Because this is an example, I think, it takes a couple of
records to recognize the pattern. I will add now Cape Reinga, Cape Reinga at undefined. See, now it recognizes the pattern and automatically
creates that column. If you go again, temperature n/a, and add it and now it has
recognized the other. If you look at the details, yeah, before that, we
will add the wind speed. Wind speed is n/a here, so wind speed and if you move out of it,
then it recognizes. If we look below, so the next one is Kaikohe, which is automatically
recognized and add it, and temperature is 11 degrees, and four kilometers
is the wind speed. Once you say, “Okay” here, you can add remaining
columns as well. Now, if you click here, then you can preview that and
then you can do a transform data, and maybe you can rename
the columns to town and this too you can rename it as
temperature and also wind speed. What you can do is, you
can replace temperature, colon, space with nothing, so that you just get the temperature. So replace values, temperature, colon, space and
replace with nothing, and it will give you-all
the replaced ones. Similarly, we can do that
with the wind speed. So wind speed, colon, space replace all with spaces. So this is in the Power Query Editor, and then you can close
and apply this one. So that’s the second way of doing
it with “Add data by example.” Now, let’s go back to the PowerPoint and then
let’s see the next scenario. Next scenario is get data
based on HTML and CSS. So I have a blog here. This is a famous blog in
Power BI, by RADACAD. The authors are from New Zealand, so I pick this one. So if I go to the RADACAD blog. See the blog is here, situation is like this. I want to extract
these article titles. So the way I have done it is, see data one “Time Series Anomalies Detection,” “Same Period
Last Year to Date.” Those are the titles I want
to extract in my Power BI. So if I want to do that, I will just go through this one and go into
the Web Connector again. Get data, web connector, and paste the URL, and then click “Okay” and just wait for it to load. What this displays is the HTML or CSS which we saw in the earlier post, it was named as document. So let’s see what it comes up here. Well, it’s taking a while. Any questions so far?>>No questions Indira,
you can continue.>>Okay. So here, we can
see that it has an element, HTML and it’s a table. So I will just click on this and click on “Transform Data” and here, you need to click on
“Transform Data.” Now, you see that in the Children, you can see that there is a link, and there are two tables. I will just click on this one, then see here that contains
a header and a body, and the body contains again a table. I’ll just expand that, and then here again there is one
division which contains a table. Again, when you look at this one, there are two tables here. I’ll just expand that, and when you go here, there is nothing here, there is no table link
so I won’t expand that, but when you go here, you can see three tables with the
header, division, and footer. I will expand that, and if you look at the header, there are two tables, and if you look at DIV, there are tables, and if you look at the footer, there are two tables. What I will do is I will just expand the div
table to see whether the, and if I see here maybe
I’ll expand this one. When I expand this one, I’m just interested in
the Children because that’s where the table is, and I’m going untick the “Use
original column name as prefix.” Now, I can see the two tables, and here also again there are tables. I’ll expand these two, and then untick these two, and I’m interested
only in the Children. If I look at this one, that’s the main and that’s the aside. I am interested in the main table, so I’ll just expand that, and see here you can see
the name of the article, and that’s what I’m interested in, and then the name as
article name this thing. So I’m interested in that. So I’m just going to use
both Name and Children. So this is the article. So now I can have a look at this one, and then see and expand the table, and again I will include
only the Children column, and then this also I will expand, and again we’ll get
the Children column. So if we look at, yeah, again if we look at that one, there is a division and the header. So I think each one has, that’s an image, that’s a header. So I’m going to expand
this one again. So in the applied steps, if you do something wrong, you can just click on
this thing and go back. So here, I think you can
do this using the source. View paid source as well to find
out which one and it’s going to. So what I want is same
period last year. So I can just find that out. So this is the title I want. So it’s div class, article id in the article, div class and image. So I think because of time constraint
I’m just opening. Meanwhile, we try again. If we go here. Okay. Here, I have
expanded the Children, again expanded another Children. Within the Children
again when I expanded, then the third one, then I could see the text, I think I omitted the
text in the previous one. In the text, we have the time series are normally
and things like that. After that, after
getting these texts, I wanted to remove the nulls and all the other data that I don’t need. What I’ve done is I
have filtered the rows, so I’ve unticked all these and
then just got the filter rows. These are the, scrape through that, just the article titles. The way I have gone wrong is I think, I should have
included the text as well. That’s the third one, third type of web scraping that
you can do using the HTML and CSS. If we look at the fourth one, this is the web scraping
using Power Query. Here, this is a bit complicated and this the technique that
I’m going to show you has been derived by three people:
Miguel, Matt, and Liam. Liam wrote a blog post which I have shared in this
PowerPoint at the end. I’m using that blog post to show you to extract the data
for using Power Query. So if we look at the link
that I’m going to use, this is a hockey statistics
for 2017 and 2018, and this table is what
I want to scrape. So when we look at the URL, it is very simple here. It is visible to the users. So if we just use that, then what happens is I’m
just going to give you an overview before I
dive into the demo. So when you just use that and
scrape using Power Query, you will get only the first page of data which is restricted
to about 50 rows, but there are 20 pages of data, and we want all the 20 pages
into Power BI to analyze. So in this scenario, what we need to do is we need
to follow three or four steps. First thing is to see
what the page number is. In the first scenario when you just use this it will go
to the first page, and you want to go to the
next page till the last page. So there are two scenarios here. You need to see how many
pages are there and also to see how to get
all pages into one table. So this scenario we will go through, and because the URL is not
self-explanatory with the page numbers, you need to use a tool
called Fiddler to get the exact URL that uses
the page variables. So now I’m going to just copy this. Let’s dive into the demo. So this is Power Query and it does, so I just want to. So “Home”. Okay. So “Get
Data” and “Web” again, and we’ll copy that, let’s see what we are going to get. So we get the table. Let’s look at the table. This is what the table is. I just click on the table and then go transform data because
that’s what we want to do. Now, when we transform the data, if we look at the data
as mentioned earlier, we get only 50 rows, but we want all the rows
that are there in the table, including all the pages. To do this, first, what we need to do is we need to pass a
parameter for the page number. The first step is to
create a new parameter, and I will just call that
parameter as PageNumber, and I will start that with 1. Current value is 1, and
then create a page number. Now, all the steps are
defined in this thing, so now, using Fiddler, this is the code, this is the URL that was detected in this
that shows the page number. See page and percent, page equals. So we’ll try this one in the browser and see whether
the data is displayed. When I try this one, see, this is how it displays, and this is on Page 2
because I have parsed the page number 2 to
the parameter page. I’m going to use this link. I’m going to modify in Power BI, in the Power query editor, go to Advanced, and I’m going
to modify this URL with that. First step is this URL with that, and also, parse the
page number parameter. To do that, again, I’ll go into Power BI. To create a new
parameter page number, this is the code that we need to
parse to create it as a function. We need to create it as a function so that the page numbers are passed. If you go here, so
now, if you do that, then you can see that this
has changed to a number. So I’m just going to call this
function as, let’s say, PageData. So now this is a function. If I invoke this function, and let’s say I’m parsing Page 3, and invoke, it will
give me Page 3 status. So Page 3 starts at
101, so a 101-150. But how do we get all the pages? So in order to do that, we just need to create another query. But before we do that, I just want to go through the functions that we used
to create that query. So if we go here, see here, I have put
the all pages code. So if you copy this one, of course, if you copy this one
maybe into a notepad, yeah, I think I have
it in the notepad. So you need to remove
those bullet points. So here, what I’m using is
a list generate function. So it generates the list
of the number of pages. So what this is trying to do is this is page data function
that we have created. So if we invoke the function
with Number 1 as the parameter, if it has values, it will give the values. If there is no data on Page 1, then it should give us null. Otherwise, it’s going to error. So that’s the logic here
that is being used. So I’m just going to copy this one, and go here, and then I’m going to create
a new query and blank query, and I’m going to go to
the Advanced Editor, and then remove what
is there in that, and then I’m just going
to use this query. Page data, remember, if you use the function name as
something different, you need to change the page data to that function name
that you have created. So this has created table values, and the number of rows is 20, which is indicating the
number of pages as 20. If we look at the table here, you can see all the values. Now, what we need to do is we can’t just expand
because this is just a list, because we are generating a list. So now what we need to do is just
convert this list into a table, and then take the
parameters as it is, don’t make any changes. Now, see, there is an expand button, so unless you convert
the list to a table, you can’t expand the column. Now, before we expand, let’s see what we can see. So this has rows, row numbers, and the names, and age, and all the statistics. We can just expand, and make sure when you are expanding, unpick the use original
column name as prefix. If you keep that, it comes up with column1.pk, column1.name, and things like
that, which we don’t want. Once you expand, you get the data, and when you look at the data, see it says 956 rows
have been imported. Here it says 956 throws. Let’s go and see how many
number of rows are there, let’s say, in the last page. Luckily, there are row numbers. There are 956 rows, so we have imported all the 956 rows. We can just go and then
apply the changes, and then do the visualizations or data modeling that we want to do. This is the hockey data. Now, if you go back
to the power query, there is a more
detailed explanation in this blog post here by Liam Bastick, who is an Excel MVP, so you can refer to that. Lastly, to recap of what we learned. We learned what Web Scraping is, and what are some of the common scenarios that
we can use Web Scraping, and the simple mode of Web
Scraping Get Data by example, we have seen using
the Met service data, metadata, and get data
from HTML or CSS. How to scrape the article names
of the blog that we have seen. Lastly, we have seen Web
Scraping using Power Query, and there are other scenarios
as well where you can scrape using R and Python, which I haven’t included. I thought that I will not be able
to finish demoing those to you. That’s the Web Scraping in Power BI. Any questions?>>Yeah, no questions
Indira until now.>>Yeah. I think, that’s
the end of my presentation.>>Okay. You’re done
with the presentation?>>Yes.>>Awesome. It’s a very
good presentation, and it’s a very good Webinar. Do you have something
else to proceed on this, if people want to connect
you to the community, like a place where they
can touch base with you?>>Yeah. My details are here. My Twitter details and
my LinkedIn account. These are the details, I can share the PowerPoint.>>All right. Thank you so much. It’s been a very good Webinar.>>Thank you. Thanks a lot
for the opportunity, Deepak.>>Thanks, Indira. Bye.

Leave a Reply

Your email address will not be published. Required fields are marked *