Let's learn how to get data from PDFs using Power BI. I'll be using data from WHO relating to worldwide coronavirus cases.
I want to start out by saying that by far the most intriguing thing I discovered is that a country's density does not seem positively correlated with the amount of cases per capita.
data:image/s3,"s3://crabby-images/2a823/2a8233f9dfe990c7feb9f10f944fa85ba7062623" alt=""
Density below 1000 people per km squared seems to have a negative correlation with the amount of cases. This would suggest that countries that have a less dense population are spreading the virus more.
Density above 1000 people per km squared seems to have a positive correlation with the amount of cases, although, there aren't many countries in that sample bucket.
data:image/s3,"s3://crabby-images/876fd/876fd1f686d78cae9b5c40c0539dec871cedcb20" alt=""
Getting Data From PDFs
Normally, you can simply use the PDF connector that Power BI has built into the UI.
data:image/s3,"s3://crabby-images/42b75/42b75a0322306c9e66a71eeb8c4ee7a5ef7e2325" alt=""
And from there you would just choose the PDF file. However, my PDF is on the web, so I need to go through the web connector first.
data:image/s3,"s3://crabby-images/e87fc/e87fc73dc05d4fe1c3ec3258a850eff5e0372127" alt=""
Because the URL I'm using is returning a PDF, Power BI automagically gives us the PDF contents. If you are wondering what the M code is, here it is:
let
url =
"https://www.who.int/docs/default-source/coronavirus/
situation-reports/20200714
-covid-19-sitrep-176.pdf",
myWebContents = Web.Contents(url),
myPDF = Pdf.Tables(myWebContents)
in
myPDF
Extracting Data From PDFs
The next part was a bit tricky....
Problem:
"How can I point to a value within a table where the ID isn't consistent?"
The table that I need is split over several pages and the name of it isn't consistent either. For example, in the below snippet, the table I need is ID "Table003". This ID changes depending on which report date you are looking at.
data:image/s3,"s3://crabby-images/cc539/cc53906f0994183f4b541def0e4e6382733181ff" alt=""
This is a problem (especially if you are going to be using this method for multiple PDFs) because if you click to expand the Table003 table, Power Query will store a hard coded reference to that ID.
data:image/s3,"s3://crabby-images/495bd/495bd35dee108b3b24e4e694ea2478ea7ee7d17b" alt=""
Solution:
To remedy the situation, we need to create "hooks" for Power Query to grab onto, and pull its way to the correct table.
1. Filter to show only Tables.
2. Add a helper column - column count. I knew the table I needed only had 7 columns.
Table.ColumnCount()
data:image/s3,"s3://crabby-images/a2727/a27270213557253c3e72bdd98216b2075b30335d" alt=""
3. Filter on the helper column where column count equals 7. Doing this filters down the tables to the one single table that I need.
data:image/s3,"s3://crabby-images/58e07/58e0732cfeb9f250c43114b73b3fef0b33477bbe" alt=""
From there, you just need to click the Table to extract the data.
data:image/s3,"s3://crabby-images/b08db/b08db0a3dff9d8240e92f68f5767dedf6a6483ee" alt=""
Sources
Coronavirus Cases: https://www.who.int
Global Populations: https://www.worldometers.info/world-population/population-by-country/
If you are looking to improve your DAX skills, I would recommend Supercharge Power BI by Matt Allington:
Comments