At work I have been tasked with extracting the product description information for several of our products to be used as an import for an external system. Normally I would just write a query for the database to get all of this information, but in this case I do not have direct access to the database. Even if I did, I am not familiar with the schema, so I would rather not spend the effort digging into it if I do not have to. I have been putting off this project since it was not high priority, but recently I came across an excellent talk by Evelina Gabasova at NDC Oslo where she showed the use of type providers to connect to IMDB and extract data on cast members of Star Wars. As I watched this I had a eureka moment: “Why not just use F# to pull the data directly from the website instead of dealing with the SQL schema?”
Initial Attempt with HtmlProvider
Now, this may seem a little silly but for my case it has some advantages. I do not have to bother with getting permissions for the database running the company e-commerce website and it allows me to use some F#. I quickly fire up a new F# project in VS Code and stub out the following:
This is when I run into a problem. In the talk that Evelina Gabasova gave, the data on IMDB was in a nice table. This meant that the TypeProvider could detect it automatically and provide it as a nice property of the test object in the above example. My problem is that the data I need is in the Description area of the page, specifically the <span> with the attribute itemprop="description". I am trying to turn this information:
into something like this:
| WHAT IT IS | A lightweight silky gel cleanser for Normal to Oily skin types that gently removes makeup and soothes red, sensitive skin. |
| BENEFITS FOR YOU | Safe for even the most sensitive skin. Removes excess oil and makeup. Soothes and refreshes sensitive, irritated skin. |
This means that I need a different approach. Thankfully, F# delivered.
Using HTML Parser
If HtmlProvider does not give you what you need for HTML parsing, then FSharp.Data also has a handy HTML Parser which includes some excellent documentation and examples. I put together a new script to extract the data from the website.
Lines 1 through 5 are just getting the HTML for the product listing. The getDescription function is what actually breaks down the HTML to return a tuple with the information that I am interested in. What I like most about this is that I did not have to use XPath or some other XML querying tool. While XPath may be powerful, I find I have difficulty achieving what I really want with it. I find the F# approach shown here much more straightforward.
In line 8 the function is extracting every node in the HTML which is a span. This will obviously return spans that we are not interested in, which is why we need to filter the result using Seq.filter. Since I know that the span I am interested in has the attribute itemprop="description", I use a function to return false when that attribute is not present and true when it is present. Line 10 highlights one of my favorite features of F#, which is the returning of an Option type. The function TryGetAttribute will either successfully return the attribute, which is a type of Some 'T, or it returns None. In the case of None I simply have the function return false. If the attribute does exist I then test if it is equal to “description” on line 14. If it does match, the function returns true. In all other cases the function returns false.
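To make the filtering step concrete, here is a minimal sketch run against a made-up fragment rather than the real product page. The sampleHtml value and the isDescription name are placeholders I chose for illustration, and the #r line assumes the script is run with dotnet fsi so that FSharp.Data can be pulled from NuGet.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical stand-in for a product page; the real markup will differ.
let sampleHtml = """
<html><body>
  <span itemprop="name">Gel Cleanser</span>
  <span class="price">20.00</span>
  <span itemprop="description">A lightweight silky gel cleanser.</span>
</body></html>"""

// True only when the node has an itemprop attribute whose value is "description".
let isDescription (node: HtmlNode) =
    match node.TryGetAttribute("itemprop") with
    | None -> false                                // attribute missing
    | Some attr -> attr.Value() = "description"    // attribute present: compare its value

let doc = HtmlDocument.Parse(sampleHtml)

// Every <span> in the document, narrowed down to the ones we care about.
let descriptionSpans =
    doc.Descendants [ "span" ]
    |> Seq.filter isDescription
    |> Seq.toList
```

Matching on the Option keeps the "attribute missing" and "attribute has the wrong value" cases explicit, which mirrors the two false branches described above.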
Since I know how the HTML is rendered on these pages, I know that there will only ever be one of these <span> elements in the HTML, so I use the function Seq.exactlyOne to select a single element from the sequence. Lines 18 to 27 are where I actually pull out the information that I want. I have an odd problem in that the text in the <strong> element is what I want the tag name to be for the output table and the rest of the text in the parent <p> element is to be the text data. To do this I first extract the text in the <strong> element on lines 21 through 24. I then extract all of the text from the parent <p> element, which includes the <strong> text, and then select a substring which excludes the <strong> text on line 22. I then return a tuple of the tag name and the associated text.
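The extraction described above could look roughly like the sketch below. It works on a made-up fragment shaped like the description area, and the names fragment, splitParagraph, and rows are placeholders for illustration only. It assumes each <p> holds a single <strong> label followed by the body text, and that the script is run with dotnet fsi so the #r line can fetch FSharp.Data.

```fsharp
#r "nuget: FSharp.Data"
open System
open FSharp.Data

// Hypothetical fragment shaped like the description area discussed above.
let fragment = """
<span itemprop="description">
  <p><strong>WHAT IT IS</strong> A lightweight silky gel cleanser.</p>
  <p><strong>BENEFITS FOR YOU</strong> Safe for even the most sensitive skin.</p>
</span>"""

let doc = HtmlDocument.Parse(fragment)

// Seq.exactlyOne throws if zero or several spans match, so a layout change fails loudly.
let descriptionSpan =
    doc.Descendants [ "span" ]
    |> Seq.filter (fun span ->
        match span.TryGetAttribute("itemprop") with
        | Some attr -> attr.Value() = "description"
        | None -> false)
    |> Seq.exactlyOne

// Split one <p> into (label, body); None when the paragraph has no <strong>.
let splitParagraph (p: HtmlNode) =
    p.Descendants [ "strong" ]
    |> Seq.tryHead
    |> Option.map (fun strong ->
        let label = strong.InnerText().Trim()
        let full = p.InnerText().Trim()   // includes the <strong> text
        // Drop the label only when it really is the prefix of the paragraph text.
        let body =
            if full.StartsWith(label, StringComparison.Ordinal)
            then full.Substring(label.Length).Trim()
            else full
        label, body)

// One (label, body) tuple per paragraph that has a <strong> label.
let rows =
    descriptionSpan.Descendants [ "p" ]
    |> Seq.choose splitParagraph
    |> Seq.toList
```

Returning an Option from splitParagraph and collecting with Seq.choose means paragraphs without a <strong> label are skipped instead of raising an exception.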
While I am sure this is not the most elegant way to go about this, it was incredibly simple compared to some previous efforts I have had trying to get XPath to work on other projects. I find the F# syntax and approach much more straightforward and easier to understand. Is there a better way for me to have done this? Could the code be more idiomatic? All comments and suggestions are appreciated.