Revision as of 11:21, 29 August 2008

Scraper creation for dummies

Chapter one

First, some very important reference information. You do not need to read it right now, but keep the links on hand.

Introduction to scraper creation: HOW-TO Write Media Info Scrapers (introduction)
Reference to scraper structure: Scrapers
Tool to test scrapers: Scrap (Download NOW both files referenced there, scrap.exe & libcurl.dll)
Some info about regular expressions: Regular Expression (RegEx) Tutorial
More info on regular expressions from wikipedia: http://en.wikipedia.org/wiki/Regex

How a scraper works

In a nutshell:

If there is a movie.nfo, use it (section NfoUrl) and then go to the last step
Otherwise, generate a search URL from the file's name (section CreateSearchUrl) and get the results
From the results, generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly name and one (or more) associated URLs
Show the listing to the user, who chooses a movie and with it the associated URL(s)
Get the content of the URL(s) and extract from it (section GetDetails) the appropriate data for the movie to store in videodb

Each one of these four sections is built from RegExp entries with this structure: <xml>

     <RegExp input=INPUT output=OUTPUT dest=DEST>
        <expression>EXPRESSION</expression>
     </RegExp>

</xml>
INPUT is usually the content of a buffer (we will see what that is in a moment)
OUTPUT is a string that is built by the RegExp
DEST is the name of the buffer where OUTPUT will be stored
EXPRESSION is a regular expression that manipulates INPUT to extract information from it as "fields". If EXPRESSION is empty, a field "1" containing all of INPUT is created automatically

Here a "buffer" is just a memory section that is used for communication between each section and the rest of XBMC. There are twenty buffers named 1 to 20. To express the content of a buffer you use "$$n", where n is the number of the buffer.

The fields are extracted from the input by EXPRESSION: each pattern enclosed in "(" and ")" becomes a field, and the fields are numbered sequentially, so the first one is \1, the second \2, and so on up to a maximum of 9.

A very easy example: <xml>

     <RegExp input="$$1" output="\1" dest="3">
        <expression></expression>
     </RegExp>

</xml>

The content of buffer 1 is used as input
The output will be stored in buffer 3
As the expression is empty, all the input ($$1) will be stored in field \1
As the output is simply \1, all its content will be used for the output, that is, $$1

So, the end result is that the content of buffer 1 is copied to buffer 3

If you do not know anything about regular expressions, this is the moment to study their basic principles using the references above.

Another example: this time we use a literal string as input and a very simple regular expression to select part of it <xml>

     <RegExp input="Movie: The Dark Knight" output="The title is \1" dest="3">
        <expression>Movie: (.*)</expression>
     </RegExp>

</xml> There, when we apply the expression to the input, the selected pattern (.*) becomes field 1; in this case it gets assigned "The Dark Knight". The output will therefore be "The title is The Dark Knight", and it will be stored in buffer 3.
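The two examples above can be imitated outside XBMC with a few lines of Python. This is only a rough sketch: the function name `apply_regexp` and the dictionary of buffers are made up for illustration, Python's `re` module is not the regular expression engine XBMC uses, and leaving the destination untouched when nothing matches is an assumption.

```python
import re

def apply_regexp(buffers, input_spec, output, dest, expression=""):
    """Imitate one RegExp entry on a dict of numbered buffers (hypothetical helper)."""
    # Expand every "$$n" in the input into the content of buffer n.
    text = re.sub(r"\$\$(\d+)", lambda m: buffers.get(int(m.group(1)), ""), input_spec)
    if expression == "":
        # Empty expression: the whole input becomes field 1.
        fields = {1: text}
    else:
        match = re.search(expression, text)
        if match is None:
            # Assumption: with no match, nothing is written to the destination.
            return ""
        fields = {i: g or "" for i, g in enumerate(match.groups(), start=1)}
    # Replace each "\n" in the output template with field n (empty if missing).
    result = re.sub(r"\\(\d)", lambda m: fields.get(int(m.group(1)), ""), output)
    buffers[dest] = result
    return result

buffers = {1: "Hello, world"}
# First example: copy buffer 1 to buffer 3.
apply_regexp(buffers, "$$1", r"\1", 3)
print(buffers[3])
# Second example: select the title from a literal string.
apply_regexp(buffers, "Movie: The Dark Knight", r"The title is \1", 3, r"Movie: (.*)")
print(buffers[3])
```

Running it prints the content of buffer 3 after each step, which makes it easy to try other inputs and expressions before writing them into a scraper.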

The most important sections in a scraper

Now, let's have a look at the 3 "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them that we need to know.

CreateSearchUrl must generate the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which XBMC stores in buffer 1.

GetSearchResults must generate the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 1. The listing must have this structure: <xml> <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?> <results>

  <entity>
     <title></title>
     <url></url>
  </entity>
  <entity>
     <title></title>
     <url></url>
  </entity>

</results> </xml> Each <entity> must have a <title> (the text that will be shown to the user) and at least one <url>, although there can be up to 9. You can generate as many <entity> elements as you need; together they become the listing shown to the user to choose from.
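To get a feeling for the shape of that listing, here is a small Python sketch that builds it from a list of candidates. It is not part of XBMC: the helper name `build_results` is hypothetical, and the XML declaration line is left out for brevity.

```python
import xml.etree.ElementTree as ET

def build_results(candidates):
    """Build a <results> listing from (title, [urls]) pairs (hypothetical helper)."""
    results = ET.Element("results")
    for title, urls in candidates:
        # Each entity needs at least one URL and may carry up to 9.
        if not 1 <= len(urls) <= 9:
            raise ValueError("each entity needs between 1 and 9 URLs")
        entity = ET.SubElement(results, "entity")
        ET.SubElement(entity, "title").text = title
        for url in urls:
            ET.SubElement(entity, "url").text = url
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(results, encoding="unicode")

print(build_results([("Dummy", ["http://www.nada.com"])]))
```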

Once the user has selected a movie, the associated URL(s) will be downloaded.

Last, GetDetails must generate the detailed information about the movie in the correct format, using the content of the URL(s) selected in GetSearchResults. The content of the first URL will be in $$1, the second in $$2 and so on.

The structure that the listing must have is this: <xml> <details>

   <title></title>
   <year></year>
   <director></director>
   <top250></top250>
   <mpaa></mpaa>
   <tagline></tagline>
   <runtime></runtime>
   <thumb></thumb>
   <credits></credits>
   <rating></rating>
   <votes></votes>
   <genre></genre>
   <actor>
       <name></name>
       <role></role>
   </actor>
   <outline></outline>
   <plot></plot>

</details> </xml> Notes:

Some fields can be missing or empty
<thumb> contains the URL of the image to be downloaded later
<genre>, <credits>, <director> and <actor> can be repeated as many times as needed

Some important details to remember:

When you need to use special characters literally in a regular expression, do not forget to "escape" them:
- \ → \\
- ( → \(
- . → \.
- (etc)
Since the scraper itself is an XML file, the characters with special meaning in XML cannot be used directly, so you must use their entity aliases:
- & → &amp;
- < → &lt;
- > → &gt;
- " → &quot;
- ' → &apos;
If you use non-ASCII characters in your XML to be used in the output (umlauts, ñ, etc), they must be coded with the appropriate encoding as expressed in the XML file (in our example it was iso-8859-1, as you see in the code)
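Escaping a long output string by hand is error-prone, so it can help to let a small script do it. The sketch below uses Python's standard `xml.sax.saxutils.escape`; the helper name `to_attribute_text` is made up for this example.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by itself; the two quote characters are added here.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def to_attribute_text(raw):
    """Return raw text escaped for use inside an XML attribute value (hypothetical helper)."""
    return escape(raw, QUOTES)

print(to_attribute_text('<url>http://www.nada.com</url>'))
```

The printed string can be pasted as the value of an output="..." attribute.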

Our first working scraper

Now, with all that information, let's create our first scraper. Just create a dummy.xml file with this content and study it a little; it should be fairly easy to understand with what we already know: <xml>
<scraper name="dummy" content="movies" thumb="imdb.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression></expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.nada.com&lt;/url&gt;" dest="3">
      <expression></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;&lt;entity&gt;&lt;title&gt;Dummy&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
      <expression></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;The Dummy Movie&lt;/title&gt;&lt;year&gt;2008&lt;/year&gt;&lt;director&gt;Dummy Dumb&lt;/director&gt;&lt;tagline&gt;Some dumb dummies&lt;/tagline&gt;&lt;credits&gt;Dummy Dumb&lt;/credits&gt;&lt;actor&gt;&lt;name&gt;Dummy Dumb&lt;/name&gt;&lt;role&gt;The dumb dummy&lt;/role&gt;&lt;/actor&gt;&lt;outline&gt;&lt;/outline&gt;&lt;plot&gt;Some dummies doing dumb things&lt;/plot&gt;&lt;/details&gt;" dest="3">
      <expression></expression>
    </RegExp>
  </GetDetails>
</scraper>
</xml>

This is a really stupid scraper with no meaningful use whatsoever: whatever movie you feed it, it will always generate the same (fake) data, and it will download information from www.nada.com without using it at all. Nevertheless, we have our first working scraper. Congratulations!

To test it in Windows, put the files scrap.exe and libcurl.dll (referenced at Scrap) and the dummy.xml file in any directory and then execute, for example, this:

scrap dummy.xml "Hello, world"

It should execute without errors and show you each step and its output.

You can also try it in a "real" XBMC, just copy dummy.xml to XBMC\system\scrapers\video, start XBMC, choose any directory from your sources that contains a video file not incorporated into the library, "set content" of the directory to use "dummy" as scraper and finally select "movie info" over the video file. All our fake data will be incorporated into the video database.

Chapter two

Introduction

Now that we know how to create a skeleton scraper, let's re-create a real one. I've chosen a fairly simple one, the scraper for the Spanish site culturalia.es (in fact the URL is http://www.culturalianet.com). First of all, we must know how the site we intend to write the scraper for works.

Open http://www.culturalianet.com. To perform a search, write "la noche es nuestra" (the Spanish title of "We Own the Night") in the "buscar:" box at the top of the page. When you press the Buscar ("Search") button, the URL opened is:

http://www.culturalianet.com/bus/resu.php?texto=la+noche+es+nuestra&donde=1

CreateSearchUrl

So, very easy: our search URL will be "http://www.culturalianet.com/bus/resu.php?texto=" + (text to search) + "&donde=1"

For example: <xml>

 <RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
	 <expression></expression>
 </RegExp>

</xml> So far, so good: the input (the name of the movie, already stripped by XBMC of the file extension and of some common words like "divx", "ac3" and so on) goes in field 1, and to generate the output we just write \1 at the point where we need it.
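The same URL can be reproduced in Python to check it against what the browser showed. This is a sketch only: `build_search_url` is a made-up name, and the use of `quote_plus` mirrors the "+" separators seen in the browser's address bar, not anything this guide states about how XBMC encodes the text.

```python
from urllib.parse import quote_plus

SEARCH_PREFIX = "http://www.culturalianet.com/bus/resu.php?texto="
SEARCH_SUFFIX = "&donde=1"

def build_search_url(title):
    """Build the culturalianet search URL for a movie title (hypothetical helper)."""
    # quote_plus turns spaces into "+" and percent-encodes other unsafe characters.
    return SEARCH_PREFIX + quote_plus(title) + SEARCH_SUFFIX

print(build_search_url("la noche es nuestra"))
```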

GetSearchResults

Now we must understand how the results page is formatted; for that, the "View selection source" function of Firefox is very useful. Just select the end of the header of the listing and some of the first entries and choose "View selection source". This is what I get:<xml>Se han encontrado 249 artículos. Se muestran del 1 al 25.
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=1">26</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=2">51</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=3">76</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=4">101</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=5">126</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=6">151</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=7">176</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=8">201</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=9">226</a>
</td></tr>
<tr><td><b>
<a href="../art/ver.php?art=29405" target="_top">Noche es nuestra, La.</a></b></td></tr>
<tr><td colspan="2"><i>We Own the Night</i>. De James Gray (2007)</td></tr>
<tr><td><b><a href="../art/ver.php?art=23798" target="_top">10 + 2: La noche mágica.</a></b></td></tr>
<tr><td colspan="2"><i>10   2: La noche mágica</i>. De Miquel Pujol Lozano (2000)</td></xml>

See? For each entry we simply need to select the title, maybe some extra information, and the URL, and repeat that for all the entries in the listing. Fortunately, XBMC offers some help that we haven't seen yet: the <expression> part of a RegExp can have attributes. In this case, to apply <expression> to the input repeatedly, as many times as there is data for it, we simply add 'repeat="yes"' as an attribute: <xml><expression repeat="yes"></xml>

Now let's build the expression. We will extract culturalianet's ID of the article about the movie, the Spanish title, the original title, the name of the director and the year of the movie. The ID we get from:<xml><a href="../art/ver.php?art=29405" target="_top"></xml>

is just a string of digits; to select it as a field we surround it with parentheses:<xml><a href="../art/ver.php\?art=([0-9]*)" target="_top"></xml>

After that comes the Spanish title, ending in a dot and followed by </a>, so we select as our second field a string of any length (it must have at least one character) that does not contain "<":<xml>(.[^<]*)\.</a></xml>

Then there is some formatting and, surrounded by <i> and </i>, the original title (again a string of one or more characters). We jump over the formatting with [^<i>]* and select our third field:<xml>[^<i>]*(.[^<]*)</xml>

Then there is </i> and the literal "De " followed by the director's name, our fourth field, which runs up to the year of the movie that appears surrounded by parentheses:<xml><\i>\. De (.[^\(]*)</xml>

Our fifth and last field is the movie year, which sits between (but does not include) the characters "(" and ")":<xml>\(([0-9]*)\)</xml>

All put together, and exchanging "<" for "&lt;" etc., this is our <expression>: <xml><expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;[^&lt;i&gt;]*(.[^&lt;]*)&lt;\i&gt;\. De (.[^\(]*)\(([0-9]*)\)</expression></xml>

There, the fields will be:

\1 ID of the movie's article in culturalianet.com
\2 Spanish title
\3 Original title
\4 Director's name
\5 Year the movie was first shown

Each of our <entity> elements will have a <title> in the form:

'Noche es nuestra, la' (We own the night) de James Gray (2007)

or, with our actual fields:

'\2' (\3) de \4 (\5)

Also there will be a <url> generated by:

http://www.culturalianet.com/art/ver.php?art=\1
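To experiment with the five fields outside XBMC, the sample listing can be fed to Python's `re` module. This is a sketch under assumptions: the pattern below is rewritten for Python and is not the scraper's own expression, the name `list_candidates` is made up, and `finditer` only plays the role that repeat="yes" plays in XBMC.

```python
import re

# A shortened copy of the sample listing shown earlier.
SAMPLE = '''<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=1">26</a>
</td></tr>
<tr><td><b>
<a href="../art/ver.php?art=29405" target="_top">Noche es nuestra, La.</a></b></td></tr>
<tr><td colspan="2"><i>We Own the Night</i>. De James Gray (2007)</td></tr>
<tr><td><b><a href="../art/ver.php?art=23798" target="_top">10 + 2: La noche mágica.</a></b></td></tr>
<tr><td colspan="2"><i>10   2: La noche mágica</i>. De Miquel Pujol Lozano (2000)</td>'''

ENTRY = re.compile(
    r'<a href="\.\./art/ver\.php\?art=([0-9]+)" target="_top">'  # \1 article ID
    r'([^<]+)\.</a>'                                             # \2 Spanish title
    r'.*?<i>([^<]+)</i>'                                         # \3 original title
    r'\. De ([^(]+) '                                            # \4 director
    r'\(([0-9]+)\)',                                             # \5 year
    re.DOTALL,
)

def list_candidates(html):
    """Return (title, url) pairs, one per listing entry (hypothetical helper)."""
    return [
        (m.expand(r"'\2' (\3) de \4 (\5)"),
         m.expand(r"http://www.culturalianet.com/art/ver.php?art=\1"))
        for m in ENTRY.finditer(html)
    ]

for title, url in list_candidates(SAMPLE):
    print(title, "->", url)
```

The pagination links are skipped because they point to resu.php rather than ../art/ver.php, so only real entries produce a match.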

Like we did with our dummy scraper, we add all the necessary headings and this is the result:<xml>
<GetSearchResults dest="8">
  <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
    <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2' (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
      <expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;[^&lt;i&gt;]*(.[^&lt;]*)&lt;\i&gt;\. De (.[^\(]*)\(([0-9]*)\)</expression>
    </RegExp>
    <expression noclean="1" />
  </RegExp>
</GetSearchResults>
</xml> There are a few things there we have not seen yet. For starters, there are two nested RegExp entries; they get evaluated from the inner ones to the outer ones. Also, there is an attribute for <expression> we haven't seen yet, 'noclean="1"'. By default, XBMC strips all HTML formatting from what an expression selects, but here we do not want that, so we add noclean="1" to tell XBMC to leave it untouched.

Also, and this is standard XML, you can shorten empty XML elements like

<expression></expression>

by writing instead:

<expression/>

So, how does XBMC execute this? It goes to the inner RegExp and, using input="$$1" (the content downloaded from our search URL), applies the expression to it and generates our fields: <xml><expression repeat="yes"><a href='../art/ver.php\?art=([0-9]*)' target='_top'>(.[^<]*)\.</a>[^<i>]*(.[^<]*)<\i>\. De (.[^\(]*)\(([0-9]*)\)</expression></xml>

In the previous line, for clarity, I'm using < instead of &lt;

That code writes this output to buffer 5: <xml><entity><title>'\2' (\3) de \4 (\5)</title><url>http://www.culturalianet.com/art/ver.php?art=\1</url></entity></xml> It repeats this for as long as <expression> finds a match in the input, generating one <entity> per match, and everything goes to $$5.

Then the outer RegExp gets executed. It uses as input $$5, which has just been generated, and does not modify anything (an empty <expression> means all input goes to \1), but remember to use the noclean attribute to keep the necessary formatting. It simply takes all the <entity> elements generated and inserts them in the correct XML structure: <xml><?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results></xml>

All output goes to buffer 8.
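The inner-then-outer evaluation order can be mimicked with a toy Python model. Everything here is illustrative: `run_regexp` is a made-up helper, the input text and the example.invalid URLs are invented, and Python's `re` stands in for XBMC's own engine.

```python
import re

def run_regexp(buffers, source, expression, output, dest, repeat=False):
    """Apply one RegExp-like step and store its result in a buffer (hypothetical helper)."""
    text = buffers[source]
    if not expression:
        # Empty expression: the whole input becomes field 1.
        result = output.replace(r"\1", text)
    else:
        matches = list(re.finditer(expression, text))
        if not repeat:
            matches = matches[:1]
        # One filled-in copy of the output template per match, concatenated.
        result = "".join(m.expand(output) for m in matches)
    buffers[dest] = result
    return result

buffers = {1: "id=7 name=Alpha; id=9 name=Beta;"}
# Inner step first: one <entity> per match, accumulated in buffer 5.
run_regexp(buffers, 1, r"id=(\d+) name=(\w+);",
           r"<entity><title>\2</title><url>http://example.invalid/\1</url></entity>",
           5, repeat=True)
# Outer step second: wrap the content of buffer 5 and store it in buffer 8.
run_regexp(buffers, 5, "", r"<results>\1</results>", 8)
print(buffers[8])
```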

HOW-TO:Write media scrapers: Difference between revisions