Revision as of 10:34, 29 August 2008

Scraper creation for dummies

Chapter one

First, some very important reference material. You do not need to read it right now, but keep the links at hand:

Introduction to scraper creation: HOW-TO Write Media Info Scrapers (introduction)
Reference to scraper structure: Scrapers
Tool to test scrapers: Scrap (download both files referenced there now: scrap.exe and libcurl.dll)
Some info about regular expressions: Regular Expression (RegEx) Tutorial
More info on regular expressions from Wikipedia: http://en.wikipedia.org/wiki/Regex

How a scraper works

In a nutshell:

If there is a movie.nfo file, use it (section NfoUrl) and then go to the last step
Otherwise, generate a search URL from the file's name (section CreateSearchUrl) and get the results page
From the results, generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly title and one or more associated URLs
Show the listing to the user, who chooses a movie and thereby selects its associated URL(s)
Get the content of the URL(s) and extract from it (section GetDetails) the appropriate data for the movie to store in videodb

Each of these four sections is made up of a RegExp entry with this structure: <xml>

     <RegExp input="INPUT" output="OUTPUT" dest="DEST">
        <expression>EXPRESSION</expression>
     </RegExp>

</xml>
INPUT is usually the content of a buffer (explained in a moment)
OUTPUT is a string that is built up by the RegExp
DEST is the name of the buffer where OUTPUT will be stored
EXPRESSION is a regular expression that is applied to INPUT to extract information from it as "fields". If EXPRESSION is empty, a field "1" containing all of INPUT is created automatically

Here, a "buffer" is just a memory area used for communication between each section and the rest of XBMC. There are twenty buffers, named 1 to 20. To refer to the *content* of a buffer, use "$$n", where n is the number of the buffer.

EXPRESSION extracts fields from the input by enclosing patterns in "(" and ")". The fields are numbered sequentially: the first one is \1, the second \2, and so on up to a maximum of 9.

A very easy example: <xml>

     <RegExp input="$$1" output="\1" dest="3">
        <expression></expression>
     </RegExp>

</xml>

The content of buffer 1 is used as input
The output will be stored in buffer 3
As the expression is empty, all the input ($$1) will be stored in field \1
As the output is simply \1, all of its content will be used as output, that is, $$1

So the end result is that the content of buffer 1 is copied to buffer 3

If you do not know anything about regular expressions, now is the time to make a quick study of their principles using the references above.

Another example: this time we use a literal string as input and a very simple regular expression to select part of it: <xml>

     <RegExp input="Movie: The Dark Knight" output="The title is \1" dest="3">
        <expression>Movie: (.*)</expression>
     </RegExp>

</xml> Here, when the expression is applied to the input, the selected pattern (.*) becomes field 1, which in this case is assigned "The Dark Knight". The output will therefore be "The title is The Dark Knight", and it will be stored in buffer 3.
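
As a further illustration, here is a minimal sketch with two selected patterns, following the same rules as the examples above. The input string and the buffer number are made up for the example; the parentheses that are part of the text are escaped as \( and \), and [0-9]+ is used to match the year. <xml>

     <RegExp input="The Dark Knight (2008)" output="\1 was released in \2" dest="3">
        <expression>(.*) \(([0-9]+)\)</expression>
     </RegExp>

</xml> Under these assumptions, field 1 should get "The Dark Knight" and field 2 "2008", so buffer 3 would end up holding "The Dark Knight was released in 2008".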

The most important sections in a scraper

Now, let's have a look at the three "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them that we need to know.

CreateSearchUrl must generate the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which XBMC stores in buffer 1.
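
A minimal sketch of such a section might look like this. The site www.example.com and its search.php?q= parameter are hypothetical, chosen only for illustration. The empty expression puts the whole file name from buffer 1 into field \1, which is then appended to the URL. The <url> tags in the output are written with XML entities because the scraper is itself an XML file. <xml>

     <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url&gt;http://www.example.com/search.php?q=\1&lt;/url&gt;" dest="3">
           <expression></expression>
        </RegExp>
     </CreateSearchUrl>

</xml>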

GetSearchResults must generate the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 5. The listing must have this structure: <xml>
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>

  <entity>
     <title></title>
     <url></url>
  </entity>
  <entity>
     <title></title>
     <url></url>
  </entity>

</results>
</xml> Each <entity> must have a <title> (the text that will be shown to the user) and at least one <url>, although there can be up to 9. You can generate as many <entity> elements as you need; together they form the listing shown to the user to choose from.

Once the user has selected a movie, the associated URL(s) will be downloaded.

Last, GetDetails must generate the detailed information about the movie in the correct format, using the content of the URL(s) selected from GetSearchResults. The content of the first URL will be in $$1, that of the second in $$2, and so on.

The listing must have this structure: <xml>
<details>

   <title></title>
   <year></year>
   <director></director>
   <top250></top250>
   <mpaa></mpaa>
   <tagline></tagline>
   <runtime></runtime>
   <thumb></thumb>
   <credits></credits>
   <rating></rating>
   <votes></votes>
   <genre></genre>
   <actor>
       <name></name>
       <role></role>
   </actor>
   <outline></outline>
   <plot></plot>

</details>
</xml> Notes:

Some fields can be missing or empty
<thumb> contains the URL of the image to be downloaded later
<genre>, <credits>, <director> and <actor> can be repeated as many times as needed
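
As a rough sketch, a GetDetails section that fills in only the title could look like the following. The page markup (an h1 element around the title) is an assumption about a hypothetical site, not something any real site is known to use. The remaining fields are simply left out, which the notes above allow. <xml>

     <GetDetails dest="3">
        <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;\1&lt;/title&gt;&lt;/details&gt;" dest="3">
           <expression>&lt;h1&gt;([^&lt;]*)&lt;/h1&gt;</expression>
        </RegExp>
     </GetDetails>

</xml>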

Some important details to remember:

When you need to use special characters literally in a regular expression, do not forget to "escape" them:
- \ -> \\
- ( -> \(
- . -> \.
- etc.
Since the scraper itself is an XML file, the characters that have a meaning in XML cannot be used directly, so you must use their entities instead:
- & -> &amp;
- < -> &lt;
- > -> &gt;
- " -> &quot;
- ' -> &apos;
If you use non-ASCII characters in your XML for use in the output (umlauts, ñ, etc.), they must be encoded as ISO-8859-1
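
To illustrate the two escaping rules together, here is a sketch of a RegExp entry whose expression is meant to match a page fragment consisting of the text "Rating (out of 10):" in bold (b tags), followed by a space and a number such as 8.9. The fragment and the buffer numbers are invented for the example. The literal parentheses and the dot are escaped for the regular expression, and the angle brackets are written as entities for XML. <xml>

     <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="3">
        <expression>&lt;b&gt;Rating \(out of 10\):&lt;/b&gt; ([0-9]+\.[0-9]+)</expression>
     </RegExp>

</xml>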

Our first working scraper

Now, with all that information, let's create our first scraper. Create a dummy.xml file with this content and study it a little. It should be fairly easy to understand with what we already know: <xml>
<scraper name="dummy" content="movies" thumb="imdb.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <NfoUrl dest="3">
      <RegExp input="$$1" output="\1" dest="3">
         <expression></expression>
      </RegExp>
   </NfoUrl>
   <CreateSearchUrl dest="3">
      <RegExp input="$$1" output="&lt;url&gt;http://www.nada.com&lt;/url&gt;" dest="3">
         <expression></expression>
      </RegExp>
   </CreateSearchUrl>
   <GetSearchResults dest="8">
      <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;&lt;entity&gt;&lt;title&gt;Dummy&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
         <expression></expression>
      </RegExp>
   </GetSearchResults>
   <GetDetails dest="3">
      <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;The Dummy Movie&lt;/title&gt;&lt;year&gt;2008&lt;/year&gt;&lt;director&gt;Dummy Dumb&lt;/director&gt;&lt;tagline&gt;Some dumb dummies&lt;/tagline&gt;&lt;credits&gt;Dummy Dumb&lt;/credits&gt;&lt;actor&gt;&lt;name&gt;Dummy Dumb&lt;/name&gt;&lt;role&gt;The dumb dummy&lt;/role&gt;&lt;/actor&gt;&lt;outline&gt;&lt;/outline&gt;&lt;plot&gt;Some dummies doing dumb things&lt;/plot&gt;&lt;/details&gt;" dest="3">
         <expression></expression>
      </RegExp>
   </GetDetails>
</scraper>
</xml>

This is a really stupid scraper with no meaningful use whatsoever: whatever movie is fed to it, it will always generate the same (fake) data. It will also download information from www.nada.com and not use it at all. Nevertheless, we have our first working scraper. Congratulations!

To test it in Windows, put the files scrap.exe and libcurl.dll (both referenced at Scrap) and the dummy.xml file in any directory, and then execute, for example, this:

scrap dummy.xml "Hello, world"

It should execute without errors and show you each step and its output.

You can also try it in a "real" XBMC: copy dummy.xml to XBMC\system\scrapers\video, start XBMC, choose any directory from your sources that contains a video file not yet in the library, use "set content" on the directory to select "dummy" as the scraper, and finally select "movie info" on the video file. All our fake data will be added to the video database.


HOW-TO:Write media scrapers: Difference between revisions