Scrapers

From Official Kodi Wiki
Revision as of 16:20, 26 January 2007 by >Asteron (Fix spelling)
INCOMPLETE: This page or section is incomplete. Please add information or correct uncertain data, which is marked with a ?

XBMC uses scrapers to retrieve data from web pages for use in the video library. Until version 2.0 of XBMC, only an IMDb scraper was built in. It is now possible to create custom scrapers that collect data from any web page.

There is no file called scraper.xml; "scraper" here is a placeholder for the name of your scraper. XBMC currently ships with imdb.xml and jadedVideo.xml, which are stored in xbmc\system\scrapers\video.

Prerequisites

  • Knowledge of regular expressions
  • Basic knowledge of XML syntax

Layout

The general layout of scraper.xml is as follows:
<xml>
<scraper>

  <CreateSearchUrl>
     <RegExp>
        <expression></expression>
     </RegExp>
  </CreateSearchUrl>
  <GetSearchResults>
     <RegExp>
        <expression></expression>
     </RegExp>
  </GetSearchResults>
  <GetDetails>
     <RegExp>
        <expression></expression>
     </RegExp>
  </GetDetails>
  <CustomFunction>
     <RegExp>
        <expression></expression>
     </RegExp>
  </CustomFunction>

</scraper>
</xml>

If RegExp tags are nested, they are processed in LIFO (last in, first out) order: the innermost RegExp is evaluated first, so an outer RegExp can use its result as input.

XML character entity references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:

  • &amp; → &
  • &lt; → <
  • &gt; → >
  • &quot; → "
  • &apos; → '

This means that inside a scraper file these characters must be written as entities, both in regular expressions and in attribute values such as output, rather than as the literal characters.
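The escaping step above can be automated when a regular expression is developed and tested outside of XML first. The following is a minimal sketch in Python using the standard library; the sample expression is hypothetical and not taken from any shipped scraper.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical regular expression, written the way you would test it outside XML.
raw_expression = '<a href="([^"]*)">([^<]*)</a>'

# escape() always handles &, < and >; the extra mapping adds the two quote characters,
# which covers all five predefined XML entities.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
xml_safe = escape(raw_expression, QUOTE_ENTITIES)
print(xml_safe)

# Reversing the mapping recovers the original expression.
restored = unescape(xml_safe, {"&quot;": '"', "&apos;": "'"})
```

The value of xml_safe is what would be pasted between the expression tags or into an attribute value.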

Regular Expression Engine

There are a few things to note about the regular expression engine:

  • Lazy quantifiers do not work, e.g. <xml>.*?</xml> For a workaround, see "An Alternative to Laziness" at http://www.regular-expressions.info/repeat.html
  • \w and \d do not work; use [a-zA-Z] and [0-9] instead
  • Regular expressions are case sensitive
  • A dot also matches a newline
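The workaround for the missing lazy quantifier is a negated character class. The sketch below uses Python's re module purely to illustrate the pattern technique; Python's engine is not the one XBMC uses, and the sample HTML is made up. re.DOTALL is set so that the dot matches a newline, as it does in the scraper engine.

```python
import re

html = '<b>first</b> and <b>second</b>'

# Lazy quantifier (NOT available to scrapers): stops at the first closing tag.
lazy = re.findall(r'<b>(.*?)</b>', html, re.DOTALL)

# Plain greedy dot: runs to the last closing tag and swallows everything in between.
greedy = re.findall(r'<b>(.*)</b>', html, re.DOTALL)

# Negated character class: greedy, but cannot run past the next "<".
negated = re.findall(r'<b>([^<]*)</b>', html, re.DOTALL)

# [0-9] used in place of \d.
sku = re.search(r'SKU: ([0-9]+)', 'SKU: 12345<br>').group(1)

print(lazy, greedy, negated, sku)
```

The negated class gives the same matches as the lazy version here because the captured text never contains a "<" character; if it can, a different delimiter has to be chosen.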

Tags in Detail

<scraper>

<xml><scraper name="jaded video" content="movies" thumb="jaded.jpg"></scraper></xml>

The sole purpose of this tag is to define the name, content type and thumbnail of the scraper.

  • name: The name of the scraper
  • content: movies/tvshows/mvid
  • thumb: relative path to a scraper thumbnail

<RegExp> and <expression>

<xml>
<RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
   <expression repeat="yes" noclean="1" trim="1" clear="no">([a-zA-Z][^,]*)</expression>
</RegExp>
</xml>

<expression>

The <expression> tag holds the regular expression, for example: <xml>([a-zA-Z][^,]*)whatever (.[^-]*)</xml> The text matched by each capture group is stored in \1, \2, ..., \9 and can be used in the output attribute of the enclosing RegExp tag.

  • repeat="yes/no": If set to yes, the regular expression is applied repeatedly and output is produced for every match (as in the GetSearchResults example, which emits one entry per hit).
  • noclean="1,..,9": By default, HTML tags and special characters are stripped from the matches \1, ..., \9. Listing a match number here disables this behavior for that match.
  • trim="1,..,9": Trim white space from the end of the listed matches.
  • clear="yes/no": If set to yes, dest is cleared when the expression fails to match.

<RegExp>

The RegExp tag defines the input that is searched by the regular expression, the format of the output, and the variable in which the output is stored.

  • input="$$x" where x=1 to 9: The variable that holds the text to be searched by the regular expression.
  • output: Defines what the output should look like. Here you can use \1, ..., \9, which represent the regular expression matches.
  • dest="x" where x=1 to 9: The variable in which the output is stored. If clear is set to yes on the expression, its previous content is cleared when the expression fails.
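How input, output, dest, repeat and nesting interact can be easier to follow in a small model. The Python sketch below is an illustration only and is not XBMC's implementation: the dict layout, the function name run_regexp and the sample data are all assumptions, noclean and trim are not modelled, and sibling RegExp nodes are simply evaluated in the order listed. What it takes from this page is that nested RegExp nodes run before their parent, that an empty expression puts the whole input into \1, and that clear empties dest on failure.

```python
import re

def run_regexp(node, buffers):
    """Evaluate one RegExp node (a plain dict) against the numbered buffers 1..9."""
    # Nested RegExp nodes are evaluated before the node that contains them.
    for child in node.get("children", []):
        run_regexp(child, buffers)

    text = buffers.get(node["input"], "")
    expression = node.get("expression", "")
    if expression:
        pattern = re.compile(expression, re.DOTALL)  # a dot also matches a newline
        if node.get("repeat", False):
            matches = list(pattern.finditer(text))
        else:
            first = pattern.search(text)
            matches = [first] if first else []
        groups = [m.groups(default="") for m in matches]
    else:
        # Empty expression: the entire input becomes \1.
        groups = [(text,)]

    def render(values):
        # Replace \1 .. \9 in the output template with the captured values.
        def fill(ref):
            index = int(ref.group(1)) - 1
            return values[index] if index < len(values) else ""
        return re.sub(r"\\([1-9])", fill, node["output"])

    if groups:
        buffers[node["dest"]] = "".join(render(values) for values in groups)
    elif node.get("clear", False):
        buffers[node["dest"]] = ""
    return buffers

# Hypothetical search page content in buffer 1.
buffers = {1: "Alpha SKU: 11<br>Beta SKU: 22<br>"}
inner = {
    "input": 1, "dest": 5, "repeat": True,
    "expression": r"([A-Za-z]+) SKU: ([0-9]+)<br>",
    "output": r"<entity><title>\1</title><url>http://example.com/item?id=\2</url></entity>",
}
outer = {"input": 5, "dest": 8, "expression": "", "output": r"<results>\1</results>",
         "children": [inner]}
run_regexp(outer, buffers)
print(buffers[8])
```

The inner node fills buffer 5 with one entry per hit, and the outer node then wraps the content of buffer 5 and stores it in buffer 8, which mirrors the structure of the GetSearchResults example.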

<CreateSearchUrl>

<xml>
<CreateSearchUrl dest="3">
   <RegExp input="$$1" output="http://akas.imdb.com/find?s=tt;q=\1" dest="3">
      <expression noclean="1"></expression>
   </RegExp>
</CreateSearchUrl>
</xml>

Inputs:

  • $$1: Variable 1 holds the search string. This is usually the file name with certain words stripped, e.g. DVDRip, Xvid.

The purpose of this function is to create a variable that holds the URL of the search result page.

  • dest="x" where x=1 to 9: Variable x holds the URL of the search result page.

In the example above, <CreateSearchUrl dest="3"> means that variable 3 holds the URL of the search results page. The RegExp tag searches through variable 1 and stores its output, the search URL containing the search string, in variable 3. Since no expression is specified, the entire content of $$1 is matched and stored in \1. Since the search string does not contain HTML tags, it is not necessary to clean \1, hence noclean="1".
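The substitution performed by the example can be sketched in a few lines of Python. This is an illustration under assumptions: the function name create_search_url is hypothetical, and the sketch does not model any URL encoding or file name cleaning that XBMC may perform.

```python
import re

# Output template taken from the CreateSearchUrl example above.
SEARCH_URL_TEMPLATE = r"http://akas.imdb.com/find?s=tt;q=\1"

def create_search_url(search_string, template=SEARCH_URL_TEMPLATE):
    # An empty <expression> matches all of $$1, so \1 is the whole search string.
    match = re.fullmatch(r"(.*)", search_string, re.DOTALL)
    return template.replace(r"\1", match.group(1))

print(create_search_url("The Matrix"))
```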

<GetSearchResults>

<xml>
<GetSearchResults dest="8">
   <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url&gt;http://jadedvideo.com/yz_resultJAVA.asp?PRODUCT_ID=\2&lt;/url&gt;&lt;/entity&gt;" dest="5">
         <expression repeat="yes">&lt;font color=&quot;#000FF&quot;&gt;(.[^&quot;]*)&lt;/big&gt;.[^&quot;]*SKU: ([0-9]+)&lt;br&gt;</expression>
      </RegExp>
      <expression noclean="1"></expression>
   </RegExp>
</GetSearchResults>
</xml>

Inputs:

  • $$1: Variable 1 holds the content of the page at the search URL returned by CreateSearchUrl.

It is the task of this function to return a list of search results in the following format:
<xml>
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>

  <movie>
     <title></title>
     <url></url>
     <url></url>
  </movie>
  <movie>
     <title></title>
     <url></url>
     <url></url>
  </movie>

</results></xml>

There can be up to 9 <url> tags within each movie tag. The web pages specified in the url tags should hold detailed information about the movie/tv series/mvid. In most cases one URL is enough. The url tag can take the attribute function="nameoffunction", e.g. <xml><url function="CustomFunction">theurl</url></xml> In this case, in addition to GetDetails, the function "CustomFunction" is executed after GetDetails has run.

When the scraper is used, XBMC asks the user to select one of the movies from the list returned by this function. Only the selected entry is processed further.
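When testing a scraper's regular expressions offline, it can help to generate a document in the result format and parse it back. The sketch below uses Python's xml.etree.ElementTree; the function name build_results and the sample titles and URLs are hypothetical. It uses the movie tag from the format listing, while the jadedVideo example above emits entity, so the tag name is left as a parameter. The XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 9  # up to 9 url tags per entry

def build_results(hits, item_tag="movie"):
    """hits is a list of (title, [url, ...]) pairs; returns the results XML as a string."""
    results = ET.Element("results")
    for title, urls in hits:
        item = ET.SubElement(results, item_tag)
        ET.SubElement(item, "title").text = title
        for url in urls[:MAX_URLS]:
            ET.SubElement(item, "url").text = url
    return ET.tostring(results, encoding="unicode")

# Made-up search hits for illustration.
xml_text = build_results([
    ("Example One", ["http://example.com/1"]),
    ("Example Two", ["http://example.com/2a", "http://example.com/2b"]),
])
print(xml_text)

# Parsing it back yields the list a user would choose from.
titles = [item.findtext("title") for item in ET.fromstring(xml_text)]
```

Building the document with ElementTree also escapes characters such as & in titles and URLs, which is easy to forget when writing the output attribute by hand.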

<GetDetails>

This function retrieves the actual data that is saved in the database. For movies, the following format should be returned:

<xml>
<details>
   <title></title>
   <year></year>
   <director></director>
   <top250></top250>
   <mpaa></mpaa>
   <tagline></tagline>
   <runtime></runtime>
   <thumb></thumb>
   <credits></credits>
   <rating></rating>
   <votes></votes>
   <genre></genre>
   <actor>
      <name></name>
      <role></role>
   </actor>
   <outline></outline>
   <plot></plot>
</details>
</xml>

Inputs:

  • $$x where x=1 to 9: Variable x holds the content of the URL specified in <url> tag number x in GetSearchResults.

For example, if GetSearchResults returns two <url> tags (without the function attribute), variable 1 contains the content of the first URL and variable 2 the content of the second.

<CustomFunction>

This function should return XML similar to that of GetDetails. For tags that can have multiple values (genre etc.), the values from CustomFunction and GetDetails are combined. For all other tags, the value from the last CustomFunction takes precedence.

Inputs:

  • $$1: Variable 1 holds the content of the URL specified in the url tag with the function="CustomFunction" argument.

For example, for <xml><url function="CustomFunction">http://www.imdb.com/title/tt0452624/</url></xml> variable 1 would hold the content of "http://www.imdb.com/title/tt0452624/".
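The merge rule described for CustomFunction can be sketched as follows. This is a simplified Python model, not XBMC's code: the function names are hypothetical, only genre is treated as multi-valued because it is the only tag this page names, and nested tags such as actor are reduced to their direct text.

```python
import xml.etree.ElementTree as ET

def parse_details(xml_text):
    """Return (tag, text) pairs for the direct children of <details>."""
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip()) for child in root]

def merge_details(get_details_xml, custom_xmls, multi_valued=("genre",)):
    """Combine GetDetails output with the output of zero or more custom functions."""
    merged = {}
    for xml_text in [get_details_xml, *custom_xmls]:
        for tag, text in parse_details(xml_text):
            if tag in multi_valued:
                # Multi-valued tags: values from all functions are combined.
                merged.setdefault(tag, []).append(text)
            else:
                # Single-valued tags: the function that ran last wins.
                merged[tag] = text
    return merged

# Made-up outputs for illustration.
base = "<details><title>Example</title><genre>Drama</genre><rating>7.0</rating></details>"
custom = "<details><genre>Comedy</genre><rating>7.5</rating></details>"
merged = merge_details(base, [custom])
print(merged)
```

Here the title comes from GetDetails alone, both genres are kept, and the rating from the custom function replaces the one from GetDetails.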