Archive:New Video Scanner Ideas

From Official Kodi Wiki

The source of this new concept

Many of these ideas were first discussed in these five threads in the Kodi feature suggestions forum:

Goals of the new scanner

  • 1. Automate the task of getting video files into the library as much as possible.
  • 2. Be as user-friendly and as automated as possible.
  • 3. Never leave anything out of the library - anything that's there will go in, whether we can find metadata for it or not.
  • 4. Prioritise the user's data over scraped data.
  • 5. Reliably detect when files have moved (to save re-scraping) and when files have been added or removed, and update accordingly.
  • 6. Never overwrite the user's own data (either in nfo files or things such as playcounts or watched status).

Implementation ideas

The scanner starts with a path, i.e. a single URL to a local filesystem. Given this, the implementation should be as follows (a sketch follows the list):

For each path:

  • 1. Check whether it needs scanning (see below).
  • 2. If so, enumerate all video files and all folders.
  • 3. For each video file in the folder, check to see whether it's in the library, and if not, obtain metadata (see below).
  • 4. For each subfolder in the folder, recurse in and follow this same routine.
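
A minimal sketch of this recursion in Python, assuming a hypothetical library object that exposes path_needs_scanning(), contains() and add() as stand-ins for the database layer, and an obtain_metadata callable implementing the steps described in the "Obtaining metadata" section below:

  import os

  VIDEO_EXTENSIONS = {".mkv", ".mp4", ".avi", ".m4v", ".mov"}  # illustrative subset

  def scan_path(path, library, obtain_metadata):
      """Recursively scan 'path', adding any unknown video files to the library.

      'library' and 'obtain_metadata' are hypothetical stand-ins for the
      database layer and the metadata routine described further below.
      """
      # 1. Skip the folder entirely if its stored hash says nothing has changed.
      if not library.path_needs_scanning(path):
          return

      # 2. Enumerate everything in the folder.
      for name in sorted(os.listdir(path)):
          full = os.path.join(path, name)
          if os.path.isdir(full):
              # 4. Recurse into subfolders and follow the same routine.
              scan_path(full, library, obtain_metadata)
          elif os.path.splitext(name)[1].lower() in VIDEO_EXTENSIONS:
              # 3. Only fetch metadata for video files not already in the library.
              if not library.contains(full):
                  library.add(full, obtain_metadata(full))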

Checking whether a path needs scanning

The database will contain all the paths that have been scanned in the past, and we can use a hashing technique to detect whether a path needs rescanning. The simplest and quickest method would be to do a stat() on the path and check the modified time. Hopefully all filesystems (on all OSes) will update the modified time of a folder if any of its contents change (file rename, addition, modification, removal, etc.). We then store the path and its hash in the database and compare against that on subsequent scans. An alternative would be to fetch the directory listing and hash its contents, but this should be avoided if at all possible (smb in particular is slow at directory fetching when a folder contains many files, primarily due to the need to stat() each file).
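
As a rough illustration, the stat()-based check might look like the following sketch, where db is a hypothetical mapping of path to its last stored hash and the directory-listing fallback is only used where the folder's modified time cannot be trusted:

  import hashlib
  import os

  def folder_hash(path, use_listing_fallback=False):
      """Return a hash that changes whenever the folder's contents change.

      The cheap method hashes only the folder's stat() modified time; the
      fallback hashes the directory listing itself, which is much slower
      on e.g. smb shares containing many files.
      """
      if not use_listing_fallback:
          token = "%s:%d" % (path, int(os.stat(path).st_mtime))
      else:
          token = path + "|" + "|".join(sorted(os.listdir(path)))
      return hashlib.md5(token.encode("utf-8")).hexdigest()

  def path_needs_scanning(path, db):
      """'db' maps path -> hash stored after the previous scan."""
      current = folder_hash(path)
      if db.get(path) == current:
          return False          # unchanged since the last scan
      db[path] = current        # remember the new hash for next time
      return True

In practice the new hash would presumably only be committed once the folder has actually been rescanned successfully, so that an interrupted scan is retried next time.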

Checking to see whether an item is in the library

This could be improved by not only checking whether the path to the file exists in the library, but also by hashing the file using some reasonable scheme and storing that hash in the library. This would enable us to detect whether a file has moved, or whether it is simply a duplicate instance of an existing file. A suitable hashing algorithm may be the one used to identify "scene releases" for automatic subtitle downloading.
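
The subtitle-download hash referred to here is, in practice, the OpenSubtitles-style hash: the file size plus a 64-bit sum of the first and last 64 KiB of the file. A sketch, assuming the file is at least 128 KiB (smaller files are not valid input for this scheme):

  import os
  import struct

  CHUNK = 64 * 1024  # 64 KiB read from each end of the file

  def opensubtitles_hash(path):
      """64-bit hash of file size + first and last 64 KiB of the file,
      as used to identify releases for subtitle matching.
      Assumes the file is at least 128 KiB long."""
      filesize = os.path.getsize(path)
      h = filesize
      with open(path, "rb") as f:
          for offset in (0, filesize - CHUNK):
              f.seek(offset)
              # Sum the chunk as little-endian unsigned 64-bit integers,
              # keeping only the low 64 bits of the running total.
              for (value,) in struct.iter_unpack("<Q", f.read(CHUNK)):
                  h = (h + value) & 0xFFFFFFFFFFFFFFFF
      return "%016x" % h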

Obtaining metadata

We start with the URL for a video file. This may refer to a stack, a rar archive or similar; in any case, it is treated as a single video file. From here, we must extract metadata.

For each file that's not already in the database (a sketch of this cascade follows the list):

  • 1. Check within the file for metadata (tags in mp4s or mkvs, resolution, audio and video information, duration and the like).
  • 2. Check for a .nfo file containing metadata, or containing URLs to scrape and the like.
  • 3. Generate the hash of the file and see whether this matches known online sources.
  • 4. Scrape the filename for identifying information (e.g. identify TV episodes by running regexps on the filename, identify DVD folders and so on).
  • 5. Use the information from steps 3 and 4 to classify the file and, if possible, do an online search (or searches) for more information.
  • 6. If the searches in step 5 were not particularly successful (i.e. the matches were not close to the data from the filename), and this is the only video file in this path, then repeat from step 4 with the parent folder name included in the filename-scraping procedure.
  • 7. Take the best scrape thus far and check whether it's a likely match. If not, discard the information.
  • 8. Insert all the information into the database. At minimum, the filename could be added as the title.
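
A condensed sketch of this fall-through order, with a hypothetical regular expression standing in for the TV-episode detection in step 4; read_embedded_tags, read_nfo, lookup_hash_online and search_online are placeholder callables, not existing Kodi interfaces:

  import os
  import re

  # Hypothetical pattern for "Show.Name.S01E02.*"-style episode filenames.
  TV_EPISODE_RE = re.compile(
      r"(?P<show>.+?)[. _-]+S(?P<season>\d{1,2})E(?P<episode>\d{1,2})",
      re.IGNORECASE,
  )

  def scrape_filename(name):
      """Step 4: pull identifying information out of the filename alone."""
      match = TV_EPISODE_RE.search(name)
      if match:
          return {"type": "episode",
                  "show": match.group("show").replace(".", " ").strip(),
                  "season": int(match.group("season")),
                  "episode": int(match.group("episode"))}
      return {"type": "unknown", "title": os.path.splitext(name)[0]}

  def obtain_metadata(path, read_embedded_tags, read_nfo,
                      lookup_hash_online, search_online):
      """Fall through the sources in the order listed above; the four
      callables are hypothetical stand-ins for the real readers/scrapers."""
      info = {}
      info.update(read_embedded_tags(path) or {})      # 1. tags inside the container
      info.update(read_nfo(path) or {})                # 2. sidecar .nfo data / scraper URLs
      info.update(lookup_hash_online(path) or {})      # 3. match the file hash online
      guess = scrape_filename(os.path.basename(path))  # 4. regexp the filename
      result = search_online(guess)                    # 5. online search using that guess
      if not result:
          # 6. No good match: retry with the parent folder name folded in
          #    (the list above restricts this to folders holding a single video).
          parent = os.path.basename(os.path.dirname(path))
          result = search_online(scrape_filename(parent + " " + os.path.basename(path)))
      if result:
          # 7. search_online is assumed to return None for unlikely matches.
          info.update(result)
      # 8. At minimum, fall back to the filename as the title.
      info.setdefault("title", guess.get("title") or os.path.splitext(os.path.basename(path))[0])
      return info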

Hopefully, the checks above should be enough to identify both what type of content the file contains and what the file is. It may well be that we require additional hints from the user (e.g. specifying the content type at the root folder level), but we should try to avoid this if possible. It is better to err on the side of not fetching any metadata than to misclassify data.

XBMC should allow full editing of database information in order to fix any misclassifications that may occur.