Scrapy Intro
============

Scrapy is a framework designed for retrieving information from websites. The
basic idea of Scrapy is a bot that goes through websites, fetching pages,
extracting links and items (which can be any kind of information). The
framework is formed by components that take care of different activities.
These components are *basically*:

* *Core*
* *Spiders*
* *Items*
* *Selectors*
* *Adaptors*

Core
----

Scrapy's core is the most important component (obviously) and the one in
charge of getting/receiving information and delivering it to your spiders
(among many other things). However, we won't talk about the core here, because
it doesn't really concern anyone who just wants to write spiders, so we'll
move on.

Spiders
-------

We'll start with the spiders, because they're the ones that actually use the
other components, and they are themselves used by Scrapy's core, so they must
be the first thing for you to know about.

Spiders are little programs, let's say, whose purpose is to scrape information
from HTML pages or other data sources. Having said that, their process is
basically:

1. Request information and receive it.
2. Find what you were looking for in the response you got.
3. Adapt it (or not).
4. Store it.

Of course, you won't have to deal with sending requests and receiving
responses yourself. Scrapy does that job for you, so you can concentrate on
the one thing that probably brought you here: the information.

In order to tell Scrapy how you want to handle your information, you must
define a Spider. This is done by writing a class that inherits from
BaseSpider. BaseSpider is a really basic class (as its name suggests), so it
doesn't give you much functionality for scraping HTML pages (which require you
to follow links and deal with many different layouts), but it may be useful in
case you're doing something totally different, or in case you want to write
your own way of crawling.

In other words, BaseSpider is the simplest spider. It has only these
attributes:

* domain_name: a string indicating the domain of the website you'd like to
  scrape.
* start_urls: a list containing the URLs you'd like to use as your entry
  points into the website.
* download_delay: an (optional) delay in seconds to wait before sending each
  request.

And these two methods:

* init_domain: called as soon as the spider is loaded; it can be useful for
  initializing stuff.
* parse: the most important part of the spider, and where every response
  arrives.

So by now you should have an idea of how spiders work, but just to make sure,
I'll show you an example::

    from scrapy import log # This module is very useful for printing debug information
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        domain_name = 'mywebsite.com'
        start_urls = [
            'http://www.mywebsite.com/1.html',
            'http://www.mywebsite.com/2.html',
            'http://www.mywebsite.com/3.html',
        ]

        def parse(self, response):
            log.msg('Hey! A response from %s has just arrived!' % response.url)
            return []

One **VERY** important thing that I didn't mention before is the fact that the
*parse* method must **always** return a list. Always! This list may contain
Requests and/or Items (we'll talk about them in a moment, don't worry).
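To make that a bit more concrete, here is a minimal sketch (not from the
original text) of a spider whose *parse* method returns a list of follow-up
Requests instead of an empty list. It assumes a Request class that takes a URL
and a callback, importable from scrapy.http as in early Scrapy releases; the
URLs and the parse_detail callback are invented for the example::

    from scrapy.http import Request  # assumed location of the Request class
    from scrapy.spider import BaseSpider

    class MyCrawlingSpider(BaseSpider):
        domain_name = 'mywebsite.com'
        start_urls = ['http://www.mywebsite.com/index.html']

        def parse(self, response):
            # Ask Scrapy to download two more (made-up) pages and deliver
            # their responses to parse_detail. Note that we return a list.
            return [
                Request('http://www.mywebsite.com/a.html', callback=self.parse_detail),
                Request('http://www.mywebsite.com/b.html', callback=self.parse_detail),
            ]

        def parse_detail(self, response):
            # Nothing extracted yet, but we still return a list (an empty one).
            return []

Later on, once you know about Items and Selectors, those same lists can mix
Requests with scraped Items.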
Now, anybody can fetch web pages, you'll think; and it's true, you won't get
too far by just dealing with that. And that's when the selectors appear...

Selectors
---------

Selectors are *the* way you have to extract information from documents. They
retrieve information from the response's body, given an XPath or a regular
expression that you provide. Currently there are two kinds of selectors,
HtmlXPathSelectors and XmlXPathSelectors. Both work in the same way; they are
first instantiated with a response, for example::

    hxs = HtmlXPathSelector(response)

Now, before going on with selectors, I must tell you about a pretty cool
feature that Scrapy has, and which you'll surely find very useful whenever
you're writing spiders. This feature is the Scrapy shell, and you can use it
by calling your project manager with the 'shell' argument; something like::

    [user@host ~/myproject]$ ./scrapy-manager.py shell

Notice that you'll have to install IPython in order to use this feature, but
believe me, it's worth it; the shell is **very** useful.

With the shell you can simulate parsing a web page, either by calling
"scrapy-manager shell" with a URL as an additional parameter, or by using the
shell's 'get' command, which tries to retrieve the given URL and fills in the
'response' variable with the result.

Ok, so now let's use the shell to show you a bit of how selectors work. We'll
use an example page located on Scrapy's site
(http://docs.scrapy.org/examples/sample1.htm), whose markup is::