Scrapy – CrawlSpider – parse start url

By default, CrawlSpider follows and parses the pages matched by its rules, but it does not run your parsing code on the start URL itself.

If you also want to parse the start URL, override the parse_start_url method.

def parse_start_url(self, response):
    # code for parsing the start URL goes here
    return []

  • Date: March 16, 2012
  • Author: Slawek Lukasiewicz
  • Comments: No Comments
  • Category: PHP

PHP 5.4: Built-in web server

Another handy new feature in PHP 5.4 is the built-in web server. Although I use nginx locally, this feature is really handy when we want to test something quickly with a different configuration.

We can start the built-in web server with a simple command:

php -S localhost:PORT [-t DOC_ROOT_DIR]

Optionally, we can specify the document root directory with the -t option; without it, the current working directory is served.
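
If the server needs to be started from a script (for example before a quick test run), the command above could be assembled roughly like this. It is only a sketch: the function names build_server_command and start_server are made up for illustration, and it assumes the php binary is on the PATH.

```python
import subprocess


def build_server_command(port, doc_root=None, host="localhost"):
    """Return the argument list for `php -S host:port [-t doc_root]`."""
    command = ["php", "-S", f"{host}:{port}"]
    if doc_root is not None:
        # -t sets the document root; omit it to serve the current directory.
        command += ["-t", doc_root]
    return command


def start_server(port, doc_root=None):
    # Launches the server as a child process (assumes `php` is on the PATH).
    # Call .terminate() on the returned object to stop it.
    return subprocess.Popen(build_server_command(port, doc_root))
```

For example, build_server_command(8000, "public") yields the arguments for php -S localhost:8000 -t public.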

  • Date: March 16, 2012
  • Author: Slawek Lukasiewicz
  • Comments: No Comments
  • Category: Python

scrapy: stopping spider

Sometimes we need to stop a running Scrapy spider. How? We can do this by raising the CloseSpider exception from a spider callback, which stops the spider.

Example from the documentation:

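from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')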