Spanner
Spanner is a WWW Spider which has been in occasional production since
February 1996. It was designed to be a general indexer and link
checker.
Please note: Spanner isn't actively being worked on/supported
anymore, sorry.
Well, I was going to wait to make the code a bit more streamlined
and modular, but ... I've been saying that I would do that for the
past few months, and I don't foresee myself having the time in the
near future, so ... The 1.0 version of Spanner is released. You can
grab it from the kluge.net FTP site under /NES/.
Current Features of Spanner
- Timeouts for DNS Lookups
- Timeouts for HTTP Connections
- Pause between hits to WWW server
- Support for Robots
Exclusion, including the Robots META tag
- User defined limits on URL retrieval and addition
- Can limit retrieval/addition based on IP
- Use of mime.types file to limit non-text based requests
- Returned data can be parsed by a user-supplied routine before the
actual parse function is used.
- Unreachable hosts are detected; no further requests are made to
such a host
- Request and link logging.
- Smart URL filtering: 'URL#name' is treated as plain URL, which is
indexed only if it hasn't been already. Relative paths (including '.'
and '..') are handled.
- META tags (such as keywords and description) are recognized.
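
As an illustration of the smart URL filtering listed above, here is a
minimal sketch in Python. It is not Spanner's own code; the function
names normalize_link and collect_new_links and the example.com URLs are
made up for this example. It resolves a link against the page it was
found on, drops the '#name' part, and only queues URLs that have not
been seen before.

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(base_url, href):
    """Resolve href against base_url and strip any '#name' fragment.

    Hypothetical helper, not part of Spanner. urljoin also collapses
    '.' and '..' path segments while resolving the relative path.
    """
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url


def collect_new_links(base_url, hrefs, seen):
    """Return the normalized links from hrefs that are not yet in seen.

    seen is a set of already-known URLs; it is updated in place so that
    the same document is never queued twice.
    """
    new_links = []
    for href in hrefs:
        url = normalize_link(base_url, href)
        if url not in seen:
            seen.add(url)
            new_links.append(url)
    return new_links


if __name__ == "__main__":
    seen = set()
    page = "http://example.com/a/b/page.html"
    for link in collect_new_links(page, ["../c/./d.html#top", "../c/d.html"], seen):
        print(link)
```

Both links in the usage example point at the same document once the
fragment and the '.'/'..' segments are resolved, so only one URL is
queued.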