

NAME

grepurl - print links in HTML

SYNOPSIS

        grepurl [-bdv] [-e extension[,extension]] [-E extension[,extension]]
                [-h host[,host]] [-H host[,host]] [-p regex] [-P regex]
                [-s scheme[,scheme]] [-S scheme[,scheme]] [-u URL]

DESCRIPTION

The grepurl program searches the document at the URL given with the -u switch and prints the links that satisfy the given set of options. It applies the options roughly in the order of the URL parts they affect (scheme, host, path, extension).
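
The filtering order described above can be illustrated with a short Python sketch. This is not grepurl's own code (grepurl is a Perl program); the function name keep and its parameters are hypothetical, and only the select-style options (-s, -h, -p, -e) are modelled.

```python
import posixpath
import re
from urllib.parse import urlsplit


def keep(url, schemes=None, hosts=None, path_regex=None, extensions=None):
    """Return True if url passes every select-style filter that was given.

    Hypothetical helper: the checks run in URL order (scheme, host, path,
    extension), mirroring the order the description mentions.
    """
    parts = urlsplit(url)
    if schemes and parts.scheme not in schemes:
        return False
    if hosts and parts.hostname not in hosts:
        return False
    if path_regex and not re.search(path_regex, parts.path):
        return False
    if extensions:
        # Assumption: the extension is the text after the last dot in the path.
        ext = posixpath.splitext(parts.path)[1].lstrip(".")
        if ext not in extensions:
            return False
    return True


links = [
    "http://www.perl.com/images/camel.jpg",
    "ftp://www.perl.com/pub/camel.jpg",
    "http://www.example.com/images/camel.jpg",
    "http://www.perl.com/images/camel.png",
]
selected = [
    u for u in links
    if keep(u, schemes={"http"}, hosts={"www.perl.com"}, extensions={"jpg"})
]
print(selected)
```

Only the first link survives all three filters, roughly what combining -s http, -h www.perl.com, and -e jpg would be expected to select.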

So far, grepurl expects to search through HTML, although I want to add other content types, especially plain text, RSS feeds, and so on.
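
As a rough illustration of what extracting links from HTML involves, here is a minimal Python sketch built on the standard library's html.parser. It is not how grepurl is implemented; the class and function names are hypothetical, and treating href and src as the link-bearing attributes is an assumption.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Assumption: only href and src carry links worth reporting.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    """Return every link found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="http://www.perl.com/">Perl</a> <img src="/images/camel.jpg">'
found = extract_links(page)
print(found)
```

Relative links such as /images/camel.jpg come out exactly as written in the page; turning them into absolute URLs is a separate step.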

OPTIONS

-a

arrange (sort) links in ascending order

-A

arrange (sort) links in descending order

-b

turn relative URLs into absolute ones

-d

turn on debugging output

-e EXTENSION

select links with these extensions (comma separated)

-E EXTENSION

exclude links with these extensions (comma separated)

-h HOST

select links with these hosts (comma separated)

-H HOST

exclude links with these hosts (comma separated)

-p REGEX

select only paths that match this Perl regex

-P REGEX

exclude paths that match this Perl regex

-r REGEX

select only URLs that match this Perl regex (applies to entire URL)

-R REGEX

exclude URLs that match this Perl regex (applies to entire URL)

-s SCHEME

select only these schemes (comma separated)

-S SCHEME

exclude these schemes (comma separated)

-t FILE

extract URLs from plain text file (not implemented)

-u URL

extract URLs from URL (may be file://), expects HTML

-v

turn on verbose output

-1

print found URLs only once (print a unique list)
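
The output-shaping options (-b, -1, -a, -A) can be pictured as post-processing on the list of found links. The Python sketch below is an illustration only: the function postprocess is hypothetical, and both the order of the steps and the use of plain string comparison for sorting are assumptions, not documented grepurl behaviour.

```python
from urllib.parse import urljoin


def postprocess(links, base=None, unique=False, order=None):
    """Apply -b, -1, and -a/-A style steps to a list of links."""
    if base is not None:
        # Like -b: resolve relative links against the page's URL.
        links = [urljoin(base, link) for link in links]
    if unique:
        # Like -1: drop repeats, keeping the first occurrence.
        links = list(dict.fromkeys(links))
    if order == "ascending":
        # Like -a.
        links = sorted(links)
    elif order == "descending":
        # Like -A.
        links = sorted(links, reverse=True)
    return links


raw = ["/b.html", "a.html", "http://www.perl.com/", "/b.html"]
result = postprocess(
    raw, base="http://www.example.com/docs/", unique=True, order="ascending"
)
print(result)
```

Resolving before de-duplicating means two different relative spellings of the same target collapse into one entry.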

Examples

Print all the links
        grepurl -u http://www.example.com/
Print all the links, and resolve relative URLs
        grepurl -b -u http://www.example.com/
Print links with the extension .jpg
        grepurl -e jpg -u http://www.example.com/
Print links with the extension .jpg or .jpeg
        grepurl -e jpg,jpeg -u http://www.example.com/
Do not print links with the extension .cfm or .asp
        grepurl -E cfm,asp -u http://www.example.com/
Print only links to www.panix.com
        grepurl -h www.panix.com -u http://www.example.com/
Print only links to www.panix.com or www.perl.com
        grepurl -h www.panix.com,www.perl.com -u http://www.example.com/
Do not print links to www.microsoft.com
        grepurl -H www.microsoft.com -u http://www.example.com/
Print links with "perl" in the path
        grepurl -p perl -u http://www.example.com
Print links with "perl" or "pearl" in the path
        grepurl -p "pea?rl" -u http://www.example.com
Print links with "fred" or "barney" in the path
        grepurl -p "fred|barney" -u http://www.example.com
Do not print links with "SCO" in the path
        grepurl -P SCO -u http://www.example.com
Do not print links whose path matches "Micro.*"
        grepurl -P "Micro.*" -u http://www.example.com
Do not print links whose URL matches "Micro.*" anywhere
        grepurl -R "Micro.*" -u http://www.example.com
Print only web links
        grepurl -s http -u http://www.example.com/
Print ftp and gopher links
        grepurl -s ftp,gopher -u http://www.example.com/
Exclude ftp and gopher links
        grepurl -S ftp,gopher -u http://www.example.com/
Arrange the links in an ascending sort
        grepurl -a -u http://www.example.com/
Arrange the links in a descending sort
        grepurl -A -u http://www.example.com/
Arrange the links in a descending sort, and print unique URLs
        grepurl -A -1 -u http://www.example.com/
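
The difference between the path options (-p, -P) and the whole-URL options (-r, -R) is easiest to see side by side. The Python sketch below is illustrative only: the helper names are hypothetical, and Python's re module stands in for Perl regexes, which behave the same way for simple patterns like these.

```python
import re
from urllib.parse import urlsplit


def path_matches(regex, url):
    """Like -p: test the regex against the path component only."""
    return re.search(regex, urlsplit(url).path) is not None


def url_matches(regex, url):
    """Like -r: test the regex against the entire URL."""
    return re.search(regex, url) is not None


host_only = "http://www.perl.com/index.html"
in_path = "http://www.example.com/perl/index.html"

# "perl" appears only in the host of the first URL, so the path test fails.
print(path_matches("perl", host_only), url_matches("perl", host_only))
print(path_matches("perl", in_path), url_matches("perl", in_path))

# "pea?rl" makes the "a" optional, so it covers both spellings.
print(path_matches("pea?rl", "http://www.example.com/pearl/"))
```

The exclude forms (-P and -R) would drop a link exactly where these tests return True.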

TO DO

Operate over an entire directory or website

SEE ALSO

urifind by darren chamberlain <darren@cpan.org>

SOURCE AVAILABILITY

The source is on GitHub:

        https://github.com/briandfoy/grepurl

AUTHOR

brian d foy, <bdfoy@cpan.org>

COPYRIGHT

Copyright 2004-2014, brian d foy, All rights reserved.

You may use this program under the same terms as Perl itself.
