urifind - find URIs in a document and dump them to STDOUT.
$ urifind file
urifind is a simple script that finds URIs in one or more files (using URI::Find) and outputs them to STDOUT. That's it.
To find all the URIs in file1, use:
$ urifind file1
To find the URIs in multiple files, simply list them as arguments:
$ urifind file1 file2 file3
urifind will read from STDIN if no files are given or if a filename of - is specified:
$ wget http://www.boston.com/ -O - | urifind
When multiple files are listed, urifind prefixes each found URI with the file from which it came:
$ urifind file1 file2
file1: http://www.boston.com/index.html
file2: http://use.perl.org/
This can be turned on for single files with the -p ("prefix") switch:
$ urifind -p file3
file3: http://fsck.com/rt/
It can also be turned off for multiple files with the -n ("no prefix") switch:
$ urifind -n file1 file2
http://www.boston.com/index.html
http://use.perl.org/
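The prefix rules described above can be sketched in a few lines of Python. This is only an illustration, not urifind's actual code: the function name `format_hits`, its parameters, and the `(filename, uri)` pair shape are hypothetical, and letting -n win when both switches are given is an assumption the documentation does not confirm.

```python
def format_hits(hits, num_files, p=False, n=False):
    """Render (filename, uri) pairs as output lines.

    hits      -- list of (filename, uri) pairs in the order found (hypothetical shape)
    num_files -- number of files named on the command line
    p, n      -- stand-ins for the -p and -n switches
    """
    # Default: prefix only when more than one file was listed; -p forces it on.
    prefix = p or num_files > 1
    # Assumption: -n overrides everything else.
    if n:
        prefix = False
    return [f"{name}: {uri}" if prefix else uri for name, uri in hits]


if __name__ == "__main__":
    hits = [
        ("file1", "http://www.boston.com/index.html"),
        ("file2", "http://use.perl.org/"),
    ]
    print("\n".join(format_hits(hits, num_files=2)))
    print("\n".join(format_hits(hits, num_files=2, n=True)))
```

Keying the default on the number of files given, rather than on the number of files that produced hits, matches the wording "when multiple files are listed".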
By default, URIs will be displayed in the order found; to sort them ascii-betically, use the -s ("sort") option. To reverse sort them, use the -r ("reverse") flag (-r implies -s).
$ urifind -s file1 file2
http://use.perl.org/
http://www.boston.com/index.html
mailto:webmaster@boston.com

$ urifind -r file1 file2
mailto:webmaster@boston.com
http://www.boston.com/index.html
http://use.perl.org/
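As a rough model of the ordering options, the Python sketch below sorts plain strings by code point, which for ASCII-only URIs is the same as ASCII order. The function `order_uris` and its parameters are hypothetical names used for illustration; how urifind itself sorts prefixed lines is not covered here.

```python
def order_uris(uris, s=False, r=False):
    """Return URIs in found order, sorted, or reverse sorted.

    s, r -- stand-ins for the -s and -r switches; r implies s.
    """
    if r:
        return sorted(uris, reverse=True)
    if s:
        return sorted(uris)
    # No switch given: keep the order in which the URIs were found.
    return list(uris)


if __name__ == "__main__":
    found = [
        "http://www.boston.com/index.html",
        "mailto:webmaster@boston.com",
        "http://use.perl.org/",
    ]
    print(order_uris(found, s=True))
    print(order_uris(found, r=True))
```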
Finally, urifind supports limiting the returned URIs by scheme or by arbitrary pattern, using the -S option (for schemes) and the -P option (for patterns). Both -S and -P can be specified multiple times:
$ urifind -S mailto file1
mailto:webmaster@boston.com

$ urifind -S mailto -S http file1
mailto:webmaster@boston.com
http://www.boston.com/index.html
-P takes an arbitrary Perl regex. It might need to be protected from the shell:
$ urifind -P 's?html?' file1
http://www.boston.com/index.html

$ urifind -P '\.org\b' -S http file4
http://www.gnu.org/software/wget/wget.html
Add a -d to have urifind dump the regexen generated from -S and -P to STDERR. -D does the same but exits immediately:
$ urifind -P '\.org\b' -S http -D
$scheme = '^(\bhttp\b):'
@pats = ('^(\bhttp\b):', '\.org\b')
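The dump above suggests that the -S schemes are compiled into a single anchored regex that is then checked alongside the -P patterns. A minimal Python sketch of that idea follows. It uses Python's `re` module as a stand-in for Perl regexes, and it rests on two assumptions the documentation does not confirm: several schemes are joined with `|` inside one group, and a URI must match every pattern to be kept. The names `build_patterns` and `keep` are hypothetical.

```python
import re


def build_patterns(schemes=(), patterns=()):
    """Compile scheme and pattern filters into a list of regexes."""
    pats = []
    if schemes:
        # Mirrors the dumped form '^(\bhttp\b):'.
        # Assumption: multiple schemes become alternatives in one group.
        alternatives = "|".join(rf"\b{re.escape(s)}\b" for s in schemes)
        pats.append(rf"^({alternatives}):")
    pats.extend(patterns)
    return [re.compile(p) for p in pats]


def keep(uri, compiled):
    # Assumption: a URI is kept only if it matches all patterns.
    return all(p.search(uri) for p in compiled)


if __name__ == "__main__":
    compiled = build_patterns(schemes=["http"], patterns=[r"\.org\b"])
    print([p.pattern for p in compiled])
    for uri in ("http://www.gnu.org/software/wget/wget.html",
                "mailto:webmaster@boston.com"):
        print(uri, keep(uri, compiled))
```

Perl and Python regex syntax differ in places, so a pattern written for -P may need adjusting before it behaves the same way in this sketch.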
To remove duplicates from the results, use the -u ("unique") switch.
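One way to picture the -u switch is an order-preserving de-duplication, sketched below in Python. Keeping the first occurrence of each URI is an assumption: the documentation does not say which duplicate survives or how -u interacts with sorting. The function name `unique_uris` is hypothetical.

```python
def unique_uris(uris):
    """Drop repeated URIs, keeping the first occurrence of each."""
    # dict preserves insertion order, so the found order is retained.
    return list(dict.fromkeys(uris))


if __name__ == "__main__":
    found = [
        "http://use.perl.org/",
        "http://www.boston.com/index.html",
        "http://use.perl.org/",
    ]
    print(unique_uris(found))
```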
-s  Sort results.
-r  Reverse sort results (implies -s).
-u  Return unique results only.
-n  Don't include filename in output.
-p  Include filename in output (0 by default, but 1 if multiple files are included on the command line).
-P $re  Print only lines matching regex '$re' (may be specified multiple times).
-S $scheme  Only this scheme (may be specified multiple times).
-h  Help summary.
-v  Display version and exit.
-d  Dump compiled regexes for -S and -P to STDERR.
-D  Same as -d, but exit after dumping.
darren chamberlain <darren@cpan.org>
(C) 2003 darren chamberlain
This library is free software; you may distribute it and/or modify it under the same terms as Perl itself.