Linux cli command HTML_LinkExtractorpm

NAME

HTML::LinkExtractor - Extract links from an HTML document

DESCRIPTION

HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.

Example ( please run the examples ):

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = HTML::LinkExtractor->new();

    $LX->parse(\$input);

    print Dumper($LX->links);
    __END__
    # the above example will yield
    $VAR1 = [
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                'tag' => 'a'
              }
            ];

HTML::LinkExtractor will also correctly extract nested link-type tags.

SYNOPSIS

    ## the demo
    perl LinkExtractor.pm
    perl LinkExtractor.pm file.html otherfile.html

    ## or if the module is installed, but you don't know where
    ## (first form for Windows shells, second for Unix shells)
    perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
    perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

    ## or
    use HTML::LinkExtractor;
    use LWP::Simple qw( get );

    my $base = 'http://search.cpan.org';
    my $html = get($base.'/recent');
    my $LX = HTML::LinkExtractor->new();

    $LX->parse(\$html);

    print qq{<base href="$base">\n};

    for my $Link( @{ $LX->links } ) {
        ## new modules are linked by /author/NAME/Dist
        if( $$Link{href} =~ m{^/author/\w+} ) {
            print $$Link{_TEXT}."\n";
        }
    }

    undef $LX;
    __END__

    ## or
    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = HTML::LinkExtractor->new(
        sub {
            print Data::Dumper::Dumper(@_);
        },
        'http://perlFox.org/',
    );

    $LX->parse(\$input);
    $LX->strip(1);
    $LX->parse(\$input);
    __END__

    #### Calculate the total size of a web page
    #### adds up the sizes of all the images and stylesheets and stuff
    use strict;
    use LWP::Simple qw( get head );
    use HTML::LinkExtractor;

    my $url   = shift || 'http://www.google.com';
    my $html  = get($url);
    my $Total = length $html;

    print "initial size $Total\n";

    my $LX = HTML::LinkExtractor->new(
        sub {
            my( $X, $tag ) = @_;

            ## only count tags that are not in @TAGS_IN_NEED
            unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                print "$$tag{tag}\n";

                for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{$$tag{tag}} } ) {
                    if( exists $$tag{$urlAttr} ) {
                        ## head() in list context returns the document length second
                        my $size = (head( $$tag{$urlAttr} ))[1];
                        $Total += $size if $size;
                        print "adding $size\n" if $size;
                    }
                }
            }
        },
        $url,
        0
    );

    $LX->parse(\$html);

    print "The total size of $url is $Total bytes\n";
    __END__

METHODS

“$LX->new([\&callback, [$baseUrl, [1]]])”

Accepts 3 arguments, all of which are optional. If for example you want to pass a $baseUrl, but don’t want to have a callback invoked, just put undef in place of a subref.

This is the only class method.

  1. A callback (a sub reference, as in sub{} or \&sub) which is called each time a new link is encountered (for the tags in @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered). The callback receives an object reference ($LX) and a link hashref.

  2. A base URL, which is used to convert all relative URIs to absolute ones. It is passed to URI->new, so it’s up to you to make sure it’s valid.

         $ALinkP{href} = URI->new_abs( $ALink{href}, $base );

  3. A boolean (just stick with 1). See the example in DESCRIPTION. Normally, you’d get back _TEXT that looks like

         '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',

     If you turn this option on, you’ll get the following instead

         '_TEXT' => ' I am a LINK!!! ',

     The private utility function _stripHTML does this by using the get_trimmed_text method of HTML::TokeParser. You can turn this feature on and off by using $LX->strip(undef || 0 || 1)
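The three positional arguments described above can be sketched as follows. This is a minimal illustration, not the author's own example: the input HTML and the base URL http://example.com/site/ are made up, and undef stands in for the callback so that the links are collected instead.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input: a single relative link.
my $html = q{<p>Read <a href="docs/intro.html"> the intro </a> first.</p>};

# undef = no callback, an assumed base URL, 1 = strip markup from _TEXT.
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/site/', 1 );
$LX->parse( \$html );

for my $link ( @{ $LX->links } ) {
    # href is a URI object; it stringifies to the absolute URL.
    print "$link->{tag}: $link->{href} ($link->{_TEXT})\n";
}
```

Passing undef first keeps the argument positions intact, which is why the base URL still lands in the second slot.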

“$LX->parse( $filename || *FILEHANDLE || \$FileContent )”

Each time you call parse, you should pass it a $filename, a *FILEHANDLE, or a \$FileContent.

Each time you call parse a new HTML::TokeParser object is created and stored in $this->{_tp}.

You shouldn’t need to mess with the TokeParser object.

“$LX->links()”

Only after you call parse will this method return anything. This method returns a reference to an array of hashes, which basically looks like (Data::Dumper output)

    $VAR1 = [ { tag => 'img', src => 'image.png' }, ];

Please note that if you provide a callback this array will be empty.
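The difference between collecting links and receiving them through a callback can be sketched like this. The input HTML and the variable names are illustrative assumptions; the only calls used are new, parse and links as described in this page.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input with one text-carrying tag and one plain tag.
my $html = q{<a href="http://perl.com/">Perl</a> <img src="camel.png">};

# Without a callback, parse() fills the list returned by links().
my $LX = HTML::LinkExtractor->new();
$LX->parse( \$html );
my @tags = map { $_->{tag} } @{ $LX->links };
print "collected: @tags\n";

# With a callback, each link hashref is handed over as it is found.
my @seen;
my $LX2 = HTML::LinkExtractor->new(
    sub {
        my ( $lx, $link ) = @_;
        push @seen, $link->{tag};
    }
);
$LX2->parse( \$html );
print "callback saw: @seen\n";
print "links() holds ", scalar @{ $LX2->links || [] }, " entries\n";
```

If you need both behaviours, one option is to push the hashref onto your own array inside the callback, as @seen does here for the tag names.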

“$LX->strip( [ 0 || 1 ])”

If you pass in undef (or nothing), returns the state of the option. Passing in a true or false value sets the option.

If you wanna know what the option does see $LX->new([\&callback, [$baseUrl, [1]]])
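As a rough sketch of the getter and setter behaviour, the snippet below parses the same input twice, once before and once after turning the option on. It reuses the input from DESCRIPTION; reading the last element of the links array is a defensive choice of this sketch, not something the module requires.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX = HTML::LinkExtractor->new();

# No argument: report the current state of the option.
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";

$LX->parse( \$input );
my $raw = $LX->links->[-1]{_TEXT};

# A true value turns the option on for subsequent parses.
$LX->strip(1);
$LX->parse( \$input );
my $stripped = $LX->links->[-1]{_TEXT};

print "raw:      $raw\n";
print "stripped: $stripped\n";
```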

WHAT’S A LINK-type tag

Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.

Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URIs (the links!!)

Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the _TEXT attribute is provided, like <a href="#"> TEST </a>
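One way to take that look is to print the tables directly. The sketch below assumes, based on how SYNOPSIS dereferences it, that %TAGS maps each tag name to an array reference of URL-carrying attribute names, with no list for special cases such as meta.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Copy of the tag table: tag name => array ref of attribute names.
my %tags = %HTML::LinkExtractor::TAGS;

for my $tag ( sort keys %tags ) {
    my @attrs = @{ $tags{$tag} || [] };    # guard against entries without a list
    printf "%-12s %s\n", $tag, @attrs ? join( ', ', @attrs ) : '(special case)';
}

# Tags whose link hashref also carries _TEXT.
my %needs_text = map { $_ => 1 } @HTML::LinkExtractor::TAGS_IN_NEED;
print "a carries _TEXT\n" if $needs_text{a};
```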

How can that be?!?!

I took a look at %HTML::Tagset::linkElements and the following URLs

    http://www.blooberry.com/indexdot/html/tagindex/all.htm

    http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
    http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
    http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
    http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
    http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
    http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
    http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
    http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
    http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
    http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
    http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
    http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
    http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

And the special cases

    <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
    http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm

'!doctype' is really a process instruction, but is still listed in %TAGS with 'url' as the attribute

and

    <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
    http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm

If there is a valid url, 'url' is set as the attribute. The meta tag has no attributes listed in %TAGS.

SEE ALSO

HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.

AUTHOR

D.H (PodMaster)

Please use http://rt.cpan.org/ to report bugs.

Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.

LICENSE

Copyright (c) 2003, 2004 by D.H. (PodMaster). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.
