Exclude argument and wildcard

Graham

I will be grateful for any help.

I am trying to scrape the show title from the text in the body of the webpage.  The Webgrab log has ...

scrub strings:
     type & arguments : single(exclude="<a href=[*]>" debug.4)
     blockstart   (bs): <header>
     elementstart (es): <h1 itemprop="name">
     elementend   (ee): </a>
     blockend     (be): </h1>

Separated html block(s), number of blocks = 1
----------begin--block----------

 
   <h1 itemprop="name"><a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----block----------

Separated Element(s) (es) applied
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----element----------

Separated Element(s) (es) and (ee) applied of block 0
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

Argument -exclude- , string value = "<a href=[*]>" debug.4

Separated Element(s) arguments include and exclude applied of block 0
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

Elements , type single applied
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

It appears to ignore the "exclude".  I have tried the "exclude" without the wildcard ... single(exclude="<a href=" debug.4) ... and the "exclude" is still ignored.

What do I need to do to get "exclude" to work?

Could you please show me an example of a wildcard in the "exclude"?

Many thanks.

Graham

Graham

Never mind.

I am getting the result that I need with ... 

title.scrub {single|<header>|<h1 itemprop="name">|</a>|</h1>}

and

title.modify {remove(type=regex)|"(<.*>)"}
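
To see why these two lines give a clean title, here is a rough Python imitation of the two steps, run on the block from the log above. It is only a sketch and not WebGrab+Plus code: `separate` is a made-up helper that mimics the elementstart/elementend step, and Python's `re` module is assumed to treat this pattern the same way the program's regex engine does.

```python
import re

# Block text copied from the debug log above.
block = ('<h1 itemprop="name"><a href="http://www.webgrabplus.com/programme/ypp/'
         'sons-of-anarchy">Sons of Anarchy</a></h1>')


def separate(text, element_start, element_end):
    """Hypothetical stand-in for the es/ee step: keep the text between the markers."""
    start = text.index(element_start) + len(element_start)
    end = text.index(element_end, start)
    return text[start:end]


# Same markers as in the title.scrub line: es = <h1 itemprop="name">, ee = </a>
element = separate(block, '<h1 itemprop="name">', '</a>')

# Same pattern as in the title.modify line. Because ee already cut off the
# closing </a>, the element holds a single tag, so the greedy match stops
# at the only '>' and the title text survives.
title = re.sub(r'(<.*>)', '', element)
print(title)
```

The result depends on the closing </a> having been cut off by elementend, so only one tag is left for the regex to remove.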

Thanks

Graham

francis

FYI:

The usual regex for removing HTML tags is

title.modify {remove(type=regex)|"(<[^>]*>)"}

This is a little bit safer, because your regex is greedy: on a complete element it would also remove everything in <a ....>the title</a>, including the title itself.

But maybe in your own case, this is not an issue.
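
The difference is easy to check outside WebGrab+Plus. The small Python sketch below applies both patterns to an element that still has its closing tag; the sample string is made up for the illustration, and Python's `re` is assumed to behave like the program's regex engine for these two patterns.

```python
import re

# Made-up element that keeps both the opening and the closing tag.
element = '<a href="sons-of-anarchy">Sons of Anarchy</a>'

# Greedy: '.*' runs from the first '<' to the last '>', swallowing the title.
greedy = re.sub(r'(<.*>)', '', element)

# Negated class: each match stops at the next '>', so only the tags go.
safer = re.sub(r'(<[^>]*>)', '', element)

print(repr(greedy))
print(repr(safer))
```

With the closing tag present, the first pattern leaves an empty string and the second leaves only the title text.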

Graham

Thanks for the regex.  I can see why yours is better than mine.

For anyone who stumbles upon this post while trying to use regex, I found a couple of helpful debugging sites at ...

http://www.regexr.com/
https://regex101.com/

I have been looking at this because I see a couple of issues with the stock radiotimes.com.ini.  

This morning, the stock radiotimes.com.ini ( * @Revision 9 - [03/12/2013] ) produced ...

    <title lang="en">Eddie Stobart: Trucks and Trailers 26 May 2015 Spike!??! Series 2 - Episode 8</title>
and
    <sub-title lang="en">. A Horse Walks into a Bar</sub-title>

The leading dot space in sub-title was discussed at
http://www.webgrabplus.com/comment/1627#comment-1627
but may not have found its way into the .ini.

My effort in the posts above was a workaround for the ugly values in <title> from the index page.

Thanks for your help.

 
