Need help scrubbing

Joined: 7 years
Last seen: 2 years
Hey! :)

I noticed both of the Japanese ini files aren't working anymore, so I set out to modify the one that seemed easiest to repair:
Since I have never worked with WebGrab+ before, I have zero experience with any of this, so I first read through section 4 of the manual. Still, almost nothing I try seems to work, and apart from the title and time (both done by Blackbear199 and working with minimal modification) I can't get it to scrub anything - sad :\

I hope somebody can help me get the basic information and I'll do my best to learn a thing or two along the way.
Since the language barrier is a thing and may keep people from tackling Japanese sites, I went and colored an example page to show what is needed.

The site and url block is working, the URL is still correct.
Example URL for the attached pictures:

  • Picture 1:
    The title is green, start and end time are yellow - Both are still working, no need to update!
    The categories are red. Each category is a <li>, divided into a main and a sub-category by "/ ". I don't think it will be necessary to separate subcategories, though.
  • Picture 2:
    If you click on each link you get to this page containing additional info.
    I colored in the title, time and categories again, but we already have them on the previous page.

    I don't even know if it is possible to grab the picture?
    Actors are purple. All people are generally shown on the right with little pictures below the blue header of "関連人物",
    only for movies there is also a little section on the left with the title "監督・出演" (Director, Actors).
    The director follows "監督:", then a full width space " ", and after "出演:" the actors, each separated by three or four spaces.
    Example:  監督:ジョージ・ミラー 出演:メル・ギブソン ブルース・スペンス ヴァーノン・ウェルズ
    The full description (dark blue) follows one of these headings: ≪番組内容≫, 番組内容, 詳細.
    If none of the above exists, the one in light blue always does: 概要 (short synopsis).
    Date (製作年) and Country (製作国) also appear in the text on the left below their titles.
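
To make the "監督・出演" layout described above concrete, here is a minimal Python sketch of the splitting logic. It is not WebGrab+ siteini syntax, only an illustration of the pattern. The function name `parse_credits` is made up for this example, and it assumes the whole credits text sits on one line, that the colon may be either ASCII or full-width, and that actors are separated by two or more whitespace characters.

```python
import re

def parse_credits(text):
    """Split a '監督:... 出演:...' line into (director, [actors])."""
    # Director: text after 監督 and a colon, up to 出演 or the end of the line.
    director_match = re.search(r"監督[::]\s*(.+?)\s*(?=出演[::]|$)", text)
    # Actors: everything after 出演 and a colon.
    actors_match = re.search(r"出演[::]\s*(.+)$", text)

    director = director_match.group(1).strip() if director_match else None
    actors = []
    if actors_match:
        # \s also matches the full-width space (U+3000) for Python str patterns.
        # Splitting on 2+ spaces is an assumption based on the post
        # ("separated by three or four spaces").
        actors = [a for a in re.split(r"\s{2,}", actors_match.group(1).strip()) if a]
    return director, actors

sample = "監督:ジョージ・ミラー\u3000出演:メル・ギブソン   ブルース・スペンス    ヴァーノン・ウェルズ"
director, actors = parse_credits(sample)
print(director)  # ジョージ・ミラー
print(actors)    # ['メル・ギブソン', 'ブルース・スペンス', 'ヴァーノン・ウェルズ']
```

If the site ever separates actors with a single space, the `\s{2,}` split would have to be loosened, at the cost of breaking names that contain a space themselves.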

Please only implement what you find easy to do, no need for everything in here - I just thought I'd illustrate all the available info that is there. And of course: Thanks a lot! You can't imagine how happy you'd make me to have working Japanese EPG again! :)

Joined: 7 years
Last seen: 2 years

Wow, that was quite fast, thank you! :)

  1. It is indeed working nicely already, but you're not getting the description. Where did you see みどころ\|ストーリー?
    The description is always below ≪番組内容≫, 番組内容, or sometimes 詳細 and if none of these exist, there's always a synopsis below 概要 on every page.
  2. I don't think the site has a dedicated block for episode number or title. Sadly the only way they have it at all is in the title itself and following an inconsistent pattern that may be next to impossible to scrub :\

    シンプソンズ シーズン28 #16「キャンプ・クラスティの悪夢」[字]
    translates to: Simpsons Season28 #16「Nightmare in Kamp Krusty」[subtitled]
    アメリカン・アイドル シーズン15#13【セミファイナリスト・パフォーマンス】[字]
    American Idol Season15#13【Semifinals Performances】[subtitled]
    New Girl ~ダサかわ女子と三銃士 S6 #16「立派な愛の神」[字]
    New Girl S6 #16

    Drop it and leave it in the title, I guess? :D

  3. Every xmltv_id already contains the site_id. You can still find it in the small channel titles on the site. All of them have two lines, the first one being the channel id. So basically, if I understand this correctly, the title is in every block <div class="basicInfo">, element <div class="text">, up to the first <br>. The only exception is the very first channel BS200(BS10); you'd probably have to stop it at "(" there?
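
For point 2, the three example titles do share a loose shape: a season marker (シーズン or S), a number, `#`, an episode number, and an optional subtitle in 「」 or 【】. The Python sketch below only illustrates that idea and is not WebGrab+ syntax. The names `parse_title`, `SEASON_EPISODE` and `SUBTITLE` are invented for the example, and the patterns are guesses based on these three titles alone, so other channels may well break them.

```python
import re

# Season marker (シーズン or S), season number, '#', episode number.
SEASON_EPISODE = re.compile(r"(?:シーズン|S)\s*(?P<season>\d+)\s*#\s*(?P<episode>\d+)")
# Episode title enclosed in 「...」 or 【...】.
SUBTITLE = re.compile(r"[「【](?P<subtitle>[^」】]+)[」】]")

def parse_title(raw):
    """Return title, season, episode and subtitle; unmatched parts stay None."""
    info = {"title": raw.strip(), "season": None, "episode": None, "subtitle": None}
    match = SEASON_EPISODE.search(raw)
    if match:
        info["season"] = int(match.group("season"))
        info["episode"] = int(match.group("episode"))
        # Everything before the season marker is treated as the show title.
        info["title"] = raw[:match.start()].strip()
    sub = SUBTITLE.search(raw)
    if sub:
        info["subtitle"] = sub.group("subtitle")
    return info

print(parse_title("シンプソンズ シーズン28 #16「キャンプ・クラスティの悪夢」[字]"))
print(parse_title("アメリカン・アイドル シーズン15#13【セミファイナリスト・パフォーマンス】[字]"))
```

A title without a recognisable marker comes back unchanged with the other fields set to None, which matches the "leave it in the title" fallback.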

Thanks again for your time, we're close to having a working Japanese EPG source again :)

Joined: 7 years
Last seen: 2 years

I get "no shows in indexpage!" now for every channel I try. Is it working for you? :o

Ah, yeah, I understood the description.scrub part, I just copied みどころ\|ストーリー from the code to ask where you found either of them. But you're right, some channels (at least the Star Channels) may use that pattern. みどころ is the general series description, whereas ストーリー (lit. Story) is the episode plot, so みどころ is quite useless as an episode text.

description.scrub {regex||<b>(?:概要\|みどころ\|ストーリー\|≪番組内容≫\|番組内容\|詳細)</b>[^<]*<p[^>]*>([^<]*)</p>||}
So, does this work like a hierarchy, choose the first one that pops up as the description text? If so, wouldn't it be better to put the 概要 part last, as it is the least descriptive text but is on every page? I'd do 番組内容\|≪番組内容≫\|ストーリー\|詳細\|概要
Or wait... is it in the other direction? :D
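
On the ordering question: in an ordinary regex engine, the order of the alternatives does not act as a priority list across the page. The search returns the match that starts earliest in the text, whichever heading that happens to be. The Python sketch below shows the difference between one alternation and an explicit priority loop. It uses Python's `re` module, not WebGrab+, and how WebGrab+ itself chooses between several matches is not something this example can confirm. The markup in `sample` and the name `pick_description` are assumptions made up to mirror the pattern above.

```python
import re

# Preferred order from the post: detailed headings first, 概要 last.
HEADINGS = ["番組内容", "≪番組内容≫", "ストーリー", "詳細", "概要"]

def pick_description(html, headings=HEADINGS):
    """Try each heading in priority order and return the first body found."""
    for heading in headings:
        pattern = r"<b>" + re.escape(heading) + r"</b>[^<]*<p[^>]*>([^<]*)</p>"
        match = re.search(pattern, html)
        if match:
            return match.group(1).strip()
    return None

# Hypothetical page fragment: the short synopsis comes first on the page.
sample = "<b>概要</b><p>short synopsis</p><b>番組内容</b><p>full description</p>"

# A single alternation picks whichever heading appears first in the text.
alternation = re.compile(r"<b>(?:番組内容|概要)</b>[^<]*<p[^>]*>([^<]*)</p>")
print(alternation.search(sample).group(1))  # short synopsis
print(pick_description(sample))             # full description
```

So if 概要 sits above the detailed text on the page, reordering the alternatives alone would not change which one a leftmost search finds.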

To be quite honest, I didn't understand your reply on the channels.xml creation or why you couldn't use the first line of that box - but as far as I see, the channels have not changed anyway :)

Joined: 7 years
Last seen: 2 years

Yep, that was it, it's working fine now. There seem to be some errors with the description part (only minor stuff), and I haven't had the chance to test the episode and subtitle yet, but I'll have to stop for today – it's 00:30 where I live :D

Thanks again, that's already more than I could have asked for and I'll check back tomorrow so we can finish this thing :)


Brought to you by Jan van Straaten
