Detection rules specify what to grab on a page when automatic detection does not work properly on a particular website.
They are a little technical, as they involve regular expressions and xPath queries, but they provide a way to grab content no matter how the site is built.
The detection rules are configured in the options screen, accessible via the Options menu item.
The options screen appears.
Then clicking on the Detection rules button makes them appear.
To add a new rule, click on the Add rule button.
A form appears to let you describe the rule.
First you have to enter a name for the rule. It is just a label that will appear in the left list once the rule is added.
The second field is the regular expression. Every time content is about to be grabbed from a page (or a feed), if the URL of the page matches this regular expression, the rule is applied.
In the above example, the rule applies to every page whose URL contains 'name_of_site'.
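To illustrate how such a match works (this is only a sketch, not the extension's actual code; the pattern and URLs are invented), a rule applies whenever its regular expression is found anywhere in the page URL:

```python
import re

# A hypothetical rule: applies to any URL containing 'name_of_site'.
rule_pattern = r"name_of_site"

def rule_applies(pattern, url):
    """Return True if the page URL matches the rule's regular expression."""
    return re.search(pattern, url) is not None

print(rule_applies(rule_pattern, "http://www.name_of_site.com/articles/42"))  # True
print(rule_applies(rule_pattern, "http://www.other_site.com/articles/42"))    # False
```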
The xPath field lets you specify which part of the page should be grabbed.
In the above example, the rule grabs the nodes whose id equals 'article': //*[@id='article']
Another example would be to grab all nodes with class 'myClass': //*[@class='myClass']
In most cases the zone you want to grab has an id or a class. If the page uses no id, class, or other attribute that could easily identify the zone to grab, you will have to use the full xPath path to the HTML element. In GrabMyBooks, a full xPath path starts after the body element. For example, instead of /html/body/table/tbody/tr/td, you must enter /table/tbody/tr/td.
Here are some examples of how to precisely select the content to grab.
- The zone has an id: //*[@id='zone_id']
- The zone has a distinctive class: //*[@class='zone_class']
- The zone has a distinctive class among other classes: //*[contains(@class,'zone_class')]
- The zone has an id but the end of the content is not interesting: //*[@id='zone_id']/*[position()<20]
- The zone has an id but the beginning of the content is not interesting: //*[@id='zone_id']/*[position()>5]
- There are two zones to grab on the same page: //*[@id='zone1_id'] | //*[@id='zone2_id']
- To grab an image by its URL: //img[contains(@src,'url_part')]
- There is no way to identify the zone by id or class or whatever. Use the full xPath (after the body element): /table/tbody/tr/td
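As a rough illustration of what such an xPath selection does, here is a sketch using Python's standard library (which supports only a small subset of xPath, enough for the id and exact-class cases; GrabMyBooks evaluates full xPath expressions, including contains() and position(), in the browser). The HTML below is invented for the example:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed page; real pages are messier.
html = """
<html><body>
  <div id="menu">Navigation</div>
  <div id="article">
    <h1>Title</h1>
    <p class="myClass">First paragraph.</p>
  </div>
</body></html>
"""

root = ET.fromstring(html)

# //*[@id='article'] -- select the node whose id is 'article'.
article = root.find(".//*[@id='article']")
print(article.tag)  # div

# //*[@class='myClass'] -- select nodes with that exact class value.
paragraphs = root.findall(".//*[@class='myClass']")
print(len(paragraphs))  # 1
```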
Once a rule is added with a click on Ok, it appears in the left list. Clicking on it lets you edit it and, more importantly, change its position in the list.
A page could match more than one regular expression, which means it could be compatible with more than one detection rule. In that case, the rule closest to the top of the list is used.
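The selection logic can be sketched like this (an illustration only, assuming rules are stored top to bottom in a list; the rule names, regexes, and URLs are invented):

```python
import re

# Hypothetical rules, in the order they appear in the left list.
rules = [
    {"name": "Site articles", "regex": r"name_of_site\.com/articles", "xpath": "//*[@id='article']"},
    {"name": "Whole site",    "regex": r"name_of_site\.com",          "xpath": "//*[@id='content']"},
]

def pick_rule(rules, url):
    """Return the first (topmost) rule whose regex matches the URL, or None."""
    for rule in rules:
        if re.search(rule["regex"], url):
            return rule
    return None

# Both regexes match this URL, but the topmost rule wins.
chosen = pick_rule(rules, "http://name_of_site.com/articles/42")
print(chosen["name"])  # Site articles
```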
Rule editor
In the options screen, rules can be defined to describe precisely which part(s) of the page must be grabbed on a particular website. This is a bit technical, as xPath is used to specify which zones to grab. The rule editor lets you define rules graphically, using only the mouse. When you are on a page for which the automatic detection doesn't suit you, click on the Rule editor menu item. The rule editor panel appears on screen.
You can move the panel to either side of the screen by clicking on the green arrows.
As you move the mouse over the page, the parts that GrabMyBooks can grab are highlighted and the computed xPath rule is displayed in blue on the panel.
To see which zone GrabMyBooks would have detected automatically, check the Show default selection box. The zone appears in gray on screen.
To select the zone(s) that you want to be part of the rule, just click on them. They are outlined in red and added to the rule editor panel.
If a zone has sub-elements, you can decide to exclude the first few or the last few of them by clicking on the related buttons.
The two buttons in the top left corner exclude the first few sub-elements. The two buttons in the bottom right corner exclude the last few sub-elements. The button in the top right corner restores the bounds of the sub-selection.
To deselect a zone, click on it again or close the corresponding info box in the rule editor panel.
You can select as many zones as you want.
Test your selection by clicking on the Add to book button. The grabbed content is added to the current book.
Then save your rule by clicking on the Save rule button. The detection rules part of the options screen appears and lets you save the rule. You can overwrite an existing rule if you want.
From now on, for each page of the same website, GrabMyBooks will try to detect the zones you described.
On this website the default detection doesn't suit me, because the article image is not grabbed.
So I decide to create a rule to include it. One click on the image and one click on the article text.
I test the rule by clicking on the Add to book button. It looks OK, so the rule can be saved by clicking on the Save rule button.
And it also works with other pages from the same website.
Next page rule
On some websites, articles are spread across several pages. Having to grab each separate page to get the whole article can be tedious, especially when you come back regularly to the same website to grab content.
In a rule it is possible to specify how to reach the next page of an article. When adding or editing a rule, the field dedicated to detecting the next page is named Next page href is located at xPath.
This field can contain an attribute xPath rule that gives the address of the next article page. Example: //a[@class='next']/@href. This says that the address of the next page of the article is found in a link having the class 'next'.
Of course this varies depending on the website.
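The idea can be sketched with Python's standard library (a simplified stand-in: GrabMyBooks evaluates the attribute xPath itself, while the standard library stops at the element, so the sketch selects the link and then reads its href; the markup is invented):

```python
import xml.etree.ElementTree as ET

# A simplified article page with a 'next' link.
page = """
<html><body>
  <div id="article"><p>Part one of the article.</p></div>
  <a class="next" href="/article?page=2">Next page</a>
</body></html>
"""

root = ET.fromstring(page)

# //a[@class='next']/@href -- find the link element, then read its href.
link = root.find(".//a[@class='next']")
next_url = link.get("href") if link is not None else None
print(next_url)  # /article?page=2
```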
Link on page rule
This is useful when grabbing feeds. Sometimes, links in feed entries lead to an intermediate page with another link to the actual article. It is also useful for websites that aggregate content from other websites: they display the first few lines of the article, followed by a link to the full content.
In such cases you can define a rule that grabs not the page itself but a link on the page. When adding or editing a rule, the field dedicated to detecting that link is named Don't grab the page itself but a link on the page of which href is located at xPath.
This field can contain an attribute xPath rule that gives the address of the article to grab. Example: //a/@href. This says that the address of the article to grab, instead of the current page, is taken from the first link encountered on the page.
Of course this varies depending on the website.
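As with the next-page rule, this can be sketched with the standard library (an illustration only; the teaser markup and base URL are invented, and the real extension evaluates the attribute xPath itself):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# An intermediate page that only teases the article.
teaser = """
<html><body>
  <p>The first few lines of the article...</p>
  <a href="/full-article">Read the full article</a>
</body></html>
"""

root = ET.fromstring(teaser)

# //a/@href -- take the address from the first link on the page,
# then resolve it against the page's own URL.
first_link = root.find(".//a")
article_url = urljoin("http://name_of_site.com/teaser", first_link.get("href"))
print(article_url)  # http://name_of_site.com/full-article
```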