Add post-processing or more specific content selection #2

2013-04-09T21:40:51+01:00

mbirth commented

2013-04-09 21:40:51 +01:00

(Migrated from github.com)

N24.de has additional content inside their main article DIV. There should be some filter or a more specific way of selecting the desired content to use.

bfly75 commented

2013-04-14 13:48:19 +01:00

(Migrated from github.com)

I expect this can be done using a more extensively defined XPath query. Below are some examples (not for N24.de) which might be useful. Today is a slow news day, so I don't know yet whether tt-rss works well with these queries. Based on XPath validators, they should.

edit: ahh, unfortunately this does not seem to work with your code. So far you only use the first entry from the query, instead of adding all of them to the article text.

Selecting several specific divs / tags:
//h1 | //h2 | //h3
//div[@id='artikelKolom']/div[@class='zaktxt clear']/div[@class='zak_normal'] | //div[@id='artikelKolom']/p
Note: sequence matters when doing it like this! //h1 | //h2 | //h3 will show first all h1's, followed by all h2's and then all h3's
//div[@id='artikelKolom']/*[contains(@class,'zaktxt') or name()='p']
Note: sequence does not seem to matter, sequence is based on sequence in file

Select all div's with certain classes. No need for the div's to have the same parent
//div[@class='content illustrated' or @class='post-body']
//div[contains(@class,'illustration top')] | //div[contains(@class,'post-body')]
//div[contains(@class,'illustration top') or contains(@class,'post-body')]
Note: not sure whether sequence matters

Select all children of div id='artikelKolom', except children with class='broodtxt' or class='bannercenter ...'
//div[@id='artikelKolom']/*[@class!='broodtxt']
//div[@id='artikelKolom']/*[not(@class='broodtxt')]
//div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt'))]
//div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt')) and not(contains(@class, 'bannercenter'))]
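
As a rough illustration of the exclusion idea in the last group of queries, here is a minimal Python sketch. It is not the plugin's code (the plugin is PHP), the markup is invented, and because Python's built-in ElementTree supports only a subset of XPath (no `not()` or `contains()`), the predicate is emulated with a plain filter over the direct children.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup reusing the class names from the queries above.
DOC = """<div id="artikelKolom">
  <p>Intro</p>
  <div class="zaktxt clear">Body</div>
  <div class="broodtxt">Breadcrumbs</div>
  <div class="bannercenter wide">Advert</div>
  <p>Outro</p>
</div>"""

# Substrings to exclude, mirroring not(contains(@class, '...')).
EXCLUDED = ("broodtxt", "bannercenter")

def keep(child):
    """True unless the class attribute contains one of the excluded substrings."""
    classes = child.get("class", "")
    return not any(name in classes for name in EXCLUDED)

container = ET.fromstring(DOC)
# Iterating an element yields its direct children in document order.
kept = [child for child in container if keep(child)]
print([(c.tag, c.text) for c in kept])
# [('p', 'Intro'), ('div', 'Body'), ('p', 'Outro')]
```

The kept children come out in the order they appear in the file, which matches the note above about the single-path queries.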

mbirth commented

2013-04-14 14:59:55 +01:00

(Migrated from github.com)

I think it'll get too complicated if you need to "puzzle" the result together like this. Also it'll get worse when the source changes its layout (like N24 did some days ago).

Maybe I'll implement a blacklist which will remove certain XPath elements from the result. I think this is more robust.
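
A blacklist of this kind could look roughly like the following Python sketch. It only illustrates the idea and is not the plugin's implementation; the markup, the `BLACKLIST` paths and the `apply_blacklist` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article node as it might be returned by the main XPath query.
DOC = """<div class="article">
  <p>First paragraph</p>
  <div class="related"><a href="/other">Related story</a></div>
  <p>Second paragraph</p>
  <div class="comments"><p>Comment</p></div>
</div>"""

# Hypothetical blacklist: paths are evaluated relative to the selected node.
BLACKLIST = [".//div[@class='related']", ".//div[@class='comments']"]

def apply_blacklist(node, paths):
    """Remove every element matched by any blacklist path, in place."""
    for path in paths:
        # ElementTree has no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in node.iter() for child in parent}
        for unwanted in node.findall(path):
            parent = parents.get(unwanted)
            if parent is not None:
                parent.remove(unwanted)
    return node

article = apply_blacklist(ET.fromstring(DOC), BLACKLIST)
print([child.tag for child in article])        # ['p', 'p']
print([p.text for p in article.findall("p")])  # ['First paragraph', 'Second paragraph']
```

The main query stays simple and broad, and only the blacklist needs touching when a site adds a new unwanted block.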

Kasad commented

2013-04-19 10:42:35 +01:00

(Migrated from github.com)

A blacklist would be really nice :D
Also, I have a big problem with welt.de ... their feed URL links to an overview page... there should be a rewrite of the source URL, like:
http://www.welt.de/?config=articleidfromurl&artid=115415142
should be
http://www.welt.de/article115415142

It would be fantastic to see these features :D
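
The rewrite requested here could be sketched as follows in Python. The `rewrite_welt_url` helper is hypothetical, and the mapping is inferred only from the single URL pair given above, so it may not hold for every welt.de link.

```python
from urllib.parse import urlsplit, parse_qs

def rewrite_welt_url(url):
    """Turn the overview-style link into a direct article link.

    Returns the URL unchanged when it carries no numeric `artid` parameter.
    """
    parts = urlsplit(url)
    art_ids = parse_qs(parts.query).get("artid")
    if not art_ids or not art_ids[0].isdigit():
        return url
    return f"{parts.scheme}://{parts.netloc}/article{art_ids[0]}"

print(rewrite_welt_url("http://www.welt.de/?config=articleidfromurl&artid=115415142"))
# http://www.welt.de/article115415142
```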

Kasad commented

2013-04-21 11:29:13 +01:00

(Migrated from github.com)

Hi,

is there a way to use all entries from the query, instead of adding only the first to the article text?

div[@class='news-single-item']/p ==> only returns the first found p content

div[@id='news-single-item']/*[not(div[@class='comments'])] ==> doesn't work :(

Thank you for your answer.

Kasad

bfly75 commented

2013-04-21 11:37:01 +01:00

(Migrated from github.com)

Yes, but you need to make some changes to the init.php file. I did this last weekend and this week it seems to work as expected. See https://github.com/bfly75/ttrss_plugin-af_feedmod.
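
The actual init.php change is not reproduced in this thread. The difference between keeping only the first match and keeping all matches can be sketched like this in Python, with invented markup that reuses the class name from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup reusing the class name from the question above.
DOC = """<div class="news-single-item">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <div class="comments"><p>A comment</p></div>
</div>"""

root = ET.fromstring(DOC)

# First match only: the behaviour the thread reports for the plugin at the time.
first_only = ET.tostring(root.find("p"), encoding="unicode").strip()

# All matches: serialise every hit and join them in document order.
all_hits = "\n".join(
    ET.tostring(p, encoding="unicode").strip() for p in root.findall("p")
)

print(first_only)
print(all_hits)
```

The path `p` matches only direct children here, so the paragraph nested inside the comments div is not picked up.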

Kasad commented

2013-04-21 12:50:03 +01:00

(Migrated from github.com)

Wow, thank you very much - this works awesome :D

uusijani commented

2013-04-26 13:33:49 +01:00

(Migrated from github.com)

I think post-processing should also rip out (at least) id, class and style attributes from the content. Some pages I fetch using feedmod have elements with ids such as "overlay" in them that pick up tt-rss's styling, making things look wonky.
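
A minimal Python sketch of such an attribute-stripping pass, assuming the fetched content is well-formed enough to parse; the `strip_attributes` helper and the sample markup are hypothetical and not part of the plugin.

```python
import xml.etree.ElementTree as ET

# Attributes that could collide with the reader's own styling.
STRIP = ("id", "class", "style")

def strip_attributes(node, names=STRIP):
    """Remove the named attributes from the node and all its descendants."""
    for element in node.iter():
        for name in names:
            element.attrib.pop(name, None)
    return node

DOC = '<div id="overlay" class="box"><p style="color:red">Text</p><a href="/x" class="more">More</a></div>'
cleaned = strip_attributes(ET.fromstring(DOC))
print(ET.tostring(cleaned, encoding="unicode"))
# <div><p>Text</p><a href="/x">More</a></div>
```

Such a pass would have to run after content selection, since the XPath queries themselves rely on ids and classes.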

tbar commented

2013-05-07 08:10:41 +01:00

(Migrated from github.com)

@bfly75: Thanks for that modification!
@mbirth: You should consider incorporating bfly75's modification, maybe by creating a new type (e.g. xpath-all-matches).

mbirth commented

2013-06-20 11:03:57 +01:00

(Migrated from github.com)

I just merged changes from @rangerer which add a new "cleanup" option to remove unwanted parts from the main XPath node. He also has provided a lot of examples.

mbirth commented

2013-07-30 15:54:20 +01:00

(Migrated from github.com)

Another thing this one should do: make all URLs absolute (i.e. fully qualified, including "http://www.example.org/"), because, as in #22, relative images are not shown.
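
Resolving relative URLs against the article's address could be sketched like this in Python. The `absolutize` helper, the attribute list and the sample URLs are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Attributes assumed to carry URLs; a real pass might need more.
URL_ATTRS = ("href", "src")

def absolutize(node, base_url):
    """Rewrite relative href/src values so they are fully qualified."""
    for element in node.iter():
        for attr in URL_ATTRS:
            value = element.get(attr)
            if value is not None:
                element.set(attr, urljoin(base_url, value))
    return node

DOC = '<div><img src="/images/a.png" /><a href="story.html">Next</a><a href="http://other.example/x">Ext</a></div>'
fixed = absolutize(ET.fromstring(DOC), "http://www.example.org/news/today.html")
print([e.get("src") or e.get("href") for e in fixed])
# ['http://www.example.org/images/a.png', 'http://www.example.org/news/story.html', 'http://other.example/x']
```

URLs that are already absolute pass through unchanged, so the pass is safe to apply to every fetched article.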

Kasad commented

2013-08-13 22:50:02 +01:00

(Migrated from github.com)

Hi,

after my ttrss crashed I couldn't use the version of bfly75 any longer. Could you please add his way to display more than one div?

Greetings
K

Reference: mbirth/ttrss_plugin-af_feedmod#2