
kme : xpath   53

Scrapy - xpath return parent node with content based on regex match - Stack Overflow
Not the solution I was looking for, but has a useful example of how to lower-case a text string (good idea for sorting case-insensitively).

<code class="language-python">def parse(self, response):
    # Lower-case @href with translate() so "keyword" matches case-insensitively.
    for href in response.xpath('//a[contains(translate(@href,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz"),"keyword")]/@href'):
        full_url = response.urljoin(href.extract())
        yield { 'url': full_url, }</code>

Note that XPath 2.0 has 'upper-case' and 'lower-case' functions, which simplify this process. Also note that (as of this writing, 2019-10-23) XmlStarlet does not support these functions.
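For my own reference, the XPath 1.0 lower-casing trick maps onto Python's 'str.translate'. A minimal sketch of what the predicate computes for each link; the hrefs and the 'href_matches' helper are made up for illustration and are not part of the linked answer:

```python
import string

# Same character mapping as translate(@href, "ABC...XYZ", "abc...xyz") in XPath 1.0.
LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def href_matches(href: str, keyword: str) -> bool:
    """Emulate contains(translate(@href, UPPER, lower), keyword)."""
    return keyword in href.translate(LOWER)

# Hypothetical hrefs, for illustration only.
hrefs = ["/docs/KeyWord-list", "/about", "/KEYWORD"]
matching = [h for h in hrefs if href_matches(h, "keyword")]
print(matching)
```

Like the XPath version, this only folds ASCII letters; accented characters pass through unchanged.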
xml  xsl  xpath  python  scrapy  examplecode  textprocessing 
october 2019 by kme
xml - Using xmllint and xpath with a less-than-perfect HTML document? - Stack Overflow
'xidel' (http://www.skynet.be) seems to handle malformed HTML better than 'xmllint --html' or 'xmlstarlet' even after passing through 'tidy'.
xml  xpath  webscraping  malformedhtml  maybesolution 
may 2019 by kme
Introduction to using XPath in JavaScript - JavaScript | MDN | https://developer.mozilla.org/
<code class="language-javascript">
var xpathResult = document.evaluate( xpathExpression, contextNode, namespaceResolver, resultType, result );
</code>
intro  javascript  xpath  webdevel  reference  newbie 
april 2019 by kme
xpath - How can I match on an attribute that contains a certain string? - Stack Overflow | https://stackoverflow.com/
This naïve approach works, if you can be reasonably assured that one class will not be contained as a substring of another (that you *don't* want to match):

<code class="language-xpath">
//div[contains(@class, 'atag') and contains(@class ,'btag')]
</code>
mjv's answer is a good start but will fail if atag is not the first classname listed.

The usual approach is the rather unwieldy:
<code class="language-xpath">
//*[contains(concat(' ', @class, ' '), ' atag ')]
</code>

This works as long as classes are separated by spaces only, and not other forms of whitespace. This is almost always the case. If it might not be, you have to make it more unwieldy still:
<code class="language-xpath">
//*[contains(concat(' ', normalize-space(@class), ' '), ' atag ')]
</code>

(Selecting by classname-like space-separated strings is such a common case it's surprising there isn't a specific XPath function for it, like CSS3's '[class~="atag"]'.)
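A rough Python emulation of the three variants, to see where the naive one goes wrong. The helper names ('normalize_space', 'has_class') are mine, and the regex assumes XPath's definition of whitespace (space, tab, CR, LF):

```python
import re

def normalize_space(value: str) -> str:
    """Emulate XPath normalize-space(): collapse whitespace runs, trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")

def has_class(class_attr: str, name: str) -> bool:
    """Emulate contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    return f" {name} " in f" {normalize_space(class_attr)} "

# Naive contains(@class, 'atag') also matches the unrelated class "atagger".
naive_hit = "atag" in "atagger btag"
# Padding with spaces turns the test into a whole-token match.
padded_hit = has_class("atagger btag", "atag")
# normalize-space() makes tab-separated class lists work too.
tab_hit = has_class("btag\tatag", "atag")
print(naive_hit, padded_hit, tab_hit)
```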
webdevel  xpath  css  class  solution 
february 2019 by kme
Combine XPATH predicate with position - Stack Overflow | https://stackoverflow.com/
Solution: use parentheses for grouping, *then* subscript the results. Still not clear why the OP's methods don't work, though.
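A likely explanation (mine, not from the linked answer): in XPath 1.0 a predicate binds more tightly than '//', so '//a[1]' means "every 'a' that is the first 'a' child of its parent", while '(//a)[1]' means "the first 'a' in the whole document". A small sketch with Python's 'xml.etree.ElementTree', which has no parenthesised form, so the grouped version is emulated by indexing the result list; the sample document is made up:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>"
    "<div><a>1</a><a>2</a></div>"
    "<div><a>3</a></div>"
    "</root>"
)

# .//a[1]: the [1] is evaluated per parent, so each div contributes its first <a>.
first_per_parent = [a.text for a in root.findall(".//a[1]")]

# (//a)[1]: collect all <a> elements in document order, then take the first one.
first_overall = root.findall(".//a")[0].text

print(first_per_parent, first_overall)
```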
xpath  webdevel  syntax  solution 
december 2018 by kme
html - XPath Query: get attribute href from a tag - Stack Overflow | https://stackoverflow.com/
For the following HTML document:

<code class="language-html">
<html>
<body>
<a href="http://www.example.com">Example</a>
<a href="http://www.stackoverflow.com">SO</a>
</body>
</html>
</code>

The xpath query /html/body//a/@href (or simply //a/@href) will return:

<code>
http://www.example.com
http://www.stackoverflow.com
</code>
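The same extraction in Python, as a sketch using only the standard library. 'xml.etree.ElementTree' implements a subset of XPath with no '/@href' step, so this selects the elements and reads the attribute afterwards; it also assumes the input is well-formed, as the sample above is:

```python
import xml.etree.ElementTree as ET

html_doc = """<html>
<body>
<a href="http://www.example.com">Example</a>
<a href="http://www.stackoverflow.com">SO</a>
</body>
</html>"""

root = ET.fromstring(html_doc)
# Select every <a> that has an href attribute, then read the attribute value.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print("\n".join(hrefs))
```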
webdevel  xpath  solution 
december 2018 by kme
Xpath cheatsheet | https://devhints.io/
This one is good because it translates between CSS selectors (which you probably already know) to the slightly weirder XPath ones.
essential  xpath  cheatsheet  webdevel  reference  thisone  fuckina 
december 2018 by kme
xpath - Why doesn't xmlstarlet select all nodes? - Stack Overflow | https://stackoverflow.com/
Is this what you need?
<code class="language-bash">xml sel -t -m "//@category" -v "." -o " " books.xml</code>

or to separate the results on each line
<code class="language-bash">xml sel -t -m "//@category" -v "." -n books.xml</code>
xml  xpath  xmlstarlet  webdevel  samplecode  patternmatching 
march 2018 by kme
command line - Why does XMLStarlet return additional whitespace/newline for XML text nodes? - Super User | https://superuser.com/
I thought I got this to work, but in the end I still had to pipe it all through
<code>tr -ds '[:blank:]\r' '' | sed '/^$/d'</code>
<code>xmlstarlet sel -T -t -v "//node/item" file.xml</code>

and it outputs the content of
<code class="language-xml"><node><item>content</item></node></code>

as text without additional whitespace.
xml  xslt  xpath  whitespace  textprocessing  webdevel  almost  solution 
march 2018 by kme
xpath expression to remove whitespace - Stack Overflow | https://stackoverflow.com/
The 'translate' function doesn't seem to work on node lists, though.

Note that you can just use 'xmlstarlet -T -t -v //xpath/expression' if you have XmlStarlet available. Otherwise...
I. Use this single XPath expression:
<code class="language-xpath">translate(normalize-space(/tr/td/a), ' ', '')</code>

Explanation:

normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.

translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.

II. Alternatively:
<code class="language-xpath">translate(/tr/td/a, ' &#9;&#10;&#13;', '')</code>
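
To convince myself that the two options give the same result, here is a small Python emulation. The helper names and the sample cell text are made up; 'normalize_space' assumes XPath's whitespace set (space, tab, CR, LF):

```python
import re

def normalize_space(value: str) -> str:
    """Emulate XPath normalize-space(): collapse whitespace runs, trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")

def translate_delete(value: str, chars: str) -> str:
    """Emulate translate(value, chars, ''): drop every character listed in chars."""
    return value.translate({ord(c): None for c in chars})

cell = "  Foo \n\tBar  "  # hypothetical string value of /tr/td/a

# Option I: normalize first, then delete the remaining single spaces.
option_one = translate_delete(normalize_space(cell), " ")
# Option II: delete space, tab, LF and CR in one pass.
option_two = translate_delete(cell, " \t\n\r")
print(option_one, option_two)
```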

xml  xslt  xpath  whitespace  textprocessing  webdevel  maybesolution 
march 2018 by kme
html - XPath with optional element in hierarchy - Stack Overflow | https://stackoverflow.com/
My answer was to just use two piped XPath expressions. This seems to work if you have something like 'saxon-lint' that understands it, though:
In XPath 2.0, the optional step can be expressed as (tbody|.).
<code>//table[@id="foo"]/(tbody|.)/tr</code>
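For XPath 1.0-only tools, the "two expressions" fallback can be sketched in Python with the standard library. The sample page and the 'rows' helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<table id='foo'><tbody><tr><td>a</td></tr></tbody></table>"
    "<table id='bar'><tr><td>b</td></tr></table>"
    "</body></html>"
)

def rows(doc, table_id):
    """Return the rows of a table whether or not a <tbody> wraps them."""
    table = doc.find(f".//table[@id='{table_id}']")
    if table is None:
        return []
    # Combine both paths: table/tr and table/tbody/tr.
    return table.findall("tr") + table.findall("tbody/tr")

foo_cells = [tr.find("td").text for tr in rows(page, "foo")]
bar_cells = [tr.find("td").text for tr in rows(page, "bar")]
print(foo_cells, bar_cells)
```

Unlike a real XPath union, plain list concatenation does not restore document order when a table mixes direct rows with rows inside a 'tbody'.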
xpath  xpath2  patternmatching  xml  webdevel  solution 
march 2018 by kme
html - Testing text() nodes vs string values in XPath - Stack Overflow | https://stackoverflow.com/
XPath text() = is different than XPath . =

(Matching text nodes is different than matching string values)

The following XPaths are not the same...
<code class="language-xpath">//span[text() = 'Office Hours']</code>

Says: Select the span elements that have an immediate child text node equal to 'Office Hours'.

[whereas]
<code class="language-xpath">//span[. = 'Office Hours']</code>

Says: Select the span elements whose string value is equal to 'Office Hours'.
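A sketch of the difference in Python, using only 'xml.etree.ElementTree'. The sample markup and both helper functions are my own; ElementTree has no text-node objects, so child text nodes are approximated by an element's 'text' plus the 'tail' of each child:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<div>"
    "<span>Office Hours</span>"
    "<span>Office <b>Hours</b></span>"
    "</div>"
)

def string_value(elem):
    """XPath string value: all descendant text, concatenated."""
    return "".join(elem.itertext())

def child_text_nodes(elem):
    """Text nodes that are direct children of elem (not text inside child elements)."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

spans = doc.findall("span")
# Like //span[. = 'Office Hours']: both spans match.
by_string_value = [i for i, s in enumerate(spans) if string_value(s) == "Office Hours"]
# Like //span[text() = 'Office Hours']: only the span without inline markup matches.
by_text_node = [i for i, s in enumerate(spans) if "Office Hours" in child_text_nodes(s)]
print(by_string_value, by_text_node)
```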
xpath  xml  syntax  textcontent  thisvsthat  patternmatching  newbie  webdevel  reference  explained 
february 2018 by kme
xml - How to use XPath contains() here? - Stack Overflow | https://stackoverflow.com/
Example from the comments

<code class="language-xpath">//ul[@class='featureList' and ./li[contains(.,'Model')]]</code>
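What that predicate selects, spelled out as a Python sketch with the standard library (ElementTree's XPath subset has no 'contains()', so the test is done in Python). The sample markup is made up:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<div>"
    "<ul class='featureList'><li><b>Model</b>: X100</li><li>Color: red</li></ul>"
    "<ul class='featureList'><li>Color: blue</li></ul>"
    "<ul class='other'><li>Model: Y200</li></ul>"
    "</div>"
)

def string_value(elem):
    """XPath string value: all descendant text, concatenated."""
    return "".join(elem.itertext())

# Keep each <ul class="featureList"> that has at least one <li> child
# whose string value contains "Model".
matches = [
    ul for ul in doc.iter("ul")
    if ul.get("class") == "featureList"
    and any("Model" in string_value(li) for li in ul.findall("li"))
]
first_items = [string_value(ul.find("li")) for ul in matches]
print(first_items)
```

Because 'contains(., ...)' tests the string value, the first list still matches even though "Model" sits inside a nested 'b' element.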
xpath  xml  webdevel  reference  patternmatching 
february 2018 by kme
XPath Reference | https://msdn.microsoft.com/
Dunno if this is a good reference or not; it's based on the XPath 1.0 W3C spec.
dotnet  microsoft  msdn  xpath  reference  xpath1 
february 2018 by kme
xml - Get xmllint to output xpath results n-separated, for attribute selector - Stack Overflow | https://stackoverflow.com/
You can try:

<code class="language-bash">
$ xmllint --shell inputfile <<< `echo 'cat /config/*/@*'`
</code>

You might need to grep the output, though, so as to filter the undesired lines.
xml  xmllint  xpath  textprocessing  webdevel  solution 
january 2018 by kme
xpath @ZVON.org | http://zvon.org/
The tutorial here is great, and all the examples work out-of-the-box with 'xmllint --shell'.
xpath  xslt  xml  webdevel  tutorial  reference  solution  deadlink 
january 2018 by kme
xml - Get text content of an HTML element using XPath? - Stack Overflow | https://stackoverflow.com/
You want to select all descendant text, not just child text:

<code class="language-xpath">//div[a[contains(., "Add to cart")]]/p//text()</code>
xpath  lxml  webdevel  webscraper  solution 
january 2018 by kme
xml_grep2 - search.cpan.org
This is called "App::Xml_grep2" on CPAN.
-t, --text-only

Return the result as text (using the XPath value of nodes). Results are stripped of newlines and output 1 per line.

Results are in the original encoding for the document.
perl  xml  grep  xpath  html  webdevel  textprocessing 
december 2017 by kme
XML::Twig - A perl module for processing huge XML documents in tree mode. - metacpan.org - https://metacpan.org/
So the 'xml_grep --text_only --html' options seem to work a treat; these come with the 'xml-twig-tools' package on Debian/Ubuntu.

Quick start:

<code class="language-bash">
> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange
</code>

Another useful example, I used when making a GeSHi syntax for ~/.ssh/config:

<code class="language-bash">
curl https://man.openbsd.org/ssh_config | \
xml_grep --html --text_only 'dt[@class="It-tag"]/*/b'
</code>

No matter what I do, I can't get the 'text()' XPath selector to work, even though it's part of the XPath 1.0 standard. That's why '-t' or '--text_only' seems to be necessary, to return the text content of the matched nodes.

The '--html' option is also pretty essential, because I think it runs the input through a more liberal HTML parser and produces a valid (X)HTML tree, so you don't get the XML parse errors that you otherwise get for just about every page on the Internet, it seems.

On that note, now I finally see the value of "conformant" XHTML, and I'm wondering: why did XHTML cease to be a thing?

Example:
<code>curl -q https://packages.debian.org/wheezy/keepassx | xml_grep --text_only --html '//*[@id="pdeps"]/ul/li//a'</code>
xpath  xslt  xml  parser  parsing  perl  module  library  commandline  cli  solution  fuckina 
october 2017 by kme
html - ignore malformed XML with Perl-XML - Stack Overflow - https://stackoverflow.com/
xml_grep, a command line tool which comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).

For example:

> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange
xpath  xml  parser  textprocessing  webdevel  commandline  cli  maybesolution 
october 2017 by kme
bash - Using xmlstarlet to extract HTML - Stack Overflow - https://stackoverflow.com/
-v means --value-of which is the contents of tags. You should use -c or --copy-of to get the tags themselves.

<code class="language-bash">xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html</code>
Or just

<code class="language-bash">xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html</code>
xmlstarlet  xml  xpath  parsing  textprocessing  webdevel  solution 
october 2017 by kme
Grep and Sed Equivalent for XML Command Line Processing - Stack Overflow - https://stackoverflow.com/
Accepted answer recommends XMLStarlet, but links to this handy tutorial: https://www.ibm.com/developerworks/library/x-starlet/index.html.

Also:
To Joseph Holsten's excellent list, I add the xpath command-line script which comes with Perl library XML::XPath. A great way to extract information from XML files:

xpath -q -e '/entry[@xml:lang="fr"]' *xml
xml  xpath  xslt  cli  commandline  textprocessing  list  recommendation 
october 2017 by kme
xml - How to execute XPath one-liners from shell? - Stack Overflow - https://stackoverflow.com/
What on earth do all these options do?
<code class="language-bash">xmlstarlet sel -T -t -m '//element/@attribute' -v '.' -n filename.xml</code>
Nokogiri. If I write this wrapper I could call the wrapper in the way described above:

<code class="language-ruby">
#!/usr/bin/ruby

require 'nokogiri'

Nokogiri::XML(STDIN).xpath(ARGV[0]).each do |row|
  puts row
end
</code>
XML::XPath. Would work with this wrapper:

<code class="language-perl">
#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;

my $root = XML::XPath->new(ioref => 'STDIN');
for my $node ($root->find($ARGV[0])->get_nodelist) {
    print($node->getData, "\n");
}
</code>


Also:
<code class="language-bash">
xmllint --xpath '//element/@attribute' file.xml
xmlstarlet sel -t -v "//element/@attribute" file.xml
saxon-lint --xpath '//element/@attribute' file.xml
</code>

In Python with lxml:
So to do the same for normal Web content—HTML docs that aren’t necessarily well-formed XML:
<code class="language-bash">
echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
from lxml import html; \
print '\n'.join(html.tostring(node) for node in html.parse(stdin).xpath('//p'))"
</code>

And to instead use html5lib (to ensure you get the same parsing behavior as Web browsers—because like browser parsers, html5lib conforms to the parsing requirements in the HTML spec).
<code class="language-bash">
echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print '\n'.join(html.tostring(node) for node in doc.xpath('//p'))"
</code>
xpath  xml  xslt  parsing  ruby  textprocessing  webdevel  commandline  cli  oneliner  list  recommendation  samplecode 
october 2017 by kme
xpath - Pattern Matching with XSLT - Stack Overflow


Regular Expressions are supported only in XSLT 2.x/XPath 2.x.

As at this date, no publicly available browser supports XSLT 2.x/XPath 2.x.

In your concrete case you can use:

starts-with('awesome','awe')

other useful XPath 1.0 functions are:

contains()

substring()

substring-before()

substring-after()

normalize-space()

translate()

string-length()
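
For comparison, rough Python equivalents of a few of these, as a sketch only. The function names are mine, and the two substring helpers assume a non-empty separator:

```python
def starts_with(s: str, prefix: str) -> bool:
    """XPath starts-with(s, prefix)."""
    return s.startswith(prefix)

def substring_before(s: str, sep: str) -> str:
    """XPath substring-before(s, sep): empty string when sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s: str, sep: str) -> str:
    """XPath substring-after(s, sep): empty string when sep does not occur."""
    _, found, tail = s.partition(sep)
    return tail if found else ""

print(starts_with("awesome", "awe"))
print(substring_before("1999/04/01", "/"))
print(substring_after("1999/04/01", "/"))
```

The explicit "not found" branch matters: Python's 'partition' returns the whole string as the head when the separator is missing, whereas the XPath functions return an empty string.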

scraping  screenscraping  webscraping  xpath  xslt  xml  html  patternmatching  regex  solution 
may 2017 by kme
Xidel - HTML/XML/JSON data extraction tool
Easier to use than some of its counterparts:

<code class="language-bash">xidel https://lithub.com/the-ultimate-best-books-of-2018-list/ -e //title</code>
Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

Mirrored (?) at: http://videlibri.sourceforge.net/xidel.html
Source code at: https://github.com/benibela/xidel
FAQ at: https://github.com/benibela/xidel/wiki/Frequently-asked-questions
macOS build instructions: http://bit.ly/34VzheD (evernote.com)
textprocessing  html  xml  json  datamunging  extraction  importexport  cli  commandline  utility  software  xpath  xslt  alternativeto  xmlstarlet 
march 2016 by kme
