
kme : xml   86

xml - how to? xmlstarlet to extract HTML data by id - Stack Overflow
Essential tip for namespaced HTML, otherwise you get... NOTHING out of 'xmlstarlet'

Just passing HTML through 'xml fo -H -R' (process as HTML and recover as much as possible) is enough to get un-namespaced HTML that is also valid XML (source: https://unix.stackexchange.com/a/382928/278323).

The html data has a default namespace that you have to declare in the xmlstarlet command:
<code class="language-bash">
xmlstarlet sel \
-N n="http://www.w3.org/1999/xhtml" \
-t \
-c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>

UPDATE: I didn't know it, but as the error message says, there is no need to declare the namespace when it's the default one, so this also works:
<code class="language-bash">
xmlstarlet sel \
-t \
-c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>
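The same pitfall shows up outside XmlStarlet. A minimal sketch with Python's standard-library ElementTree; the sample document and the 'n' prefix are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML sample with the default XHTML namespace declared.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="test_table"><tr><td>one</td><td>two</td></tr></table>
  </body>
</html>"""

root = ET.fromstring(XHTML)
ns = {"n": "http://www.w3.org/1999/xhtml"}

# Without the namespace, the path matches nothing.
unqualified = root.findall("./body/table[@id='test_table']")

# With the prefix mapped to the XHTML namespace, the table is found.
qualified = root.findall("./n:body/n:table[@id='test_table']", ns)

# Collect the non-blank text of everything under the matched table.
cells = [t.strip() for table in qualified for t in table.itertext() if t.strip()]
print(unqualified)
print(cells)
```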
xml  xmlstarlet  textprocessing  malformed  html  reference  namespaced  xhtml  solution  fuckina 
13 days ago by kme
Compiling error · Issue #599 · eclipse/mosquitto
You have to install following packages: xsltproc docbook-xsl
Then it will work.

Just 'docbook-xsl' is enough on CentOS, I think. There is no 'xsltproc' package. Also, moreutils' Makefile points to the wrong place for the DocBook stylesheets, so you need to
<code class="language-bash">DOCBOOKXSL='/usr/share/sgml/docbook/xsl-stylesheets' make</code>
errormessage  docbook  xsl  xml  xsltproc  centos  centos7  solution 
11 weeks ago by kme
Scrapy - xpath return parent node with content based on regex match - Stack Overflow
Not the solution I was looking for, but has a useful example of how to lower-case a text string (good idea for sorting case-insensitively).

<code class="language-python">def parse(self, response):
    for href in response.xpath('//a[contains(translate(@href,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz"),"keyword")]/@href'):
        full_url = response.urljoin(href.extract())
        yield { 'url': full_url, }</code>

Note that XPath 2.0 has 'upper-case' and 'lower-case' functions, which simplify this process. Also note that (as of this writing, 2019-10-23) XmlStarlet does not support these functions.
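When the XPath engine has neither 'lower-case' nor 'translate', the case folding can be done in the host language instead. A rough sketch with the standard-library ElementTree; the markup and the hrefs_containing helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
DOC = """<div>
  <a href="/docs/KEYWORD-guide">A</a>
  <a href="/about">B</a>
  <a href="/Keyword/list">C</a>
</div>"""

def hrefs_containing(root, needle):
    """Return href values that contain `needle`, ignoring case."""
    needle = needle.lower()
    return [
        a.get("href")
        for a in root.iter("a")
        if needle in (a.get("href") or "").lower()
    ]

matches = hrefs_containing(ET.fromstring(DOC), "keyword")
print(matches)
```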
xml  xsl  xpath  python  scrapy  examplecode  textprocessing 
october 2019 by kme
shell - Unable to locate and replace an element with its classname using bash script? - Stack Overflow
The actual solution is
Clean it up into valid XHTML with tidy

I used
<code class="language-bash">tidy -f /dev/null -w 0 -n -q -asxhtml</code>
in a pipe to suppress all the extraneous warnings, and get XML that something like XMLStarlet could handle.
html  xhtml  xml  tidy  importexport  datamunging  solution  dammitbrain  fuckina 
september 2019 by kme
Converting Binary Plists - ForensicsWiki
<code class="language-bash">plutil -convert xml1 file.plist</code>
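Where 'plutil' isn't available (it's a macOS tool), Python's standard-library plistlib can probably do the same conversion. A sketch under that assumption; the dictionary stands in for a real binary plist file:

```python
import plistlib

# Hypothetical data standing in for the contents of a binary .plist file.
original = {"Name": "example", "Count": 3}
binary_blob = plistlib.dumps(original, fmt=plistlib.FMT_BINARY)

# loads() detects the binary format on its own; FMT_XML writes the XML flavor.
parsed = plistlib.loads(binary_blob)
xml_text = plistlib.dumps(parsed, fmt=plistlib.FMT_XML).decode("utf-8")
print(xml_text)
```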
plist  propertylist  forensics  xml  macos  mac  solution 
september 2019 by kme
xml - Using xmllint and xpath with a less-than-perfect HTML document? - Stack Overflow
'xidel' (http://www.skynet.be) seems to handle malformed HTML better than 'xmllint --html' or 'xmlstarlet' even after passing through 'tidy'.
xml  xpath  webscraping  malformedhtml  maybesolution 
may 2019 by kme
benibela/xidel: A command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern templates. It can also create new or transformed XML/HTML/JSON documents.
This tool seems to be able to deal with malformed HTML that 'xmllint' and 'xmlstarlet' choke on (even after a pass through 'tidy').
xml  html  webscraping  webdevel  api  testing  alternativeto  xmllint  xmlstarlet 
may 2019 by kme
python - parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) - Stack Overflow | https://stackoverflow.com/
You are using the decoded unicode value. Use r.raw raw response data instead:
<code class="language-python">r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)</code>

which will read the data from the response directly; do note the stream=True option to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
<code class="language-python">r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)</code>

XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.
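A small illustration of the bytes-in rule, using the standard-library ElementTree as a stand-in for lxml; the sample document is made up:

```python
import xml.etree.ElementTree as ET

# The encoding declaration tells the parser how to decode the bytes.
raw = '<?xml version="1.0" encoding="iso-8859-1"?><name>café</name>'.encode("iso-8859-1")

# Hand the parser the raw bytes and let it do the decoding.
element = ET.fromstring(raw)
print(element.text)
```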
xml  requests  elementtree  lxml  python  webdevel  errormessage  solution 
july 2018 by kme
xpath - Why doesn't xmlstarlet select all nodes? - Stack Overflow | https://stackoverflow.com/
Is this what you need?
<code class="language-bash">xml sel -t -m "//@category" -v "." -o " " books.xml</code>

or to separate the results on each line
<code class="language-bash">xml sel -t -m "//@category" -v "." -n books.xml</code>
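For comparison, roughly the same selection in Python with the standard-library ElementTree, which has no '//@attr' axis, so the attribute test is done by hand. The sample document is a hypothetical stand-in for books.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml.
BOOKS = """<bookstore>
  <book category="cooking"><title>A</title></book>
  <book category="children"><title>B</title></book>
  <book category="web"><title>C</title></book>
</bookstore>"""

root = ET.fromstring(BOOKS)

# Roughly //@category: every 'category' attribute, in document order.
categories = [el.get("category") for el in root.iter() if "category" in el.attrib]

print(" ".join(categories))   # space-separated, similar to -o " "
print("\n".join(categories))  # one per line, similar to -n
```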
xml  xpath  xmlstarlet  webdevel  samplecode  patternmatching 
march 2018 by kme
command line - Why does XMLStarlet return additional whitespace/newline for XML text nodes? - Super User | https://superuser.com/
I thought I got this to work, but in the end I still had to pipe it all through
<code>tr -ds '[:blank:]\r' '' | sed '/^$/d'</code>
<code>xmlstarlet sel -T -t -v "//node/item" file.xml</code>

and it outputs the content of
<code class="language-xml"><node><item>content</item></node></code>

as text without additional whitespace.
xml  xslt  xpath  whitespace  textprocessing  webdevel  almost  solution 
march 2018 by kme
xpath expression to remove whitespace - Stack Overflow | https://stackoverflow.com/
The 'translate' function doesn't seem to work on node lists, though.

Note that you can just use 'xmlstarlet sel -T -t -v //xpath/expression' if you have XmlStarlet available. Otherwise...
I. Use this single XPath expression:
<code class="language-xpath">translate(normalize-space(/tr/td/a), ' ', '')</code>

Explanation:

normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.

translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.

II. Alternatively:
<code class="language-xpath">translate(/tr/td/a, ' &#9;&#10;&#13;', '')</code>
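If the text ends up in a script anyway, the two XPath functions are easy to approximate in Python. A sketch only: the helper names are made up, and str.split() treats more characters as whitespace than XPath's four:

```python
def normalize_space(text):
    # Trim both ends and collapse internal whitespace runs to a single space.
    # Note: split() also splits on other Unicode whitespace, unlike XPath.
    return " ".join(text.split())

def strip_whitespace(text):
    # Delete space, tab, LF and CR, like translate() with an empty replacement.
    return text.translate({ord(ch): None for ch in " \t\n\r"})

sample = "  foo \t bar\r\n baz  "
print(normalize_space(sample))
print(strip_whitespace(sample))
```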

xml  xslt  xpath  whitespace  textprocessing  webdevel  maybesolution 
march 2018 by kme
html - XPath with optional element in hierarchy - Stack Overflow | https://stackoverflow.com/
My answer was to just use two piped XPath expressions. This seems to work if you have something like 'saxon-lint' that understands it, though:
In XPath 2.0, the optional step can be expressed as (tbody|.).
<code>//table[@id="foo"]/(tbody|.)/tr</code>
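With an XPath 1.0-ish engine, the optional step can be emulated by running both paths and merging the results. A sketch with the standard-library ElementTree; the sample markup and the rows helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one table with <tbody>, one without.
DOC = """<div>
  <table id="foo"><tbody><tr><td>1</td></tr></tbody></table>
  <table id="bar"><tr><td>2</td></tr></table>
</div>"""

def rows(root, table_id):
    """Return <tr> elements directly under the table or under its <tbody>."""
    found = []
    for table in root.findall(f".//table[@id='{table_id}']"):
        found.extend(table.findall("tr"))
        found.extend(table.findall("tbody/tr"))
    return found

root = ET.fromstring(DOC)
foo_cells = [tr.find("td").text for tr in rows(root, "foo")]
bar_cells = [tr.find("td").text for tr in rows(root, "bar")]
print(foo_cells, bar_cells)
```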
xpath  xpath2  patternmatching  xml  webdevel  solution 
march 2018 by kme
html - Testing text() nodes vs string values in XPath - Stack Overflow | https://stackoverflow.com/
XPath text() = is different than XPath . =

(Matching text nodes is different than matching string values)

The following XPaths are not the same...
<code class="language-xpath">//span[text() = 'Office Hours']</code>

Says: Select the span elements that have an immediate child text node equal to 'Office Hours'.

[whereas]
<code class="language-xpath">//span[. = 'Office Hours']</code>

Says: Select the span elements whose string value is equal to 'Office Hours'.
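The difference can be approximated in Python with the standard-library ElementTree. This is only an approximation: Element.text holds just the text before the first child element, whereas text() can match any immediate text node. The sample markup is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second span keeps its text inside a child element.
DOC = """<div>
  <span>Office Hours</span>
  <span><b>Office Hours</b></span>
</div>"""

spans = ET.fromstring(DOC).findall("span")

# Approximates text() = '...': only the text directly inside the element.
by_text_node = [s for s in spans if s.text == "Office Hours"]

# Approximates . = '...': the concatenated text of all descendants.
by_string_value = [s for s in spans if "".join(s.itertext()) == "Office Hours"]

print(len(by_text_node), len(by_string_value))
```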
xpath  xml  syntax  textcontent  thisvsthat  patternmatching  newbie  webdevel  reference  explained 
february 2018 by kme
xml - How to use XPath contains() here? - Stack Overflow | https://stackoverflow.com/
Example from the comments

<code class="language-xpath">//ul[@class='featureList' and ./li[contains(.,'Model')]]</code>
xpath  xml  webdevel  reference  patternmatching 
february 2018 by kme
xml - Get xmllint to output xpath results \n-separated, for attribute selector - Stack Overflow | https://stackoverflow.com/
You can try:

<code class="language-bash">
$ xmllint --shell inputfile <<< `echo 'cat /config/*/@*'`
</code>

You might need to grep the output, though, so as to filter the undesired lines.
xml  xmllint  xpath  textprocessing  webdevel  solution 
january 2018 by kme
xpath @ZVON.org | http://zvon.org/
The tutorial here is great, and all the examples work out-of-the-box with 'xmllint --shell'.
xpath  xslt  xml  webdevel  tutorial  reference  solution  deadlink 
january 2018 by kme
xml - Access attribute specified by XPath with xml_grep - Stack Overflow | https://stackoverflow.com/
I suggest using the xpath program instead:
<code>
$ xpath x '//entry/@path'
Found 1 nodes:
-- NODE --
path="trunk"
</code>

This program should come bundled with XML::XPath.
xml  commandline  cli  textprocessing 
january 2018 by kme
xml_grep2 - search.cpan.org
This is called "App::Xml_grep2" on CPAN.
-t, --text-only

Return the result as text (using the XPath value of nodes). Results are stripped of newlines and output 1 per line.

Results are in the original encoding for the document.
perl  xml  grep  xpath  html  webdevel  textprocessing 
december 2017 by kme
XML::Twig - A perl module for processing huge XML documents in tree mode. - metacpan.org - https://metacpan.org/
So the 'xml_grep --text_only --html' options seem to work a treat; these come with the 'xml-twig-tools' package on Debian/Ubuntu.

Quick start:

<code class="language-bash">
> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange
</code>

Another useful example, I used when making a GeSHi syntax for ~/.ssh/config:

<code class="language-bash">
curl https://man.openbsd.org/ssh_config | \
xml_grep --html --text_only 'dt[@class="It-tag"]/*/b'
</code>

No matter what I do, I can't get the 'text()' XPath selector to work, even though it's part of the XPath 1.0 standard. That's why '-t' or '--text_only' seems to be necessary, to return the text content of the matched nodes.

The '--html' is also pretty essential, because I think it runs the input through a more liberal HTML parser and produces a valid (X)HTML tree, so you don't get XML parse errors like you do for just about every page on the Internet, it seems like.

On that note, now I finally see the value of "conformant" XHTML, and I'm wondering: why did XHTML cease to be a thing?

Example:
<code>curl -q https://packages.debian.org/wheezy/keepassx | xml_grep --text_only --html '//*[@id="pdeps"]/ul/li//a'</code>
xpath  xslt  xml  parser  parsing  perl  module  library  commandline  cli  solution  fuckina 
october 2017 by kme
html - ignore malformed XML with Perl-XML - Stack Overflow - https://stackoverflow.com/
xml_grep, a command line tool which comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).

For example:

<code class="language-bash">
> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange
</code>
xpath  xml  parser  textprocessing  webdevel  commandline  cli  maybesolution 
october 2017 by kme
bash - Using xmlstarlet to extract HTML - Stack Overflow - https://stackoverflow.com/
-v means --value-of which is the contents of tags. You should use -c or --copy-of to get the tags themselves.

<code class="language-bash">xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html</code>
Or just
<code class="language-bash">xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html</code>
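The value-of versus copy-of distinction, sketched in Python with the standard-library ElementTree. The document is a made-up, already well-formed stand-in for wiki.html:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for wiki.html (already well-formed).
DOC = '<html><body><div id="mw-content-text"><p>Hello <b>world</b></p></div></body></html>'

div = ET.fromstring(DOC).find(".//div[@id='mw-content-text']")

value_of = "".join(div.itertext())              # text only, similar to -v
copy_of = ET.tostring(div, encoding="unicode")  # markup included, similar to -c

print(value_of)
print(copy_of)
```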
xmlstarlet  xml  xpath  parsing  textprocessing  webdevel  solution 
october 2017 by kme
Start working with XMLStarlet - https://www.ibm.com/
Learn how to use the XMLStarlet command-line utility to format, transform, fix, and edit XML using a set of simple commands. Jack Herrington shows you how easy it is to get up and running -- and simplify your life -- with this powerful tool.
xml  xslt  parsing  webdevel  xmlstarlet  tutorial 
october 2017 by kme
Grep and Sed Equivalent for XML Command Line Processing - Stack Overflow - https://stackoverflow.com/
Accepted answer recommends XMLStarlet, but links to this handy tutorial: https://www.ibm.com/developerworks/library/x-starlet/index.html.

Also:
To Joseph Holsten's excellent list, I add the xpath command-line script which comes with Perl library XML::XPath. A great way to extract information from XML files:

<code class="language-bash">xpath -q -e '/entry[@xml:lang="fr"]' *xml</code>
xml  xpath  xslt  cli  commandline  textprocessing  list  recommendation 
october 2017 by kme
xml - How to execute XPath one-liners from shell? - Stack Overflow - https://stackoverflow.com/
What on earth do all these options do?
<code class="language-bash">xmlstarlet sel -T -t -m '//element/@attribute' -v '.' -n filename.xml</code>
Nokogiri. If I write this wrapper I could call the wrapper in the way described above:

<code class="language-ruby">
#!/usr/bin/ruby

require 'nokogiri'

Nokogiri::XML(STDIN).xpath(ARGV[0]).each do |row|
  puts row
end
</code>
XML::XPath. Would work with this wrapper:
<code class="language-perl">
#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;

# Pass a reference to the STDIN glob as the open filehandle.
my $root = XML::XPath->new(ioref => \*STDIN);
for my $node ($root->find($ARGV[0])->get_nodelist) {
    print($node->getData, "\n");
}
</code>


Also:
<code class="language-bash">
xmllint --xpath '//element/@attribute' file.xml
xmlstarlet sel -t -v "//element/@attribute" file.xml
saxon-lint --xpath '//element/@attribute' file.xml
</code>

In Python with lxml:
So to do the same for normal Web content—HTML docs that aren’t necessarily well-formed XML:
<code class="language-bash">
echo "<p>foo<div>bar</div><p>baz" | python3 -c "from sys import stdin; \
from lxml import html; \
print('\n'.join(html.tostring(node, encoding='unicode') for node in html.parse(stdin.buffer).xpath('//p')))"
</code>

And to instead use html5lib (to ensure you get the same parsing behavior as Web browsers—because like browser parsers, html5lib conforms to the parsing requirements in the HTML spec).
<code class="language-bash">
echo "<p>foo<div>bar</div><p>baz" | python3 -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin.buffer, treebuilder='lxml', namespaceHTMLElements=False); \
print('\n'.join(html.tostring(node, encoding='unicode') for node in doc.xpath('//p')))"
</code>
xpath  xml  xslt  parsing  ruby  textprocessing  webdevel  commandline  cli  oneliner  list  recommendation  samplecode 
october 2017 by kme
html - RegEx match open tags except XHTML self-contained tags - Stack Overflow
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with RegEx.

But many will try, some will even claim success - but only until others find the fault and totally mess you up.
textprocessing  webdevel  xml  parser  parsing  regex  funny  html  zalgowillcomeforyou 
june 2017 by kme
python - error: command 'gcc' failed with exit status 1 on CentOS - Stack Overflow
I bet you have to install libxml2-devel or libxml++-devel or even python-devel.


So this would've worked:
<code class="language-bash">
$ sudo yum -y install gcc gcc-c++ kernel-devel
$ sudo yum -y install python-devel libxslt-devel libffi-devel openssl-devel
$ pip install "your python package"
</code>
python  xml  library  errormessage  missinglibraries  centos6  centos  solution 
may 2017 by kme
xpath - Pattern Matching with XSLT - Stack Overflow


Regular Expressions are supported only in XSLT 2.x/XPath 2.x.

As at this date, no publicly available browser supports XSLT 2.x/XPath 2.x.

In your concrete case you can use:

starts-with('awesome','awe')

other useful XPath 1.0 functions are:

contains()

substring()

substring-before()

substring-after()

normalize-space()

translate()

string-length()

scraping  screenscraping  webscraping  xpath  xslt  xml  html  patternmatching  regex  solution 
may 2017 by kme
XPath and XSLT with lxml
<code class="language-python">
>>> from lxml import etree
>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
...                    namespaces={'re':regexpNS})

>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print(find(root)[0].text)
aBc
</code>
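Without lxml's EXSLT support, a similar test can be done by filtering in Python with the 're' module. A sketch using only the standard library, reusing the sample document from the example above:

```python
import re
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><a>aB</a><b>aBc</b></root>")
pattern = re.compile(r"^abc$", re.IGNORECASE)

# Keep elements whose full text content matches the pattern.
matches = [el for el in root.iter() if pattern.search("".join(el.itertext()))]
matched_tags = [el.tag for el in matches]
print(matched_tags)
```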
python  scraping  webscraping  parser  html  xml  xslt  solution 
may 2017 by kme
bash XHTML parsing using xpath - Stack Overflow
I think maybe 'xpath' comes with the Perl XML::Parser library. Not sure.
shell  commandline  bash  shellscripting  xml  parser 
april 2017 by kme
How do I parse XML in Python? - Stack Overflow
lxml's objectify seems to do what I want it to do, for a well-formed XML file like a KeePass XML export.
<code class="language-python">
from lxml import objectify
from collections import defaultdict

count = defaultdict(int)

# 'text' holds the XML document as a string (or bytes)
root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))
</code>
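The same tally can be sketched without lxml, using the standard-library ElementTree and a Counter. The sample document is hypothetical, shaped to match the 'bar', 'type' and 'foobar' names in the example:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: <bar> holding <type foobar="..."> items.
text = '<foo><bar><type foobar="1"/><type foobar="1"/><type foobar="2"/></bar></foo>'

root = ET.fromstring(text)

# Tally the 'foobar' attribute across all bar/type elements.
count = Counter(item.get("foobar") for item in root.findall("./bar/type"))
print(dict(count))
```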
xml  python  parser  howto  newbie  solution 
april 2017 by kme
Xidel - HTML/XML/JSON data extraction tool
Easier to use than some of its counterparts:

<code class="language-bash">xidel https://lithub.com/the-ultimate-best-books-of-2018-list/ -e //title</code>
Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.

Mirrored (?) at: http://videlibri.sourceforge.net/xidel.html
Source code at: https://github.com/benibela/xidel
FAQ at: https://github.com/benibela/xidel/wiki/Frequently-asked-questions
macOS build instructions: http://bit.ly/34VzheD (evernote.com)
textprocessing  html  xml  json  datamunging  extraction  importexport  cli  commandline  utility  software  xpath  xslt  alternativeto  xmlstarlet 
march 2016 by kme
stroke-linejoin - SVG | MDN
stroke-linejoin="round"


Inkscape makes this a property in "style=". Neither one seems to work in the browser for paths.
xml  svg  webdevel  webdesign  maybesolution 
july 2014 by kme
html - Linking to CSS in an SVG embedded by an IMG tag - Stack Overflow
For security reasons images must be standalone files. You can use CSS if you encode the stylesheet as a data uri. E.g.

<code class="language-xml">
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<?xml-stylesheet type="text/css" href="data:text/css;charset=utf-8;base64,cmVjdCB7IA0KICAgIHN0cm9rZTogYmxhY2s7DQogICAgZmlsbDogZ3JlZW47DQp9" ?>
<svg version="1.1"
     xmlns="http://www.w3.org/2000/svg"
     viewBox="0 0 100 100"
>
  <rect x="10" y="10" width="80" height="80" />
</svg>
</code>

There are various online converters for data URIs.
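An online converter isn't strictly needed; the encoding is plain base64. A minimal sketch in Python, where the css_data_uri helper and the sample stylesheet are made up for illustration:

```python
import base64

def css_data_uri(css):
    """Build a base64 data: URI for a stylesheet."""
    encoded = base64.b64encode(css.encode("utf-8")).decode("ascii")
    return "data:text/css;charset=utf-8;base64," + encoded

css = "rect { stroke: black; fill: green; }"
uri = css_data_uri(css)

# The URI goes into the href of the xml-stylesheet processing instruction.
print(f'<?xml-stylesheet type="text/css" href="{uri}" ?>')
```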
xml  svg  webdevel  embed  webdesign  graphics  solution 
july 2014 by kme
Displaying an SVG image coherently in all decent Browsers
The final result that did the trick was the following:

Make sure the SVG specifies only a viewBox, with no width and height, and make sure the aspect ratio is preserved, like so:

<code class="language-xml">
<svg version="1.1" xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     viewBox="0 0 1024 768"
     preserveAspectRatio="xMidYMid meet">
</code>

Use the <object> tag to embed the image (not <img>), set the width and height of the object to 100%:

<code class="language-html">
<object height="100%" width="100%"
        data="/path/to/your/image.svg" type="image/svg+xml">
  <!-- put content for old browsers or IE here, it
       will get displayed instead of the SVG. You can
       embed a raster version of the image here, but remember it
       will get loaded by the SVG capable browsers as well,
       even if they don't display it! That may slow your
       page down significantly. -->
</object>
</code>

Wrap the object in a div with which you control the actual size. You should be able to use absolute & relative sizes. I use em like so:

<code class="language-html">
<div style="width:10em;height:5em">
  <object ....
  </object>
</div>
</code>
xml  svg  webdevel  browserquirks  solution 
july 2014 by kme
data: URI Generator
Was trying to embed a Google font CSS in an SVG image and I think this worked.
xml  css  datauri  images  generator  webapp  solution 
july 2014 by kme
Mozilla SVG Project Frequently Asked Questions
This is an XML debugging message to help XML authors correct errors in their XML documents. Mozilla will show this message when there's an XML well-formedness error in the file it tried to load. (It doesn't mean there's an error in Mozilla.) There are many different XML errors, but the most common one in SVG files is "XML Parsing Error: prefix not bound to a namespace". This is (almost certainly) because the 'xlink:href' attribute has been used in the file without including the following two namespace bindings on the root <svg> tag.

<code class="language-xml">
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
</code>
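The same well-formedness error can be reproduced outside the browser. A quick check with Python's standard-library parser; the 'use' element and the is_well_formed helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

SVG_NS = 'xmlns="http://www.w3.org/2000/svg"'
XLINK_NS = 'xmlns:xlink="http://www.w3.org/1999/xlink"'
BODY = '<use xlink:href="#shape"/>'

def is_well_formed(xml_text):
    """Return True if the stdlib parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Same content, with and without the xlink namespace binding on the root.
without_xlink = f"<svg {SVG_NS}>{BODY}</svg>"
with_xlink = f"<svg {SVG_NS} {XLINK_NS}>{BODY}</svg>"

print(is_well_formed(without_xlink))
print(is_well_formed(with_xlink))
```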
xml  svg  validation  errormessage  solution 
july 2014 by kme
hpricot/hpricot · GitHub
obsolescent: use nokogiri instead
hpricot  xml  html  parser  library  ruby  webdevel 
may 2013 by kme