kme : webscraping   16

pup crashes upon launching · Issue #123 · ericchiang/pup
<code class="language-bash">go get github.com/ericchiang/pup</code>
pup  html  parser  webscraping  segfault  errormessage  solution  macos  darwin 
4 weeks ago by kme
xml - Using xmllint and xpath with a less-than-perfect HTML document? - Stack Overflow
'xidel' (http://www.skynet.be) seems to handle malformed HTML better than 'xmllint --html' or 'xmlstarlet' even after passing through 'tidy'.
xml  xpath  webscraping  malformedhtml  maybesolution 
may 2019 by kme
benibela/xidel: A command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern templates. It can also create new or transformed XML/HTML/JSON documents.
This tool seems to be able to deal with malformed HTML that 'xmllint' and 'xmlstarlet' choke on (even after a pass through 'tidy').
A command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern templates. It can also create new or transformed XML/HTML/JSON documents. - benibela/xidel
xml  html  webscraping  webdevel  api  testing  alternativeto  xmllint  xmlstarlet 
may 2019 by kme
xpath - Pattern Matching with XSLT - Stack Overflow


Regular Expressions are supported only in XSLT 2.x/XPath 2.x.

As of this writing, no publicly available browser supports XSLT 2.x/XPath 2.x.

In your concrete case you can use:

starts-with('awesome','awe')

Other useful XPath 1.0 functions are:

contains()
substring()
substring-before()
substring-after()
normalize-space()
translate()
string-length()
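The XPath 1.0 string functions above can be tried out from Python with lxml (already bookmarked below); a minimal sketch, with a made-up sample document for illustration:

```python
# Demonstrate a few XPath 1.0 string functions via lxml
# (the <ul>/<li> sample markup here is invented for illustration).
from lxml import etree

root = etree.XML("<ul><li> awesome </li><li>average</li></ul>")

# starts-with() and normalize-space() used as a node-set predicate.
hits = root.xpath("//li[starts-with(normalize-space(.), 'awe')]")
print(hits[0].text.strip())                                # -> awesome

# String functions can also be evaluated as bare expressions.
print(root.xpath("substring-before('awesome', 'some')"))   # -> awe
print(root.xpath("translate('awesome', 'aeo', 'AEO')"))    # -> AwEsOmE
print(root.xpath("string-length('awesome')"))              # -> 7.0 (lxml returns XPath numbers as floats)
```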

scraping  screenscraping  webscraping  xpath  xslt  xml  html  patternmatching  regex  solution 
may 2017 by kme
XPath and XSLT with lxml
>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
... namespaces={'re':regexpNS})

>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print(find(root)[0].text)
aBc
python  scraping  webscraping  parser  html  xml  xslt  solution 
may 2017 by kme
How to "log in" to a website using Python's Requests module? - Stack Overflow
Firstly, as Marcus did, check the source of the login form to get three pieces of information - the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.

Once you've got that, you can use a requests.Session() instance to make a post request to the login url with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally, it simply adds persistence, allowing you to store and use cookies etc.

Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.

Example:
<code class="language-python">
import requests

# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'username',
    'inUserPass': 'password'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('LOGIN_URL', data=payload)
    # Print the HTML returned (or do something more intelligent)
    # to see whether the login succeeded.
    print(p.text)

    # An authorised request, reusing the session's login cookie.
    r = s.get('A protected web page url')
    print(r.text)
    # etc...
</code>
python  webscraping  scraping  requests  webdevel  solution 
may 2017 by kme
