
kme : textprocessing   347

Paste to Markdown
Uses Turndown (JavaScript library). Good for Mac where 'pbpaste' can't be coerced into outputting raw HTML in a pipe to, say, 'pandoc'.
html  rtf  markdown  conversion  textprocessing  webapp  utility  solution 
5 weeks ago by kme
Useful Unix commands for data science
Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.

OK, but I'd mention the useless use of 'cat' to anyone learning from this guide. Alternatives:
<code class="language-bash">
<data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' data.csv
</code>
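A quick sanity check that all three spellings give the same total. This is only a sketch: the sample records and the variable names are made up, and the separator and column number simply follow the commands above.

```shell
# Made-up sample data in a temporary file (not from the article).
data=$(mktemp)
printf '%s\n' 'a|b|c|1.50' 'd|e|f|2.25' 'g|h|i|3.00' > "$data"

prog='{ sum += $4 } END { printf "%.2f\n", sum }'

via_cat=$(cat "$data" | awk -F '|' "$prog")     # the "useless cat" form
via_redirect=$(<"$data" awk -F '|' "$prog")     # redirection written first
via_arg=$(awk -F '|' "$prog" "$data")           # file passed as an argument

echo "$via_cat $via_redirect $via_arg"
rm -f "$data"
```

The redirect and argument forms save one process and one pipe, which is the whole point of the complaint.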
unix  textprocessing  datascience  commandline  reference  newbie 
7 weeks ago by kme
text processing - how to massage or format html in order to parse with xmstarlet? - Unix & Linux Stack Exchange
Pretty key when the input is HTML but not XHTML:
<code class="language-bash">xmlstarlet fo -H -R </code>
xmlstarlet  malformed  html  webdevel  textprocessing  commandline  cli  solution 
8 weeks ago by kme
xml - how to? xmlstarlet to extract HTML data by id - Stack Overflow
Essential tip for namespaced HTML, otherwise you get... NOTHING out of 'xmlstarlet'

Just passing HTML through 'xml fo -H -R' (process as HTML and recover as much as possible) is enough to get un-namespaced HTML that is also valid XML (source:

The html data has a default namespace that you have to declare in the xmlstarlet command:
<code class="language-bash">
xmlstarlet sel \
-N n="" \
-t \
-c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>

UPDATE: I didn't know it, but as the error message says, there is no need to declare the namespace when it's the default one, so this also works:
<code class="language-bash">
xmlstarlet sel \
-t \
-c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>
xml  xmlstarlet  textprocessing  malformed  html  reference  namespaced  xhtml  solution  fuckina 
8 weeks ago by kme
print a file slowly
This works, whereas 'slowcat.c' didn't for some reason (on macOS).
unix  linux  terminal  textprocessing  utility  software 
9 weeks ago by kme
How to add separator in powershell - Stack Overflow
Basically, use 'export-csv' or 'convertto-csv' with a '-delimiter' option.
csv  windows  powershell  shellscripting  textprocessing  newbie  solution 
january 2020 by kme
How do I concatenate strings and variables in PowerShell? - Stack Overflow
This was my solution:

<code class="lang-powershell">gci -recurse -include *.exe,*.msi |
%{ get-filehash $_ -a SHA1 } |
%{ write-host $(
$_.hash + " " +
$(Resolve-Path -Relative $_.path | %{$_ -split "\\" -join "/"})
) }</code>
powershell  stringmanipulation  textprocessing  shellscripting  syntax  solution 
january 2020 by kme
In Powershell, how do I concatenate one property of an array of objects into a string? - Stack Overflow
What I was actually looking for was like Awk with a specific RS. I ended up using `write-host` in a subexpression, something like this:

<code class="language-powershell">
gci -recurse -include *.exe,*.msi |
%{ get-filehash $_ -a SHA1 } |
%{ write-host $(
$_.hash + " " +
$(Resolve-Path -Relative $_.path | %{$_ -split "\\" -join "/"})
) }
</code>

I think either 'convertto-csv' or 'export-csv' with '-delimiter' might work, though.

Note that 'get-filehash' doesn't seem to work on Windows 7 (as shipped, even with all current updates). It was introduced in PowerShell 4.0, which first shipped with Windows 8.1 and Windows Server 2012 R2.
windows  shellscripting  powershell  stringmanipulation  textprocessing  sortof  solution 
january 2020 by kme
PowerShell command to split from output - Stack Overflow
<code class="language-powershell">
Get-Content Carlist.txt | select-string carId | Foreach-Object { $_.ToString().split(':')[1] -replace '\s','' }
</code>
powershell  windows  textprocessing  shellscripting  solution 
january 2020 by kme
encoding - Using PowerShell to write a file in UTF-8 without the BOM - Stack Overflow
<code class="language-powershell">
Get-Content path/to/file.ext | out-file -encoding ASCII targetFile.ext
</code>
utf16  utf8  bom  utf  unicode  powershell  textprocessing  solution 
january 2020 by kme
Unicode Utilities
This package has an open bug for being removed from Debian because it hard-codes the Unicode 5.1 standard in the binary (we're in the 12s now, so this is probably pre-emoji). Proposed alternative from this bug report (
uniname defaults to printing the character offset of each character, its byte offset, its hex code value, its encoding, the glyph itself, and its name. Command line options allow undesired information to be suppressed and the Unicode range to be added. Other options permit a specified number of bytes or characters to be skipped. For example, the default output for this text:

unidesc reports the character ranges to which different portions of the text belong. It can also be used to identify Unicode encodings (e.g. UTF-16be) flagged by magic numbers. Here is the output when given the above Japanese text as input:

ExplicateUTF8 is intended for debugging or for learning about Unicode. It determines and explains the validity of a sequence of bytes as a UTF8 encoding. Here is the output when given the above Japanese text as input:

Utf8lookup is a shell script which invokes uniname to provide an easy way to look up the character name corresponding to a codepoint from the command line. In addition to uniname it requires the utility Ascii2binary.

Unireverse is a filter that reverses UTF-8 strings character-by-character (as opposed to byte-by-byte). This is useful when dealing with text that is not encoded in the order in which you want to display it or analyze it. For example, if you want to display Arabic on a terminal window that does not support bidi text, Unirev will put it into the normal display order.

Unifuzz generates test input for programs that expect Unicode. It can generate a random string of characters, tokens of various potentially problematic characters and sequences, very long lines, strings with embedded nulls, and ill-formed UTF-8. Use it to find out whether your program reacts gracefully when given unexpected or ill-formed input.
unix  linux  unicode  textprocessing  decode  utility  sourcecode  commandline  fuckina  solution  revealcodes  nonprintingcharacters 
december 2019 by kme
Ubuntu Manpage: uniname - Name the characters in a Unicode text file
From the 'uniutils' package, apparently (via:
uniname names the characters in a Unicode text file. For each character, uniname defaults to printing the character offset, the byte offset, the hexadecimal UTF-32 character code, the encoding as a sequence of hex byte values, the glyph, and the character's Unicode name. Command line flags allow undesired information to be suppressed. Glyphs that do not display nicely, such as control characters and spaces, are not displayed. For the Latin-1 control characters, whose official Unicode name is "control", the real name is given. Character and byte offsets both start from 0.
linux  unicode  decode  textprocessing  fuckina  solution 
december 2019 by kme
How do I stop sed from adding extra newline characters - Unix & Linux Stack Exchange
GNU sed, if it's available, will not print an extraneous newline if you do 'sed -n "s/patt/repl/p"' (source:
If you do need that file not to end in a newline character, then you could use perl or other tools that can cope with non-text data.
<code class="language-bash">perl -pe 's|<LIST_G_STATEMENT>|$&\n|g'</code>
perl  sed  newlines  textprocessing  unix  annoyance  solution 
december 2019 by kme
Scrapy - xpath return parent node with content based on regex match - Stack Overflow
Not the solution I was looking for, but has a useful example of how to lower-case a text string (good idea for sorting case-insensitively).

<code class="language-python">def parse(self, response):
    for href in response.xpath('//a[contains(translate(@href,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz"),"keyword")]/@href'):
        full_url = response.urljoin(href.extract())
        yield { 'url': full_url, }</code>

Note that XPath 2.0 seems to have 'upper-case' and 'lower-case' functions which simplify this process. Also note that (as of this writing, 2019-10-23) XmlStarlet does not support these functions.
xml  xsl  xpath  python  scrapy  examplecode  textprocessing 
october 2019 by kme
Use Vim Inside A Unix Pipe Like Sed Or AWK
<code class="language-bash">
# source:
function vimify() {
  (vim - -esbnN -c "$@" -c 'w!/dev/fd/3|q!' >/dev/null) 3>&1
}
</code>
vim  pipe  textprocessing  solution 
october 2019 by kme
GitHub - gfontenot/reflow: Intelligently reflow plain text
Intelligently reflow plain text

It's just okay. The 'stack' build system is pretty nifty, though ('stack install' puts the binary in ~/.local/bin by default, which I like).
commandline  haskell  textprocessing  reflow  reformatting  markdown  alternativeto  fold  fmt  par 
july 2019 by kme
newlines - What's the point in adding a new line to the end of a file? - Unix & Linux Stack Exchange
Example: the output of GNU sort always ends with a newline. So if the file foo is missing its final newline, you'll find that sort foo | wc -c reports one more character than cat foo | wc -c.

Not necessarily the reason, but a practical consequence of files not ending with a new line:

Consider what would happen if you wanted to process several files using cat. For instance, if you wanted to find the word foo at the start of the line across 3 files:
<code class="language-bash">cat file1 file2 file3 | grep -e '^foo'</code>
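Both effects are easy to reproduce with 'printf'. A sketch with made-up file names and contents; the byte counts assume a sort that always terminates its output (GNU sort does, per the quote above):

```shell
dir=$(mktemp -d)

# A two-line file whose last line has no terminating newline.
printf 'b\na' > "$dir/foo"
sorted_bytes=$(sort "$dir/foo" | wc -c | tr -d ' ')
raw_bytes=$(cat "$dir/foo" | wc -c | tr -d ' ')
echo "sort: $sorted_bytes bytes, cat: $raw_bytes bytes"

# file1 lacks a final newline, so its last line fuses with file2's first line.
printf 'bar' > "$dir/file1"
printf 'foo\n' > "$dir/file2"
# grep -c exits non-zero when nothing matches, hence the '|| true'.
matches=$(cat "$dir/file1" "$dir/file2" | grep -c '^foo' || true)
echo "lines starting with foo: $matches"

rm -rf "$dir"
```

The 'foo' line in file2 goes missing because grep sees a single line, 'barfoo'.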
newlineterminator  unix  textfiles  textprocessing  explained 
july 2019 by kme
GitHub - tomnomnom/gron: Make JSON greppable!
gron - Make JSON greppable!

gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
go  json  textprocessing  grep  patternmatching  webdevel  debugging  troubleshooting  commandline  cli 
july 2019 by kme
GitHub - tomnomnom/gf: A wrapper around grep, to help you grep for things
gf - A wrapper around grep to avoid typing common patterns.

What? Why?

I use grep a lot. When auditing code bases, looking at the output of meg, or just generally dealing with large amounts of data. I often end up using fairly complex patterns like this one:

▶ grep -HnrE '(\$_(POST|GET|COOKIE|REQUEST|SERVER|FILES)|php://(input|stdin))' *

It's really easy to mess up when typing all of that, and it can be hard to know if you haven't got any results because there are none to find, or because you screwed up writing the pattern or chose the wrong flags.

I wrote gf to give names to the pattern and flag combinations I use all the time. So the above command becomes simply:

▶ gf php-sources
go  cli  commandline  searching  textprocessing  patternmatching  regex  grep  alternativeto  ack  ag 
july 2019 by kme
Perl Is Still The Goddess For Text Manipulation – Towards Data Science
<code class="language-perl"># insert line numbers
perl -i -ne 'printf "%04d %s", $., $_'

# print columns 1-5
perl -F"\t" -nlae'print join "\t", @F[0..4]'

# print all lines from line 18 to line 29
perl -ne 'print if $. >=18 && $. <=29'

# print all lines between two patterns
perl -ne 'print if (/^START/../^END/);'</code>
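For comparison (the entry is tagged 'likeawk'), rough awk counterparts of three of these one-liners. The function names are made up for this sketch, and the equivalence is approximate rather than exact:

```shell
# Prefix each line with a zero-padded line number.
number_lines() { awk '{ printf "%04d %s\n", NR, $0 }'; }

# Print lines 18 through 29.
lines_18_to_29() { awk 'NR >= 18 && NR <= 29'; }

# Print everything from a line starting with START to one starting with END.
between_markers() { awk '/^START/,/^END/'; }

printf 'x\nSTART\ny\nEND\nz\n' | between_markers | number_lines
```

Unlike 'perl -i', these read stdin and write stdout; nothing is edited in place.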
perl  textprocessing  oneliners  tipsandtricks  likeawk 
june 2019 by kme
GitHub - alex/csv-sql: Query your CSV files with SQL
Query your CSV files with SQL.
csv  sql  datamining  commandline  rust  textprocessing 
june 2019 by kme
bash - Pad one-digit, two-digit and three-digit numbers with zeros with sed - Stack Overflow
<code class="language-bash">$ sed -E 's/([[:digit:]]+)/000&/g;s/0+([[:digit:]]{4})/\1/g' file.txt</code>
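A small check of what the two substitutions do, next to an awk 'printf' version for the simple case. The sample numbers and function names are made up, and the awk variant assumes exactly one integer per line, which the sed version does not require:

```shell
# The sed approach from above: prepend three zeros to every run of digits,
# then strip leading zeros until only four digits remain.
pad4_sed() { sed -E 's/([[:digit:]]+)/000&/g;s/0+([[:digit:]]{4})/\1/g'; }

# Alternative sketch when each line is a single integer.
pad4_awk() { awk '{ printf "%04d\n", $1 }'; }

printf '%s\n' 7 42 123 2019 | pad4_sed
printf '%s\n' 7 42 123 2019 | pad4_awk
```

The sed version also pads numbers embedded in other text (for example file names), which is where it earns its ugliness.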
sed  textprocessing  shellscripting  solution 
may 2019 by kme
BurntSushi/xsv: A fast CSV command line toolkit written in Rust.
A fast CSV command line toolkit written in Rust.
rust  csv  textprocessing  commandline 
april 2019 by kme
command line - Install shuf on OS X? - Ask Different
You can install coreutils with brew install coreutils.

shuf will be linked as gshuf. Read the caveats when you install coreutils.
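If installing coreutils isn't an option, a rough stand-in can be built from awk, sort, and cut. This is a sketch, not a 'shuf' replacement: 'shuffle_lines' is a made-up name, and 'srand()' seeds from the clock, so two runs within the same second may produce the same order.

```shell
# Prefix each line with a random key, sort numerically on it, strip the key.
shuffle_lines() {
  awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' \
    | sort -n \
    | cut -f 2-
}

printf '%s\n' a b c d e | shuffle_lines
```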
coreutils  mac  osx  macos  macports  textprocessing  solution  likelinux 
april 2019 by kme
pipe - Piping buffer to external command in Vim - Stack Overflow
You can use :w !cmd to write the current buffer to the stdin of an external command. From :help :w_c:
vim  buffers  pipes  externalcommands  textfilter  textprocessing  dammitbrain  solution 
april 2019 by kme
use of alternation "|" in sed's regex - Super User
<code class="language-bash">echo "blia blib bou blf" | sed 's/bl\(ia\|f\)//g'</code>
For anyone else confused by this answer \| only works in gnu sed (gsed on os x) not vanilla sed (sed on os x). – Andrew Hancox Apr 4 '12 at 14:54
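A version that sidesteps the '\|' problem by switching to extended regular expressions, where '|' is plain alternation. A sketch assuming a sed that accepts '-E' (current GNU and BSD/macOS seds do); the function name is made up:

```shell
# Same substitution with ERE syntax: no backslashes on ( | ).
strip_blia_blf() { sed -E 's/bl(ia|f)//g'; }

echo "blia blib bou blf" | strip_blia_blf
```

Only 'blia' and 'blf' are removed; 'blib' and the surrounding spaces stay.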
sed  shellscripting  textprocessing  syntax  newbie  dammitbrain  linuxonly  solution 
march 2019 by kme
linux - How to use sed to remove the last n lines of a file - Stack Overflow
Yeah, it's possible in 'sed', but ugly.

<code class="language-bash">head -n -2 myfile.txt</code>
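Negative counts for 'head -n' are a GNU extension, so the line above may be rejected by BSD/macOS 'head'. A portable sketch with awk, using a small ring buffer; 'drop_last' is a made-up name and it assumes n is at least 1:

```shell
# Print all but the last n lines of stdin (n >= 1).
# Each line is held back until n newer lines have been read.
drop_last() {
  awk -v n="$1" 'NR > n { print buf[NR % n] } { buf[NR % n] = $0 }'
}

printf '%s\n' 1 2 3 4 5 | drop_last 2
```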
bash  linux  sed  shellscripting  textprocessing  solution 
march 2019 by kme
bash - How to add a carriage return with sed? - Stack Overflow
In this case, this guy was right.
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
shellscripting  sed  awk  textprocessing  whitespace  sortof  solution 
march 2019 by kme
makefile - Joining elements of a list in GNU Make - Stack Overflow

You can use the $(subst) command, combined with a little trick to get a variable that has a value of a single space:
<code class="language-makefile">
p = /usr /usr/share /lib
noop =
space = $(noop) $(noop)

all:
	@echo $(subst $(space),:,$(p))
</code>
devel  make  makefile  pathvariable  pathlikevariable  textprocessing  solution 
february 2019 by kme
GNU Parallel 20110205 - The FOSDEM Release - YouTube
Using '--pipe'

Each block of 3 is sorted, but whole output is not sorted:
<code class="language-bash">
# when used with --pipe, '-N' is the number of records to read

seq 1 10 | shuf | parallel --pipe -N 3 sort -n
# result:
# 1
# 3
# 10
# 2
# 4
# 6
</code>

Sort each group of records individually, output to files w/ '--files':
<code class="language-bash">
seq 1 10 \
| shuf \
| parallel --pipe --files -N 3 sort -n
# result:
# /tmp/sNGZtP6Kr8.par
# /tmp/q44fQdincg.par
# /tmp/CSVAPt5Ybe.par
</code>

Pipe through another parallel to merge the pre-sorted files and clean up the temp files. '-mj1' is basically like 'xargs' (all arguments on one command line); with recent versions, you may want to use '-X' instead. Note that the first parallel needs '--files' so that the second one receives file names rather than the sorted numbers.
<code class="language-bash">
seq 1 10 \
| shuf \
| parallel --pipe --files -N 3 sort -n \
| parallel -mj1 sort -nm {} ';' rm {}
</code>
gnuparallel  parallelism  sysadmin  unix  textprocessing  tipsandtricks  video 
february 2019 by kme
How to Copy Command Line Output to the Windows Clipboard
So there's a 'clip' utility on Windows that works like 'pbcopy' or 'xclip' on other OSes, but *only* for input.
windows  textprocessing  clipboard  commandline  scripting  sortof  solution  cli 
february 2019 by kme
A stupid proposal: fill stackexchange with Miller · Issue #212 · johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON - johnkerl/miller
textprocessing  importexport  conversion  delimitedtext  tabdelimited  csv  miller  sampleusage  needshelp 
january 2019 by kme
Is there an alternative to sed that supports unicode? - Unix & Linux Stack Exchange
Just use that syntax:

sed 's/馑//g' file1

Or in the escaped form:

sed "s/$(echo -ne '\u9991')//g" file1

(Note that older versions of Bash and some shells do not understand echo -e '\u9991', so check first.)
sed  unicode  textprocessing  solution 
january 2019 by kme
Question: why does PEP8 recommend leaving a blank line at the end of a .py file? : Python
I still cannot find where in PEP 8 it says this, though.
The newline character is considered a line terminator, not a line delimiter.

This is right, but I think still somewhat confusing to some people. The point is that every line should end with a newline, because as was pointed out the newline is considered a line terminator.

I think the proper way to think of it is not to think of it as a blank line at the end of the file - that blank line appears in text editors, but if you were to open the file in python or most programming languages and do readlines, you wouldn't see any blank line at the end. The newline would be the last character in the last line which would be the line before the apparent blank line.
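The terminator-versus-delimiter distinction shows up directly in 'wc -l', which counts newline characters rather than "lines". A tiny sketch with made-up input:

```shell
# Two lines, both terminated.
with_nl=$(printf 'a\nb\n' | wc -l | tr -d ' ')

# Same text, but the last line is missing its terminator.
without_nl=$(printf 'a\nb' | wc -l | tr -d ' ')

echo "terminated: $with_nl, unterminated: $without_nl"
```

The second count comes up one short because the final 'b' was never terminated.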
pep8  codestyle  linter  newline  textprocessing  explained 
december 2018 by kme
How original is Miller?
No, really, why one more command-line data-manipulation tool? I wrote Miller because I was frustrated with tools like grep, sed, and so on being line-aware without being format-aware. The single most poignant example I can think of is seeing people grep data lines out of their CSV files and sadly losing their header lines. While some lighter-than-SQL processing is very nice to have, at core I wanted the format-awareness of RecordStream combined with the raw speed of the Unix toolkit. Miller does precisely that.
butwhy  unix  commandline  textprocessing  onethingwell  philosophy  explained 
december 2018 by kme
How can I tell if STDIN is connected to a terminal in Perl? - Stack Overflow
<code class="language-perl">if (-t STDIN) {
    # stdin is connected to a terminal
} else {
    # stdin is not connected to a terminal (e.g. a pipe or a file)
}</code>
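The same test exists in the shell as '[ -t 0 ]' (file descriptor 0 is stdin). A sketch; the function name is made up:

```shell
# Report whether stdin is attached to a terminal.
stdin_kind() {
  if [ -t 0 ]; then
    echo "terminal"
  else
    echo "not a terminal"
  fi
}

# Fed from a pipe, so stdin is not a terminal here.
echo "some input" | stdin_kind
```

Handy for scripts that should read piped input when there is some and print usage otherwise.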
perl  devel  stdin  textprocessing  pipes  solution 
november 2018 by kme
macos - Case-insensitive ls sorting in Mac OSX - Ask Different
I needed sort's '-V' (version string sort) option, and here's how I got that:
Install the GNU Coreutils package:

sudo port install coreutils
sorting  textprocessing  bsd  mac  osx  shellscripting  workaround  solution 
november 2018 by kme
loop over characters in input string using awk - Stack Overflow
You can convert a string to an array using split:
<code class="language-bash">
echo "here is a string" | awk '
{
    split($0, chars, "")
    for (i = 1; i <= length($0); i++) {
        printf("%s\n", chars[i])
    }
}'
</code>

This prints the characters vertically, one per line.
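Splitting on an empty separator is, as far as I know, not guaranteed by POSIX awk, even though the common implementations accept it. A sketch of the same loop using 'substr' instead; the function name is made up, and with multibyte text the result depends on the awk and locale in use:

```shell
# Print each character of every input line on its own line.
chars_vertical() {
  awk '{ n = length($0); for (i = 1; i <= n; i++) print substr($0, i, 1) }'
}

echo "here is a string" | chars_vertical
```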
awk  arrays  textprocessing  shellscripting  solution 
november 2018 by kme