8 weeks ago by kme
This package has an open Debian bug requesting its removal because it hard-codes the Unicode 5.1 standard in the binary (we're in the 12s now, so this is probably pre-emoji). Proposed alternative from the bug report (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=930315): http://kassiopeia.juls.savba.sk/~garabik/software/unicode/
unix linux unicode textprocessing decode utility sourcecode commandline fuckina solution revealcodes nonprintingcharacters
uniname defaults to printing the character offset of each character, its byte offset, its hex code value, its encoding, the glyph itself, and its name. Command line options allow undesired information to be suppressed and the Unicode range to be added. Other options permit a specified number of bytes or characters to be skipped. For example, the default output for this text:
unidesc reports the character ranges to which different portions of the text belong. It can also be used to identify Unicode encodings (e.g. UTF-16be) flagged by magic numbers. Here is the output when given the above Japanese text as input:
ExplicateUTF8 is intended for debugging or for learning about Unicode. It determines and explains the validity of a sequence of bytes as a UTF-8 encoding. Here is the output when given the above Japanese text as input:
Utf8lookup is a shell script which invokes uniname to provide an easy way to look up the character name corresponding to a codepoint from the command line. In addition to uniname it requires the utility Ascii2binary.
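The kind of lookup described here can be approximated in a few lines of Python. This is only a sketch of the idea, not how Utf8lookup itself works: the function name lookup_name and the accepted input spellings are assumptions made for illustration, and the names come from whatever Unicode version the Python interpreter ships with rather than from uniname.

```python
import unicodedata


def lookup_name(codepoint_text):
    """Return the Unicode character name for a hex code point.

    Accepts spellings such as 'U+3042', '0x3042' or '3042'
    (this input format is an assumption of the sketch).
    """
    text = codepoint_text.strip().upper()
    if text.startswith("U+") or text.startswith("0X"):
        text = text[2:]
    codepoint = int(text, 16)
    # Control characters and unassigned code points have no name,
    # so fall back to a placeholder instead of raising ValueError.
    return unicodedata.name(chr(codepoint), "<no name>")


if __name__ == "__main__":
    for query in ("U+0041", "0x3042", "0000"):
        print(query, lookup_name(query))
```

Because the name table comes from the interpreter's bundled Unicode data, results for recently added characters can differ from a tool built against an older version of the standard.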
Unireverse (Unirev) is a filter that reverses UTF-8 strings character-by-character (as opposed to byte-by-byte). This is useful when dealing with text that is not encoded in the order in which you want to display it or analyze it. For example, if you want to display Arabic in a terminal window that does not support bidirectional (bidi) text, Unirev will put it into the normal display order.
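The difference between reversing by character and reversing by byte can be shown with a short Python sketch. It is an illustration under assumptions, not the tool's implementation: reverse_utf8 is a hypothetical name, and "character" is taken to mean a single code point, so a combining mark would end up in front of its base letter.

```python
def reverse_utf8(data: bytes) -> bytes:
    """Reverse UTF-8 text one code point at a time.

    Decoding first keeps every multi-byte sequence intact, which a
    plain byte-level reversal would not.
    """
    return data.decode("utf-8")[::-1].encode("utf-8")


if __name__ == "__main__":
    sample = "a\u00e9".encode("utf-8")   # 'a' followed by e-acute (2 bytes)
    print(reverse_utf8(sample))          # still valid UTF-8
    print(sample[::-1])                  # byte reversal: no longer valid UTF-8
```

Reversing the raw bytes puts the continuation byte of the two-byte sequence ahead of its lead byte, which is why the byte-level result no longer decodes.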
Unifuzz generates test input for programs that expect Unicode. It can generate a random string of characters, tokens of various potentially problematic characters and sequences, very long lines, strings with embedded nulls, and ill-formed UTF-8. Use it to find out whether your program reacts gracefully when given unexpected or ill-formed input.
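A rough Python sketch of the same categories of test input follows. It does not reproduce Unifuzz's actual output or options: the names random_text, ILL_FORMED and fuzz_cases, the sizes, and the particular ill-formed byte sequences are all choices made for this example.

```python
import random


def random_text(rng, length):
    """Build a random string of Unicode scalar values (surrogates excluded)."""
    chars = []
    while len(chars) < length:
        codepoint = rng.randrange(0x110000)
        if 0xD800 <= codepoint <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        chars.append(chr(codepoint))
    return "".join(chars)


# Byte sequences that a strict UTF-8 decoder must reject.
ILL_FORMED = [
    b"\x80",              # lone continuation byte
    b"\xc3",              # truncated two-byte sequence
    b"\xc0\xaf",          # overlong encoding of '/'
    b"\xed\xa0\x80",      # UTF-8-encoded surrogate U+D800
    b"\xf4\x90\x80\x80",  # value beyond U+10FFFF
]


def fuzz_cases(seed=0):
    """Yield byte strings covering several kinds of awkward input."""
    rng = random.Random(seed)
    yield random_text(rng, 32).encode("utf-8")  # random characters
    yield b"a\x00b"                             # embedded null
    yield b"x" * 100_000 + b"\n"                # very long line
    yield from ILL_FORMED                       # ill-formed UTF-8


if __name__ == "__main__":
    for case in fuzz_cases():
        print(len(case), case[:16])
```

Feeding each case to the program under test and checking that it neither crashes nor hangs is the intended use; the fixed seed keeps a failing run reproducible.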