unihist - Generate a histogram of the characters in a Unicode file
unihist ([option flags])


unihist generates a histogram of the characters in its input, which must be encoded in UTF-8 Unicode. By default, for each character it prints the frequency of the character as a percentage of the total, the absolute number of tokens in the input, the UTF-32 code in hexadecimal, and, if the character is displayable, the glyph itself as UTF-8 Unicode.

Command line flags allow unwanted information to be suppressed. In particular, note that by suppressing the percentages and counts it is possible to generate a list of the unique characters in the input.

Output is produced ordered by character code. To sort it in descending order of frequency, pipe the output into the command:

    sort -k1 -n -r

By default, unihist handles all of Unicode. To reduce memory usage and increase speed, it may be compiled so as to handle only the Basic Multilingual Plane (plane 0) by defining BMPONLY.
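The default behaviour described above can be illustrated with a short Python sketch. This is not the unihist implementation: the function names, the column widths, and the rule used to decide whether a character is displayable (anything outside the Unicode "C" general categories) are assumptions made for illustration only.

```python
import unicodedata
from collections import Counter


def char_histogram(text):
    """Return (percent, count, code point, glyph) rows ordered by code point."""
    counts = Counter(text)
    total = sum(counts.values())
    rows = []
    for ch in sorted(counts):  # sorting single characters orders them by code point
        # Assumption: control, format, surrogate, private-use and unassigned
        # characters (general categories starting with "C") are not displayable.
        displayable = not unicodedata.category(ch).startswith("C")
        glyph = ch if displayable else ""
        rows.append((100.0 * counts[ch] / total, counts[ch], ord(ch), glyph))
    return rows


def format_row(row):
    """Format one row; the column layout is hypothetical, not unihist's own."""
    percent, count, code, glyph = row
    return f"{percent:7.3f} {count:8d} 0x{code:06X} {glyph}".rstrip()


# Small demonstration on a literal string instead of a file.
for row in char_histogram("na\u00efve caf\u00e9\n"):
    print(format_row(row))
```

Because the percentage is the first column, the rows produced by a sketch like this could be reordered by frequency with the same sort command given above.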


-c  Suppress printing of counts and percentages.
-g  Suppress printing of glyphs.
-h  Print usage information.
-u  Suppress printing of the Unicode code as text.
-v  Print version information.


uniname (1)


Unicode Standard, version 5.0
Bill Poser billposer@alum.mit.edu
GNU General Public License

May, 2008 UNIHIST(1)

