

man page of ucto

ucto: Unicode Tokenizer




ucto [options] [input-file] [output-file]


ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenisation rules for several languages.


-c configfile
       Read settings from a file.

-d value
       Set debug mode to 'value'.

-e value
       Set input encoding. (Default: UTF8)

-f
       Disable filtering of special characters.

-L language
       Automatically select a configuration file by language code, e.g. 'fr' will select the file tokconfig-fr from the installation directory.

-l
       Convert to all lowercase.

-u
       Convert to all uppercase.

-n
       Assume one sentence per line on input.

-m
       Emit one sentence per line on output.

-P
       Disable paragraph detection.

-Q
       Disable quote detection.

-S
       Disable sentence detection.

-s <string>
       Set end-of-sentence marker. (Default: <utt>)

-v
       Show version information.

-V
       Set verbose mode.

-x <DocId>
       Output FoLiA XML, using the specified document ID (this disables usage of most other options: -nulPQVsS).
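
As an illustration of how the options above combine, here is a minimal sketch of two shell wrappers. The function names, file names and document ID are hypothetical placeholders; only the flags come from the option list, and the sketch assumes that ucto is installed and that a French configuration file is available.

```shell
# Sketch only: function names and arguments are placeholders, not part of ucto.

# Tokenize French text that already has one sentence per input line (-n)
# and write one sentence per output line (-m).
# Usage: tokenize_fr input-file output-file
tokenize_fr() {
    ucto -L fr -n -m "$1" "$2"
}

# Tokenize French text and emit FoLiA XML; -x takes the document ID.
# Usage: tokenize_fr_folia doc-id input-file output-file
tokenize_fr_folia() {
    ucto -L fr -x "$1" "$2" "$3"
}
```

For example, `tokenize_fr input.txt output.txt` would run ucto with the French rules on input.txt. The -n and -m flags are left out of the FoLiA wrapper because -x disables them.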

Maarten van Gompel <proycon@anaproy.nl>
Ko van der Sloot <Timbl@uvt.nl>

2011 March 14                                                        UCTO(1)
