Validate xml in jedit

Enca is an encoding guesser and converter. Perl's Encode::Guess (part of the standard distribution) tries successive encodings on a byte string and returns the first encoding in which the string is valid text. Such guessers can make mistakes, but they often work in practice as long as you don't deliberately try to fool them.
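The strategy described for Encode::Guess, trying candidate encodings in turn and keeping the first one that decodes cleanly, can be sketched in Python. This is only an illustration of the idea, not how Encode::Guess is actually implemented; the function name guess_encoding and the candidate list are assumptions made for this example.

```python
def guess_encoding(data: bytes, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate in which data decodes without error.

    Hypothetical helper for illustration. Order matters: Python's latin-1
    codec accepts every byte sequence, so it acts as a catch-all and has
    to come last.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid input
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None  # no candidate matched
```

Note that guess_encoding(b"\xc3\xbd") returns "utf-8" even if the bytes were written as latin1 text, which is exactly the kind of mistake such guessers can make.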

There are tools that try to guess the encoding of a text file, and the file command is one of them: it makes "best-guesses" about the encoding. Use the -i parameter to force file to print information about the encoding:

umlaut-utf8.txt: text/plain; charset=utf-8
umlaut-utf16.txt: text/plain; charset=utf-16le

Here is how I created the files:

$ echo ä > umlaut-utf8.txt

Convince yourself of its encoding with a hex dump:

$ hexdump -C umlaut-utf8.txt

Convert to the other encodings:

$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt

Check the hex dump:

$ hexdump -C umlaut-iso88591.txt

Create something "invalid" by mixing all three:

$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt

For the mixed file, file -i reports:

umlaut-mixed.txt: application/octet-stream; charset=binary

Without -i, the UTF-16 file is described as:

umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators

The file command has no idea of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans, we might be able to recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence to do so. One might argue that the heuristics of file are some sort of artificial intelligence. Yet even if they are, it is a very limited one.

It isn't always possible to find out for sure what the encoding of a text file is. For example, the byte sequence \303\275 (c3 bd in hexadecimal) could be ý in UTF-8, or Ã½ in latin1, or Ă˝ in latin2, or 羸 in BIG-5, and so on. Some encodings have invalid byte sequences, so it's possible to rule them out for sure. This is true in particular of UTF-8: most texts in most 8-bit encodings are not valid UTF-8. You can test for valid UTF-8 with isutf8 from moreutils or with iconv -f utf-8 -t utf-8 >/dev/null, amongst others.
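The points made here about ambiguous bytes and UTF-8 validity can be illustrated with a short Python sketch. It decodes the bytes c3 bd under several encodings and uses a strict decode as a validity check, similar in spirit to iconv -f utf-8 -t utf-8. The helper names is_valid_utf8 and decode_under are assumptions made for this example, and the codec names are Python's, not those of iconv or file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def decode_under(data: bytes, codecs=("utf-8", "latin-1", "iso8859-2", "big5")):
    """Map each codec name to the text that data would represent in it."""
    readings = {}
    for codec in codecs:
        try:
            readings[codec] = data.decode(codec)
        except UnicodeDecodeError:
            readings[codec] = None  # ruled out: invalid in this encoding
    return readings


# The same two bytes are valid, but mean different text, in several encodings.
readings = decode_under(b"\xc3\xbd")

# "ä" plus a newline, as written by the echo command in the demonstration.
utf8_ok = is_valid_utf8("ä\n".encode("utf-8"))      # bytes c3 a4 0a
latin1_ok = is_valid_utf8("ä\n".encode("latin-1"))  # bytes e4 0a
```

A successful check only shows that the bytes could be UTF-8, not that the author intended them as such. A failed check, on the other hand, rules UTF-8 out for sure.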