
How To Detect Whether Text Is Human Readable?

I am wondering if there's a way to tell whether a given text is human readable. By human readable, I mean: it has some meaning and is formatted like an article written by somebody, or at least gen

Solution 1:

You can try a language identification tool, or something similar.

Basically, you count the characters, or groups of characters (character n-grams), in the submitted text and compare their distribution with the distribution found in a collection of well-written English texts. (Make sure that this collection is representative of the expected input.)
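The comparison described above can be sketched roughly as follows. This is a minimal illustration, not a tuned detector: it assumes character bigrams and cosine similarity as the comparison measure (the answer does not name a specific measure), and `REFERENCE_TEXT` is a tiny hypothetical stand-in for a real, representative corpus.

```python
from collections import Counter
from math import sqrt

N = 2  # n-gram size; bigrams are an assumption made for this sketch


def char_ngrams(text, n=N):
    """Count overlapping character n-grams (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical placeholder corpus; replace with texts representative of your input.
REFERENCE_TEXT = (
    "The quick brown fox jumps over the lazy dog. "
    "She said that there is nothing more to be done about it, "
    "and they went home before the rain started. "
    "This is an example of ordinary English prose written by a person."
)
REFERENCE_PROFILE = char_ngrams(REFERENCE_TEXT)


def similarity_to_english(text):
    """Score how closely the text's n-gram distribution matches the reference."""
    return cosine_similarity(char_ngrams(text), REFERENCE_PROFILE)


if __name__ == "__main__":
    print(similarity_to_english("The children are playing in the garden."))
    print(similarity_to_english("xqzjk wvqpx zzkqj jqxzv kkxqz"))
```

Ordinary English should score noticeably higher than random letters. Any cut-off between "readable" and "not readable" would have to be chosen empirically on your own data.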

As a follow-up to the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
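A stop-word check along these lines might look like the sketch below. The stop-word list is a small illustrative sample and the `threshold` of 0.15 is an arbitrary assumption; both would need adjusting for real input.

```python
import re

# Small illustrative sample; a real list would be longer.
STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}
)


def stop_word_ratio(text):
    """Return the fraction of word tokens that are stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in STOP_WORDS for token in tokens) / len(tokens)


def looks_like_english(text, threshold=0.15):
    """Heuristic: enough stop words suggests natural English prose."""
    return stop_word_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_like_english("The cat sat on the mat"))
    print(looks_like_english("xqzjk wvqpx zzkqj"))
```

Very short inputs may contain no stop words at all, so this heuristic is more reliable on sentences or paragraphs than on single words.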

Solution 2:

Most NLP libraries will do the job (spaCy is a very common one). You can also go for language detection: langdetect (https://pypi.org/project/langdetect/) supports this, as do many others. If you need something less language-specific (more math than language), look into phonotactics (with BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which examines how the characters in a string are ordered.
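As a rough sketch of the language-detection route, the snippet below uses langdetect's `detect_langs` to get a probability that the text is English. It assumes langdetect is installed (`pip install langdetect`) and returns `None` if it is not; the helper name `english_probability` is made up for this example. A language detector will usually still pick some language for gibberish, so treat the result as a weak signal rather than a verdict.

```python
try:
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make langdetect's results deterministic
except ImportError:
    detect_langs = None  # langdetect is an optional third-party dependency


def english_probability(text):
    """Return langdetect's probability that text is English.

    Returns None when langdetect is not installed, and 0.0 when the
    detector finds no usable features or does not list English.
    """
    if detect_langs is None:
        return None
    try:
        candidates = detect_langs(text)
    except LangDetectException:
        # Raised for inputs with no detectable features (e.g. empty strings).
        return 0.0
    for candidate in candidates:
        if candidate.lang == "en":
            return candidate.prob
    return 0.0


if __name__ == "__main__":
    print(english_probability("This is a simple sentence about the weather."))
```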

Solution 3:

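Do a hexdump and make sure each byte is less than or equal to 0x7f, i.e. that the data is plain 7-bit ASCII. Note that this only rules out binary or non-ASCII data; it says nothing about whether the text is meaningful, and it rejects legitimate text containing non-ASCII characters.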
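The same byte check can be done in code instead of reading a hexdump by eye. A minimal sketch, with a made-up helper name (on Python 3.7+, `bytes.isascii()` performs the same test):

```python
def is_seven_bit_ascii(data: bytes) -> bool:
    """Return True if every byte is in the range 0x00-0x7F."""
    return all(byte <= 0x7F for byte in data)


if __name__ == "__main__":
    print(is_seven_bit_ascii(b"hello world"))
    print(is_seven_bit_ascii("h\u00e9llo".encode("utf-8")))
```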
