Today I am going to share a Python script that would enable you to detect gibberish, or unusual Anglo-Saxon words (i.e. English, European languages) using NLP techniques with Python.
To give you a little bit of background, Brown corpus is a dictionary that contains 1 million common English words. Despite comprising of only English words, the Brown corpus is excellent because many of the European languages share similar style as English words.
The lines below extract strings from the ‘utterance_question’ column to a single long string called ‘standard_phrase’, apply lower case to all letters, replace ‘\n’ with spaces, and then append them to the Brown corpus. The end result ‘standard_phrase’ is a long string that consists of words from utterance question column as well as the millions of words from the Brown corpus.
standard_phrase = ' '.join(whole_full[utterance_question].tolist())
standard_phrase = standard_phrase.lower()
standard_phrase = standard_phrase.replace('\n', ' ')
text = '\n'.join([' '.join([w for w in s]) for s in brown.sents()])
text = text.lower()
standard_phrase = standard_phrase + ' ' + text
The next 4 lines below counts the occurrences of every single-letter, double-letter, and triple-letter and put them into containers called “unigrams”, “bigrams” and “trigrams”.
unigrams = Counter(standard_phrase)
bigrams = Counter(standard_phrase[i:(i+2)] for i in range(len(standard_phrase)-2))
trigrams = Counter(standard_phrase[i:(i+3)] for i in range(len(standard_phrase)-3))
The next block of scripts creates a function that first assigns weights to the unigrams, bigrams and trigrams. (Weight 0.001 is assigned to unigrams, 0.01 to bigrams and 0.989 to trigrams. These weights are flexible and can be updated according to the nature of your dataset. For example, feel free to assign higher weights to bigram and lower weights to trigrams if you want to balance their impacts slightly more evenly. But generally speaking, the longer the characters, the higher the weight we want to assign them. The reason is that the letters ‘ley’ should be weighed more than the letter ‘y’ because it’s much more likely for ‘y’ to appear simply by chance than ‘ley’ – even a gibberish name could contain the letter ‘y’ but it’s less likely for a gibberish name to contain ‘ley’.)
Next, we multiply the weights by the occurrences of the unigrams, bigrams and trigrams and save them as a variable called ‘likelihood’ so that the higher the occurrences of the characters, the higher the ‘likelihood’. After that, we create a variable called ‘density’ which essentially sums occurrences of all unigrams and applying higher weights of unigrams and bigrams. It then divides ‘likelihood’ with ‘density’, apply natural log to it, and then returns the output. The higher the output, the more ‘gibberish’ or unusual the utterance tends to be.
weights = [0.001, 0.01, 0.989]
r = 0
if len(standard_phrase) >= 3:
for i in range(2, len(standard_phrase)):
single_char = standard_phrase[i]
two_chars = standard_phrase[(i-1):i]
three_chars = standard_phrase[(i-2):i]
likelihood = unigrams[single_char] * weights + bigrams[two_chars+single_char] * weights + trigrams[three_chars+single_char] * weights
density = sum(unigrams.values()) * weights + unigrams[single_char] * weights + bigrams[two_chars] * weights
r -= np.log(likelihood / density)
print('likelihood is ' + str(num))
print('sum of unigrams.values() is ' + str(sum(unigrams.values())))
print('den is ' + str(density))
return r / (len(standard_phrase) - 2)
elif len(standard_phrase) <= 2:
The next few lines applies the ‘strangeness’ function mentioned above into each row of utterances in the whole_full CSV, basically doing the calculation called out in the function and assigns a gibberish score to each utterance.
gibberish_level = 
for index, row in whole_full.iterrows():
whole_full['gibberish_level'] = gibberish_level
The line below sorts the output data by gibberish score in descending order.
whole_full = whole_full.sort_values(by='gibberish_level',ascending=False)
The line below outputs the result into a csv and stores it in the local drive.
whole_full.to_csv('/Users/stanleyruan/Desktop/gibberish_detection/utterance_gibberish_detection_' + str(b) + '.csv')
You can find the results in the attached Excel file in this article. Note that all the gibberish entries were correctly returned in the top of the list with the highest gibberish scores (i.e. “hjkyukklukuikil jhkkhjkhgkghhkhjk hkhjkhjkghkghkghk gkgyukyuyugkgyuk” and “uykyukyukgyukuy uykyukyukgyukyu uykyukyukyukuykuy uykygukgyukgyukuykuyuky kygkyukyukyukgyuk”), whereas legitimate utterances such as “location of the car look for the car find the car” were placed towards the bottom of the list with lowest gibberish scores.
This is it and hope you guys enjoying reading it and happy learning!!