Data source: NLP GitHub page
In this blog I am excited to share a simple natural language processing methodology that I learned recently. The methodology is called semantic analysis, and you can run it using Python's NLTK (Natural Language Toolkit) package. NLTK offers several functions that can perform semantic analysis; the focus of this article is the wup (Wu and Palmer) similarity function, as I found it to be more accurate than some of the other functions in the package. In a nutshell, semantic analysis takes 2 or more nouns and assigns a semantic score that indicates how similar the words are. The score ranges from 0.00 for completely unrelated words to 1.00 for words that are identical or perfectly similar. If you have a small sample size (such as the dataset we use in this article) and the words are complicated ones, even a score as low as 0.20 could still be considered reasonable evidence of a satisfactory match.
There are many applications of semantic analysis. For instance, when you have a large number of topics that you want to group by similarity, you could run semantic analysis and cluster all topics whose scores exceed a predetermined threshold. Alternatively, if you want to verify the level of similarity between 2 groups of nouns, you could run semantic analysis and use the returned scores for validation. Because I am relatively new to this tool, and for the sake of simplicity, this introduction focuses on the latter. For the same reason, this blog should be helpful for people who are relatively new to NLP; if you already have significant exposure to NLP, you may find this material too easy and can safely skip it.
In this introduction I have attached both the Python script you need to obtain the NLTK package (along with other data manipulation packages) and the CSV file that contains the columns of textual data we want to analyze with semantic analysis. A screenshot of the CSV file is shown below:
As you can see, there are 3 columns in this CSV: ID, industry and company name. There are 5 IDs in the dataset, and each ID covers multiple rows of textual data. The company name column contains the names of business entities, while the industry column attempts to categorize each business entity appropriately. For instance, a company called 'Boston Consulting' should be classified as something like 'management consulting' if done correctly; you will know the classification is wrong if it ends up as 'transportation services'.
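Since the actual CSV is attached separately, here is a small pandas sketch that mimics its layout. The rows below are made-up placeholders, not data from the real file:

```python
import io

import pandas as pd

# A few made-up rows mimicking the CSV's 3 columns
csv_text = """ID,industry,company_name
1,management consulting,Boston Consulting
1,transportation services,Delta Freight Co
2,computer software,Acme Analytics
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())      # ['ID', 'industry', 'company_name']
print(df.groupby("ID").size())  # number of rows of textual data per ID
```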
Now that we know enough about the raw data, you can navigate to my Python script to see the complete code I wrote.
After you've run the script above, you should have an output file containing information that resembles the image below:
Looking at the low scores, you might be quick to conclude that every ID must contain many poorly classified industries that do not represent the correct ones. However, if you examine the 'industry' column of the raw data, you can see that each industry category actually contains 3 layers ranked by granularity, starting from the most general and drilling down to the most detailed. On top of that, not all words in the company names are meaningful nouns, and some are even in foreign languages! Given the complex nature of the words in this dataset, I would recommend 2 steps to help you determine what counts as a passing score in this context:

1. Increase the number of observations if you can. For simplicity, there are only around 500 observations in this dataset, with each ID contributing roughly 100. That's not bad, but given the complexity of the industry and company_name columns, increasing the observations to at least 2 times what we have now is a good starting point.

2. Audit some of the observations manually (remember, a human in the loop can only make machine learning better or equal, never worse) to help yourself identify what score threshold is indicative of a good match.
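The manual audit in step 2 can be as simple as pulling a random sample of the scored output and checking it by hand. The DataFrame below is a made-up stand-in for the real output file, with illustrative scores:

```python
import pandas as pd

# Illustrative stand-in for the scored output file (all values are made up)
scored = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "industry": ["management consulting", "transportation services",
                 "computer software", "retail trade", "banking"],
    "company_name": ["Boston Consulting", "Delta Freight Co",
                     "Acme Analytics", "Shop Smart", "First Bank"],
    "score": [0.31, 0.12, 0.25, 0.18, 0.40],
})

# Pull a reproducible random sample to audit by hand, then judge which
# scores actually correspond to well-classified industries
audit = scored.sample(n=3, random_state=42)
print(audit[["industry", "company_name", "score"]])
```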
So this is it, everyone. I hope you enjoyed reading this blog and gained a good understanding of one of the practical natural language processing tools in Python. I look forward to writing my next blog very soon 🙂