Happy New Year! 2020 Retro and 2021 Plans

December 31, 2020 Stanley Ruan

Key components of this blog:
- My career development and milestones in 2020
- My observations as an Analyst-turned-Data Scientist and tips for future Data Scientists
- My goals for 2021

Happy New Year everyone! After a turbulent 2020 marked by a crazy pandemic and social injustice elevated by the Trump administration, we are finally starting 2021…

Gibberish Detection Using Brown Corpus and NLP Techniques

September 27, 2019 Stanley Ruan

Today I am going to share a Python script that lets you detect gibberish, or unusual Anglo-Saxon words (e.g. in English and other European languages), using NLP techniques. To give you a little bit of background, the Brown corpus is a collection of American English text samples totaling roughly 1 million words. Despite comprising only English words,…
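The excerpt stops before the script itself, so the following is only a minimal sketch of one common way to use a reference vocabulary for this task, not the approach from the post. It scores text by the fraction of its tokens found in a word set. The names `KNOWN_WORDS`, `known_word_ratio`, `looks_like_gibberish`, and the 0.5 threshold are all hypothetical, and the tiny vocabulary is a stand-in: with NLTK installed and the corpus downloaded, the set could instead be built from `nltk.corpus.brown.words()`.

```python
import re

# Tiny stand-in vocabulary (an assumption for illustration only).
# With NLTK and the Brown corpus available, this set could be built from
# nltk.corpus.brown.words() instead.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def known_word_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def looks_like_gibberish(text, vocabulary, threshold=0.5):
    """Flag text when fewer than `threshold` of its tokens are known words."""
    return known_word_ratio(text, vocabulary) < threshold
```

The threshold is a tuning knob: a stricter value flags more text, at the cost of more false positives on names and rare words.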

Merging Human-In-The-Loop with Trust & Safety – The Best of Both Worlds

May 27, 2019 Stanley Ruan

From Facebook’s data breach to rising identity theft around the world and strict European GDPR laws, Trust & Safety is becoming a critical part of major global companies that are data-driven. With 2-plus years of experience as a Trust & Safety Data Analyst, I’ve built various data models that are highly effective at surfacing anomalies among…

Sentiment Analysis with Python (Simple Way)

January 22, 2018 (updated January 25, 2018) Stanley Ruan

For those of you who have been following my blog consistently, you may recall that sometime in 2016 I wrote an article on Sentiment Analysis with R using Twitter data (link). If you did, I hope you enjoyed reading it. Nowadays, it is hard to argue against the fact that Python is quickly…

ANOVA with Python

January 20, 2018 (updated January 21, 2018) Stanley Ruan

Today I want to show you some simple code to conduct a multi-sample ANOVA test, followed by t-tests, with Python’s powerful scipy package. ANOVA is handy when you want to compare more than 2 samples to see if their differences (if any) are statistically significant. The test is widely used in A/B testing, comparison of automobile…
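To make concrete what a one-way ANOVA compares, here is a minimal standard-library sketch that computes only the F statistic, the ratio of between-group to within-group variance. It is not the post's scipy code, and the function name `one_way_anova_f` is hypothetical. In practice, `scipy.stats.f_oneway` returns this statistic together with a p-value.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more samples."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    group_means = [mean(group) for group in groups]
    grand_mean = sum(sum(group) for group in groups) / n_total

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(group) * (group_mean - grand_mean) ** 2
        for group, group_mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (x - group_mean) ** 2
        for group, group_mean in zip(groups, group_means)
        for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

For example, `one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])` gives 3.0: the between-group mean square is 6 / 2 = 3 and the within-group mean square is 6 / 6 = 1. A larger F means the group means differ more relative to the noise inside each group.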

Data Randomization

October 21, 2017 Stanley Ruan

Has there been a time when you had a large dataset file with a lot of columns and up to millions of rows, and you wanted to randomly pick n rows for each ID within that file? What if instead of doing that from millions of rows you only have…
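The task described above, picking n random rows per ID, can be sketched with the standard library alone. This is an illustration under assumptions, not the post's script: rows are modeled as dictionaries, and the names `sample_rows_per_id`, `rows`, `id_key`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def sample_rows_per_id(rows, id_key, n, seed=None):
    """Return up to `n` randomly chosen rows for each distinct value of `id_key`."""
    rng = random.Random(seed)  # a seed makes the selection reproducible

    # Group the rows by their ID.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_key]].append(row)

    # Sample without replacement; IDs with fewer than n rows keep all their rows.
    sampled = []
    for group_rows in grouped.values():
        sampled.extend(rng.sample(group_rows, min(n, len(group_rows))))
    return sampled
```

Capping the sample size at the group size is a design choice so that small groups do not raise an error. For data already in a pandas DataFrame, `DataFrame.groupby(...).sample(n=...)` covers a similar use case.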

Simple Document Control Script

October 21, 2017 Stanley Ruan

In this article I have included a link to a Python script I wrote that can serve some document control and record-keeping purposes. Let’s say that for record-keeping or administrative reasons, you have to download a zipped file from the web on a regular basis in order to ensure you have the up-to-date version of…

Introduction to Applied Natural Language Processing with Python

October 7, 2017 (updated January 20, 2018) Stanley Ruan, 1 Comment

Data source: NLP GitHub page. In this blog I am excited to share a simple natural language processing methodology that I learned very recently. The methodology is called semantic analysis, and you can run it using Python’s NLTK (Natural Language Toolkit) package. There are many functions in the NLTK package that can achieve semantic analysis. The…

Abusive Behaviors in Software Platforms – Trust & Safety Perspective

July 27, 2017 (updated July 29, 2017) Stanley Ruan, 1 Comment

Guideline: This article will share some observations I have made about abusive users and contributors on social media sites and crowdsourcing products, and the challenges and opportunities in uncovering these users. I hope that it will shed some light for data analysts/scientists dealing with the trust & safety side of crowdsourcing platforms,…

Support Vector Machine (Supervised Machine Learning) using R

September 11, 2016 Stanley Ruan, 2 Comments

In this article I will talk about a supervised machine learning tool known as Support Vector Machine (SVM). I chose to write about SVM because it is one of the most commonly used and most easily implemented machine learning techniques. In addition, its methodology bears some similarities to the K-Nearest Neighbor (KNN) supervised…

Data Analysis Blog

For enthusiastic readers with an interest in data and statistical analysis.