Data Randomization

Have you ever had a large dataset file with many columns and up to millions of rows, and needed to randomly pick n rows for each ID in that file? What if, instead of millions of rows, you only had to do that on a CSV file with hundreds of thousands of rows? In the latter case, you could run a randomization formula in Excel or Google Sheets to get what you want. However, whichever of the two you use, you would likely have to wait a long time for the calculation to finish. In Excel, the formula might even crash the program if you are running other memory-hungry applications in the background.

If you’ve come across any of the problems above and have yet to figure out a more efficient way to tackle them, you’ve come to the right place!

The solution is actually quite simple, and can be achieved with a Python script that works on data frames. (I am sure there are other efficient ways to arrive at the same result that I am not yet aware of. So if you guys know of alternative methods, feel free to share them in the comment section and we can learn together.) You can access the script via this link: Data_randomization_script. As you read through the script, you will see various comments that explain what each line does.
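To give a rough idea of what picking n random rows per ID can look like with a data frame, here is a minimal sketch using pandas. It is not the linked script, just one possible approach. The function name sample_rows_per_id, the column names "ID" and "value", and the tiny in-memory dataset are all made up for illustration; with a real file you would load the data with pd.read_csv instead.

```python
import pandas as pd


def sample_rows_per_id(df, id_column="ID", n=3, seed=42):
    """Return up to n randomly chosen rows for each distinct value in id_column.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    # Shuffle every row once (frac=1 keeps all rows, just in random order).
    shuffled = df.sample(frac=1, random_state=seed)
    # After shuffling, the first n rows of each ID group are a random pick.
    # IDs with fewer than n rows simply keep all of their rows.
    picked = shuffled.groupby(id_column, sort=False).head(n)
    # Restore the original row order so the output is easier to read.
    return picked.sort_index()


# Made-up example data: ID "a" has 5 rows, "b" has 2 rows, "c" has 4 rows.
example = pd.DataFrame(
    {"ID": ["a"] * 5 + ["b"] * 2 + ["c"] * 4, "value": range(11)}
)

picked = sample_rows_per_id(example, id_column="ID", n=3, seed=42)
print(picked)
```

Shuffling once and then taking the first n rows per group avoids looping over the IDs in Python, which tends to matter when the file has hundreds of thousands of rows or more. Setting seed makes the pick repeatable; pass a different value (or None) to get a different draw.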

Hope you guys find this short tutorial helpful! 🙂
