As information technology advances and businesses become increasingly global, the need to efficiently manage large and complex data becomes more critical. While many are aware of the growing importance of big data, professionals who are not directly involved in data analysis might not have experience handling large raw datasets, let alone deciphering big data on non-corporate (less powerful) hardware. This blog tries to fill that gap by explaining how to read large datasets quickly and in a memory-efficient way on a personal computer. I will begin by describing a large dataset with 242 million total entries, then walk through the R code to read the truncated data in csv and txt format (truncated by Excel due to its row limit), and finally the R code to read the complete dataset in tsv format.
Data description:
The dataset I will be using is from the Wikipedia Participation Challenge, which I found on Kaggle. It consists of more than 22 million Wikipedia edit records made by 44,514 editors between January 2001 and August 2010. Because there are 11 attributes (columns) in the data file, that makes a total of 242 million cells filled with data! The most important attributes are “user_id (int)”, “article_id (int)” and “revision_id (int)”. Below is a description of these 3 attributes:
user_id (INT): id of the editor who made the revision. This has been randomly recoded and does not match an editor id from the Wikimedia website. This variable can be stored as an integer.
article_id (INT): id of the article to which the revision belongs. This variable can be stored as an integer.
revision_id (INT): id of the revision. This variable can be stored as an integer.
Descriptions of the other 8 attributes can be found on this Kaggle page. For readers not familiar with Kaggle, it is a platform where data miners and statisticians share insights and build predictive analytics models.
Obtaining the raw data:
To download the dataset, simply go to the link I provided above and you will see a file named “wikichallenge_data_all”. Download that file, unzip it, and you’ll get a 2GB raw dataset named “training.tsv”, as shown in the screenshot below. That’s the dataset mentioned earlier, with more than 22 million rows, and it’s the one we will use.
Now if you try to open the dataset in Notepad, an error message will pop up saying the file is too large to open. If you try it with Wordpad, the file will open, but you won’t be able to see the records because the whole page will be blank due to format rendering issues. Another option is to save the file in csv or txt format; the file size will be about 93MB in either format. There’s a problem, though: Excel can only hold about 1 million rows, and a txt file saved this way has a similar limitation. That’s still a lot of data, but it leaves roughly 95% of the raw data unseen! Moreover, because some older PCs have sub-par specs, even working with a 93MB Excel file might not be very time- or memory-efficient. Thus, in the section right below I will discuss effective ways to read a large csv file using R, and then ways to read the raw tsv dataset with more than 22 million rows.
Reading the truncated data in csv and txt format (data are truncated by Excel due to its row limit):
The first step is to install and load the “ffbase” package in R using the following code:
> install.packages("ffbase")
> require(ffbase)
The “ffbase” package provides a memory-efficient way to read large csv and txt files and lets you work with the resulting ffdf object much like a data frame.
Below are the commands, along with the time it takes R to read the dataset in csv and txt format:
> system.time(BDCSV <- read.table.ffdf(file="Wikidata.csv", FUN = "read.csv", na.strings = ""))
user system elapsed
58.24 0.29 58.72
> system.time(BDDF <- read.csv.ffdf(file="Wikidata.txt", header=TRUE))
user system elapsed
57.41 0.11 57.71
As you can see, reading the csv and txt files takes about the same time: approximately 58 seconds of user time. This may seem very fast, but note that the time depends on hardware specifications. My PC has an Intel i5 processor and 8GB of memory, which definitely helps speed things up. It’s not uncommon for older computers to take three times longer to load the same files.
Because of the large size of the csv file, I won’t be uploading it to this blog. A screenshot of the first 23 records and the 11 columns is shown below.
Below are some useful commands to find records in the dataset based on specific criteria (here I refer to the imported data object as “bd”); a quick row-filtering sketch follows this list.
> dim(bd)
*Returns the number of rows and the number of columns
> str(bd[1:11,])
*Returns column names and column types, and shows the first 11 rows of data.
> result <- list()
> result$user_ID <- unique(bd$user_id)
> result$article_ID <- unique(bd$article_id)
*These commands return all the unique values of the column named inside the parentheses.
> result$user.article <- unique(bd[c("user_id","article_id")])
*Returns records that are unique combinations of “user_id” and “article_id”.
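As a quick illustration of the “find records based on specific criteria” part, below is a minimal sketch using subset(), which, as far as I know, ffbase also provides for ffdf objects. This is my own addition for illustration: the object name “bd” and the example editor id are placeholders rather than part of the original analysis.
> # Filter rows for one editor id without loading the whole table into RAM
> one_editor <- subset(bd, user_id == 389427)
> dim(one_editor)
> # The filtered result is small enough to convert to a regular data frame
> one_editor_df <- as.data.frame(one_editor)
The same pattern works for any column, e.g. subset(bd, article_id == 9445560), as long as the condition references columns that exist in the data.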
Reading original (complete) data in tsv format:
As promised before, I will now talk about how to read the complete raw dataset (un-truncated) using R.
Note that we previously used “read.table.ffdf” and “read.csv.ffdf” to read the truncated data in csv and txt format. Because we are now reading data in tsv format, we use the “fread” function from the “data.table” package instead. (Do not use the regular “read.csv” or “read.table” function to read this tsv file; when I tried that, R kept loading until it crashed after about 40 minutes.)
> library(data.table)
> bd <- fread("training.tsv", header=TRUE)
Read 22126031 rows and 11 (of 11) columns from 2.000 GB file in 00:12:07
Look at the message generated by the “fread” function. It took R 12 minutes to read the 2GB file with about 22 million rows and 242 million total entries. That’s not bad considering it took the ffdf read commands about a minute to read the csv/txt file, which contains “only” 1 million rows.
> dim(bd)
[1] 22126031 11
The “dim” command returns the dimensions of the “bd” (training.tsv) dataset. The first number indicates the number of rows and the second the number of columns.
> str(bd[1:11,])
Classes 'data.table' and 'data.frame': 11 obs. of 11 variables:
$ user_id: int 389427 389427 389427 389427 389427 389427 389427 389427 389427 870433 …
$ article_id: int 9445560 9445560 9445560 9445560 9445560 864287 864287 864287 864287 4205653 …
$ revision_id: int 279555872 279556938 304562912 304563043 346862274 14648017 14648035 14648632 14666943 41325786 …
$ namespace: int 0 0 0 0 0 0 0 0 0 0 …
$ timestamp: chr "2009-03-25 10:31:33" "2009-03-25 10:41:09" "2009-07-27 22:23:46" "2009-07-27 22:24:41" …
$ md5: chr "175d8a73b80b3c95719c458b698e5179" "48fa53402b2819283472a899a30117a1" "28b4e603c11f65343c39fb2946615925" "cac360b626a79d08f7369d59b7d6c475" …
$ reverted: int 0 0 0 0 0 0 0 0 0 0 …
$ reverted_user_id: int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 …
$ reverted_revision_id: int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 …
$ delta: int 276 97 18 4 -11 111 8 -6 23 28 …
$ cur_size: int 1661 1758 1776 1780 2227 368 376 370 393 2642 …
- attr(*, ".internal.selfref")=<externalptr>
The “str” function lists all column names and data types of the “training.tsv” dataset. It also shows records from the first 11 rows, since that is what the function call specified.
Below are the commands to return the unique values of the first 2 attributes, as well as records that are unique combinations of these 2 attributes.
> result <- list()
> result$user_ID <- unique(bd$user_id)
> result$user_ID
* [1]-[9971] not shown here for brevity
[9972] 937838 334651 322848 794732 533663 48203 468135 350981 67269 174959 191967 217514 701465
[9985] 371438 525098 216206 664055 160793 497053 500646 483509 935701 799004 739991 723903 776599
[9998] 781039 406985 473457
[ reached getOption("max.print") -- omitted 34514 entries ]
> result$article_ID <- unique(bd$article_id)
* [1]-[9970] not shown here for brevity
[9971] 2069631 1792781 10945811 1926746 344062 8404008 435355 4051486 4142291 481459
[9981] 50268 10057916 9249082 2717634 515839 1612808 2730462 1625778 9482079 2752263
[9991] 248038 12543266 12555886 12571809 1652111 535399 4556233 539170 4565521 540083
[ reached getOption("max.print") -- omitted 4873700 entries ]
> df <- data.frame(bd)
*Show unique combination records:
> result$user.article <- unique(df[c("user_id","article_id")])
> result$user.article
* Rows 1-10817 not shown here for brevity
10818 905228 3470456
10819 905228 3470923
10820 905228 692849
10835 905228 7708631
[ reached getOption("max.print") -- omitted 9648117 rows ]
You can see the code is nearly identical to the code used to analyze the csv/txt data, the only difference being the extra “data.frame” call to convert the data.table produced by “fread” into a regular data frame. If the data is not converted, R will return the following error message when you run the unique-combination command:
Error: When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,…) first, see ?setkey.
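For completeness, there is also a data.table-native way around this error that skips the data frame conversion. The sketch below is my own illustration (not part of the original workflow) and assumes “bd” is still the data.table returned by “fread” above:
> # Select the two columns with data.table syntax, then keep unique combinations
> result$user.article <- unique(bd[, list(user_id, article_id)])
> # Setting a key sorts the table and enables fast keyed lookups afterwards
> setkey(bd, user_id)
> bd[J(389427)] # all rows for this editor id
Whether you prefer this or the data.frame conversion is largely a matter of taste; the data.frame route shown above is simpler if you are not used to data.table syntax.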
While R code can be helpful for those with a background in statistical analysis, professionals specialized in data informatics might not be familiar with R. The good news is that there are SQL-equivalent commands to pull data with R! I wrote a lot of SQL queries during my Google project helping their supply chain team pull records from the database, and I can say that SQL is a very efficient way to pull specific data from a large database. There’s a catch, though: running SQL functions in R is generally much slower than running the “fread” function. In my experience, the “read.csv.sql” function takes on average 4 times longer than “fread” to read input, so since “fread” took 12 minutes to load the dataset, “read.csv.sql” would take about 48 minutes to load the same dataset. Due to time constraints, I will only include the SQL-equivalent code without listing the outputs:
> install.packages("sqldf")
> library(sqldf)
If you want to retrieve all rows where user_id is 389427, run the following code:
> read.csv.sql("training.csv", sql="select * from file where user_id='389427'")
Alternatively, if you want to retrieve all unique user_ids or article_ids like we did with the “unique” function earlier, run the following code:
> read.csv.sql("training.csv", sql="select distinct user_id from file")
> read.csv.sql("training.csv", sql="select distinct article_id from file")
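The SQL route also makes aggregation straightforward. The query below is my own hedged example rather than part of the original post; it assumes the same “training.csv” file and column names used above and counts how many edits each editor made:
> edits_per_user <- read.csv.sql("training.csv", sql = "select user_id, count(*) as n_edits from file group by user_id")
> # Sort so the most active editors appear first
> head(edits_per_user[order(-edits_per_user$n_edits), ])
Keep in mind the same speed caveat applies: read.csv.sql re-reads the file into a temporary SQLite database each time it is called.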
That’s it for reading and analyzing big data with R! One thing to keep in mind is that while the “fread” function is helpful for reading large tsv datasets, it might not work for a data file with, say, 150 million rows and 29 columns. Most of today’s high-end personal computers have no more than 8GB of memory, so reading that many records into RAM would be a problem. An alternative is to use Hadoop or MapReduce methods, both of which I am still learning and which are beyond the scope of this blog. You can learn more about this by reading this R blog written by Quantitative Finance Collector.
If you want to practice reading big data with R, feel free to download some open data repositories from various industries via this Quora forum. Let me know if you have any thoughts or questions in the comment section. Good luck! 🙂