# Benchmarking Data Tables

A look into the claim performance of Data Tables in the R language.
How-to
R
Author

Mark Edney

Published

April 13, 2022 When I started learning R, I heard vague tales of the use of Data Tables. Really just whisperers, of something to consider in the future after I’ve become more proficient. Well now is the time to learn what if anything I’ve been missing out on.

## Introduction

Data Tables are a potential replacement for the common dataframe. It seeks to perform that same role but with improved performance. I would like to see the speed comparison between Data Frames, Data Tables and Tibbles. I will use the `microbenchmark` package to perform the actual benchmarking.

``````library(tidyverse)
library(data.table)
library(microbenchmark)
library(farff)``````

For the benchmark, I will use the ‘credit-g’ dataset, which can be found on the open ml website. I’m pretty sure the last open ml dataset I used was a csv file, but they seem to have moved to a ARFF format. I will need to use the `farff` package to load the data.

``````df <- farff::readARFF('dataset_31_credit-g.arff')
dt <- setDT(df)
ti <- tibble(df)``````

## Syntax

The syntax for Data Tables is a little different:

DT[i,j,by]

In this manner, a data table can be subset by i, to calculate j when grouped with a by. Along with the special syntax, there are some common functions that add some additional simplification.

.()

The ‘.()’ function can be used as a placeholder for ‘list()’. The list function is useful for subsetting.

## Grouped Aggregate

Aggregating data in Data Tables is simple by using the j and by parameters in the syntax. Again, multiple functions or even multiple groupings can be passed with the ‘.()’ function. For this comparison, we will look at the performance of finding the average age of the credit holders grouped by the class or credit rating.

``````group <- microbenchmark(Data_Frame = df %>%
group_by(class) %>%
summarise(avg = mean(age)),
Data_Table = dt[,.(avg = mean(age)), by = class],
Tibble = ti %>%
group_by(class) %>%
summarise(avg = mean(age)))
print(group)``````
``````Unit: microseconds
expr    min     lq     mean  median      uq     max neval
Data_Frame 5406.5 5564.6 6138.870 5904.40 6306.95 17135.6   100
Data_Table  623.6  766.8  941.930  885.95  958.05  7471.7   100
Tibble 5602.8 5783.1 6586.445 6103.15 6902.65 14356.9   100``````

## Taking counts

Another function of interest is the ‘.N’ function. This function will return the count of rows. The test looks are the number of people with over 5000 in credit and younger than 35.

``````counts <- microbenchmark(Data_Frame = df %>%
filter(credit_amount > 5000, age <35) %>%
nrow(),
Data_Table = dt[credit_amount > 5000 & age < 35, .N ,],
Tibble = ti %>%
filter(credit_amount > 5000, age <35) %>%
nrow())
print(counts)``````
``````Unit: microseconds
expr    min      lq      mean  median      uq     max neval
Data_Frame 8404.1 8584.90  9576.638 8927.45 9867.95 17254.5   100
Data_Table  276.7  309.50   429.832  451.80  466.35  1045.6   100
Tibble 8945.1 9204.75 10206.312 9405.45 9942.45 27081.9   100``````

## Creating new columns

Data Tables also contain a very simple syntax for creating a new column with ‘:=’. I compare this to the `tidyverse` mutate function. Using the base R to create a column is still the fastest method, taking about half the time of the Data Table method.

``````new <- microbenchmark(Data_Frame = df %>% mutate(property = paste(property_magnitude, housing)),
Data_Table = dt[,property := paste(property_magnitude, housing)],
Tibble = ti %>% mutate(property = paste(property_magnitude, housing)))
print(new)``````
``````Unit: microseconds
expr    min      lq     mean median      uq    max neval
Data_Frame 2071.3 2148.55 2396.203 2259.6 2586.40 3386.1   100
Data_Table  505.7  580.00  657.802  656.6  694.35  996.0   100
Tibble 2585.3 2754.85 3187.554 2922.9 3331.70 8879.8   100``````

## Chaining Data Tables

Another point of exploration is that Data Tables can be chained together to create more complicated structures

``dt[credit_amount > 1000, .(age = mean(age)),by = .(purpose, class)][class == "good" & age < mean(age)]``
``````               purpose class      age
2: furniture/equipment  good 33.21930
3:            used car  good 36.91860