---
title: "Introduction to 'dataMojo'"
author: "Jiena Gu McLellan"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to 'dataMojo'}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Built on the top of 'data.table', 'dataMojo' is a grammar of data manipulation with 'data.table', providing a consistent a series of utility functions that help you solve the most common data manipulation challenges:

* Calculate the row wise percentage
* Calculate the survey type percentage table
* Select columns
* Split one column to multiple columns based on patterns
* Filter cases based on their values
* Fill missing values
* Summarize and reduces multiple values down to a single summary
* Reshape `long to wide` or `wide to long`

## Calculate the row wise percentage

Calculate the row wise percentage of a frequency table

```{r row}
library(dataMojo)
library(data.table)
test_df <- data.frame(
      Group = c("A", "B", "C"),
      Female = c(2,3,5),
      Male = c(10,11, 13)
    )
print(test_df)
dataMojo::row_percent_convert(test_df, cols_rowsum = c("Female", "Male"))
```

## Calculate the survey type percentage table for *single* question

```{r survey1}
library(dataMojo)
library(data.table)
   test_dt <- data.table::data.table(
      Question = c(rep("Good", 3), rep("OK", 3), rep("Bad", 3)),
      Gender = c(rep("F", 4), rep("M", 5))
    )
   print(test_dt)
   dataMojo::pivot_percent_at(test_dt, 
                                 question_col = "Question", aggregated_by_cols = "Gender")
```

## Calcuate the survey type percentage table for *multiple* question

```{r survey2}
library(dataMojo)
library(data.table)
test_dt <- data.table(
      Question1 = c(rep("Good", 3), rep("OK", 3), rep("Bad", 3)),
      Question2 = c(rep("Good", 2), rep("OK", 2), rep("Bad", 5)),
      Gender = c(rep("F", 4), rep("M", 5))
    )
print(test_dt)
dataMojo::pivot_percent_at_multi(test_dt, 
                                    question_col = c("Question1","Question2") , aggregated_by_cols = "Gender") 
```

## Calculate the column wise percentage with desired numerator and denominator 

This function is to calculate column-wise percentage in a new column with desired numerator columns and denominator columns. If denominator is 0, the percentage will be `N/A`. 

```{r col_wise_percentage}
library(dataMojo)
test_df <- data.frame(
      hc1 = c(2, 0, 1, 5, 6, 7, 10),
      hc2 = c(1, 0, 10, 12, 4, 1, 9 ),
      total = c(10, 2, 0, 39, 23, 27, 30)
    )
print(test_df)
dataMojo::col_cal_percent(test_df, 
                          new_col_name = "hc_percentage", 
                          numerator_cols = c("hc1", "hc2"), 
                          denominator_cols = "total"
                          ) 
```

## Select columns
Select variables in a data table. You can also use predicate functions like is.numeric to select variables based on their properties (e.g. 1:3 selects the first column to the third column).

```{r ex1}
library(dataMojo)
library(data.table)
data("dt_dates")
dt_dates <- setDT(dt_dates)
dataMojo::select_cols(dt_dates, c("Start_Date", "Full_name"))
```


## Split a column
Split a column with its special pattern, and assign to multiple columns respectively. For example, split full name column to first name and last name column. 

```{r ex2}
data("dt_dates")
library(data.table)
data("dt_dates")
dataMojo::str_split_col(dt_dates,
              by_col = "Full_name",
              by_pattern = ", ",
              match_to_names = c("First Name", "Last Name"))
```


## Filter cases based on values

`filter_all()` is to return a data table with **ALL** columns (greater than/ less than/ equal to) a desired value. 

```{r ex3}
data("dt_values")
dataMojo::filter_all(dt_values, operator = "l", .2)
```

`filter_any()` is to return a data table with **ANY** columns (greater than/ less than/ equal to) a desired value. 


```{r ex4}
data("dt_values")
dataMojo::filter_any(dt_values, operator = "l", .1)
```

Similarly, `filter_all_at()` is to return a data table with **ALL selected** columns (greater than/ less than/ equal to) a desired value. 

```{r ex5}
data("dt_values")
dataMojo::filter_all_at(dt_values, operator = "l", .1, c("A1", "A2"))
```

Similarly, `filter_any_at()` is to return a data table with **ANY selected** columns (greater than/ less than/ equal to) a desired value. 


```{r ex6}
data("dt_values")
dataMojo::filter_any_at(dt_values, operator = "l", .1, c("A1", "A2"))
```

## Fill missing values

`fill_NA_with()` will fill NA value with a desired value in the selected columns. If `fill_cols` is `All` (same columns type), it will apply to the whole data table.

```{r ex7}
data("dt_missing")
dataMojo::fill_NA_with(dt_missing, fill_cols = c("Full_name"), fill_value = "pending")
```

## Group by and summarize

`dt_group_by()` is to group by desired columns and summarize rows within groups. 


```{r ex8}
data("dt_groups")
print(head(dt_groups))
```

Now we see the `dt_groups` data table has A1, A2 as numeric columns, and group1, group2 as group infomation. 

```{r ex9}
data("dt_groups")
dataMojo::dt_group_by(dt_groups, 
            group_by_cols = c("group1", "group2"), 
            summarize_at = "A1", 
            operation = "mean")
```

Now we want to group by group1 and group2, then fetch the first within each group, we can use `get_row_group_by()` function. 

```{r ex10}
data("dt_groups")
dataMojo::get_row_group_by(dt_groups, 
                 group_by_cols = c("group1", "group2"), 
                 fetch_row = "first")
```

or last row with same example. 

```{r ex11}
data("dt_groups")
dataMojo::get_row_group_by(dt_groups, 
                 group_by_cols = c("group1", "group2"), 
                 fetch_row = "last")
```


## Reshape `long to wide` or `wide to long`

Here is an example of reshaping a data table from wide to long. 

```{r ex12}
data("dt_dates")
print(head(dt_dates))
dataMojo::reshape_longer(dt_dates, 
               keep_cols = "Full_name", 
               label_cols = c("Date_Type"), 
               value_cols = "Exact_date")
```

Here is an example of reshaping a data table from long to wide. 

```{r ex13}
data("dt_long")
print(head(dt_long))
dataMojo::reshape_wider(dt_long, 
              keep_cols = c("Full_name"), 
              col_label = c("Date_Type"), 
              col_value = "Exact_date")
```

## Advanced Topic: expand row based on pattern

`row_expand_pattern()` is to expand rows based on a desired column. 

```{r ex14}
data("starwars_simple")
starwars_simple[]
row_expand_pattern(starwars_simple, "films", ", ", "film")[]
```


## Advanced Topic: expand row given start and end dates

`row_expand_dates()` is to expand rows to each date given start and end dates. 

```{r ex15}
dt_dates_simple <- data.table(
  Start_Date = as.Date(c("2020-02-03", "2020-03-01") ),
  End_Date = as.Date(c("2020-02-05", "2020-03-02") ),
  group = c("A", "B")
)
dt_dates_simple[]
row_expand_dates(dt_dates_simple, "Start_Date", "End_Date", "Date")[]

```