├── data
    ├── bikes.xlsx
    ├── orders.xlsx
    ├── bike_sales.Rds
    ├── bikeshops.xlsx
    └── customer_product_interactions.xlsx
├── figures
    └── OrderSimProcess.jpg
├── scripts
    ├── createProductQuantities.R
    ├── assignCustomersToOrders.R
    ├── createOrdersAndLines.R
    ├── assignProductsToCustomerOrders.R
    ├── orderScript.R
    └── createDatesFromOrders.R
└── README.md


/data/bikes.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/data/bikes.xlsx

--------------------------------------------------------------------------------
/data/orders.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/data/orders.xlsx

--------------------------------------------------------------------------------
/data/bike_sales.Rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/data/bike_sales.Rds

--------------------------------------------------------------------------------
/data/bikeshops.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/data/bikeshops.xlsx

--------------------------------------------------------------------------------
/figures/OrderSimProcess.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/figures/OrderSimProcess.jpg

--------------------------------------------------------------------------------
/data/customer_product_interactions.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdancho84/orderSimulatoR/HEAD/data/customer_product_interactions.xlsx

--------------------------------------------------------------------------------
/scripts/createProductQuantities.R:
--------------------------------------------------------------------------------
createProductQuantities <- function(orders, maxQty = 50, rate = 1.2) {

    # Fifth and final step in order creation
    # Creates quantities of products on each line

    # Requires:
    #################################################

    # orders: created from previous four steps
    # maxQty: Maximum quantity of products on an order line.
    # rate: Uses function 1/i^rate to generate discrete probability for quantity of products on a line, where
    #       i = 1:maxQty and rate manipulates the likelihood of seeing lower quantities versus higher quantities.
    #       Values greater than zero cause a distribution weighted to lower quantities on each line.
    #       Values less than zero cause the distribution to be more heavily weighted to larger quantities on each line.

    require(dplyr)


    # Code:
    #################################################

    # Generate discrete probabilities for line quantities
    qtyProb <- NULL
    for (i in 1:maxQty) {
        qtyProb[i] <- 1/i^rate
    }
    qtyProb <- qtyProb/sum(qtyProb)

    # Generate order line quantities
    set.seed(100) # For reproducibility
    quantity <- sample(x = 1:maxQty,
                       size = nrow(orders),
                       replace = TRUE,
                       prob = qtyProb)

    # Combine orders and line quantities into a data frame
    orders <- cbind(orders, quantity)
    orders

}

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# orderSimulatoR

___Fast and easy `R` order simulation for customer and product learning!___

## About

`orderSimulatoR` enables fast and easy creation of order data for simulation, data mining and machine learning. In its current form, `orderSimulatoR` is a collection of scripts that can be used to generate sample order data from the following inputs: customer table (e.g. `bikeshops.xlsx`), products table (e.g. `bikes.xlsx`), and customer-products interaction table (e.g. `customer_product_interactions.xlsx`). The output will be order data. Example input files are provided (refer to the `data` folder). The output generated is similar to that in the file `orders.xlsx`.

## Why this Helps

It's very difficult to create custom order data for data mining, visualization, trending, etc. I've searched for good data sets, and I came to the conclusion that I'm better off creating my own orders data for messing around with on my blog. In the process, I made an algorithm to generate the orders. I made the algorithm publicly available so others can use it to show off their analytical abilities.

## Creating Orders

The process to create orders (shown below) is fast and easy, and the result is orders with customized trends depending on the inputs you create and the parameters you select. I've provided some sample data in the `data` folder to help with the explanation. The scripts used are in the `scripts` folder. [Click here](http://www.mattdancho.com/business/2016/07/12/orderSimulatoR.html) for an in-depth walkthrough.

Order Simulation Process

## Example Usage

* [ORDERSIMULATOR: SIMULATE ORDERS FOR BUSINESS ANALYTICS](http://www.mattdancho.com/business/2016/07/12/orderSimulatoR.html) - This is a walkthrough on how to simulate orders using `orderSimulatoR`.
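
Several of the scripts (lines per order, customers per order, quantity per line) weight their random draws with a normalized `1/i^rate` distribution, so the `rate` parameter is the main lever for shaping trends. The sketch below is an illustration only: `powerLawProbs` is a hypothetical helper name, not a function in the `scripts` folder, and it uses a vectorized form rather than the loops in the scripts.

```r
# Hypothetical helper (not part of the repository):
# normalized 1/i^rate weights for i = 1..n
powerLawProbs <- function(n, rate) {
    w <- 1 / seq_len(n)^rate
    w / sum(w)
}

# rate = 0 makes every value equally likely
round(powerLawProbs(5, rate = 0), 3)   # 0.2 0.2 0.2 0.2 0.2

# Larger positive rates concentrate probability on small values of i
round(powerLawProbs(5, rate = 1), 3)
round(powerLawProbs(5, rate = 3), 3)

# Negative rates shift probability toward large values of i
round(powerLawProbs(5, rate = -1), 3)
```

Comparing the printed vectors for a few values of `rate` is a quick way to pick parameters before running the full simulation.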

--------------------------------------------------------------------------------
/scripts/assignCustomersToOrders.R:
--------------------------------------------------------------------------------
assignCustomersToOrders <- function(orders, customers, rate = 0.6) {

    # Third step in order creation
    # Assigns customers to orders using a customer-line frequency distribution

    # Requires:
    #################################################

    # orders: a data frame of orders after performing the previous two steps
    # customers: a data frame of customers with ids in the first column
    # rate: Rate of probability of customer-orders, uses 1/i^rate to create
    #       discrete probability where i is range of customers


    require(dplyr)


    # Code:
    #################################################

    # Get customer ids and number of customers
    customer.id <- customers[,1]    # Customer id vector
    n <- length(customer.id)        # Number of customers

    # Shuffle customer ids
    set.seed(100)
    customer.id.random <- sample(customer.id)

    # Generate distribution for customer order-line frequency
    custProb <- NULL
    for (i in 1:n) {
        custProb[i] <- 1/i^rate
    }
    custProb <- custProb/sum(custProb)

    # Sample random customers using the customer distribution
    order.id.unique <- unique(orders$order.id)
    set.seed(101)
    customerOrderAssignment <- sample(x = customer.id.random,
                                      size = length(order.id.unique),
                                      replace = TRUE,
                                      prob = custProb)

    # Combine order-customer assignment for left_join
    orderCustomerDF <- as.data.frame(cbind(order.id.unique, customerOrderAssignment))
    orderCustomerDF <- rename(orderCustomerDF,
                              order.id = order.id.unique,
                              customer.id = customerOrderAssignment)

    # Merge order-customer assignment with orders by order.id
    orders <- left_join(orders, orderCustomerDF, by = "order.id")

    # Return the orders with customer ids attached
    orders

}

--------------------------------------------------------------------------------
/scripts/createOrdersAndLines.R:
--------------------------------------------------------------------------------
createOrdersAndLines <- function(n = 1500, maxLines = 30, rate = 0.8) {

    # First step in generating orders
    # Generates a list of orders and order lines, and returns a data frame

    # Requires:
    #################################################

    # n: Number of orders to generate
    # maxLines: Maximum number of lines on an order. Note this cannot exceed number of product ids.
    # rate: Uses function 1/i^rate to generate discrete probability for lines on an order, where
    #       i = 1:maxLines and rate manipulates the likelihood of seeing lower value lines versus higher value lines.
    #       Values greater than zero cause a distribution weighted to lower line counts on each order.
    #       Values less than zero cause the distribution to be more heavily weighted to larger line counts.


    # Code:
    #################################################

    # Generate discrete probabilities for line counts
    lineProb <- NULL
    for (i in 1:maxLines) {
        lineProb[i] <- 1/i^rate
    }
    lineProb <- lineProb/sum(lineProb)

    # Generate unique order id's
    order.id.unique <- seq_len(n)

    # Generate order line counts
    set.seed(100) # For reproducibility
    order.lines.count <- sample(x = 1:maxLines,
                                size = n,
                                replace = TRUE,
                                prob = lineProb)

    # Generate list of order id's
    order.id <- NULL
    for (i in 1:n) {
        order.id <- c(order.id,
                      rep(order.id.unique[i], order.lines.count[i]))

    }

    # Generate list of order lines
    order.line <- NULL
    for (i in 1:n) {
        for (j in 1:order.lines.count[i]) {
            order.line <- c(order.line,
                            j)
        }
    }

    # Combine order.id and order.line into a data frame
    orders <- as.data.frame(cbind(order.id, order.line))
    orders

}


--------------------------------------------------------------------------------
/scripts/assignProductsToCustomerOrders.R:
--------------------------------------------------------------------------------
assignProductsToCustomerOrders <- function(orders, customerProductProbs) {

    # Fourth step in order creation
    # Assigns product id's to customer-order-lines

    # Requires:
    #################################################

    # orders: created from previous three steps
    # customerProductProbs: a matrix linking each product.id to customer.ids,
    #       with the value in each cell indicating the probability of a particular
    #       customer.id selecting a particular product.id


    require(dplyr)


    # Process product distributions for each customer
    #######################################################

    product.id <- customerProductProbs[,1]
    customerProbMatrix <- customerProductProbs[,2:ncol(customerProductProbs)]

    # Make sure values in matrix columns sum to one
    customerProbMatrix <- t(t(customerProbMatrix)/colSums(customerProbMatrix))


    # Assign products to customer-order lines
    #######################################################

    # Get number of lines on each customer-order to sample the product list
    customerSamplesNeeded <- orders %>%
        group_by(order.id, customer.id) %>%
        summarise(line.count = n())

    # Sample products according to customer-product probabilities
    for (i in 1:nrow(customerSamplesNeeded)) {
        set.seed(i)
        cust.id <- customerSamplesNeeded[[i,2]] # Retrieve customer id for probability
        custProbability <- customerProbMatrix[,cust.id]
        customerSamplesNeeded$product.sampling[[i]] <- sample(x = product.id,
                                                              size = customerSamplesNeeded$line.count[[i]],
                                                              replace = FALSE,
                                                              prob = custProbability)
    }

    # Unlist the samples to create a vector of length of order-lines
    product.samples <- as.list(customerSamplesNeeded$product.sampling)
    product.id <- unlist(product.samples)

    # Combine the orders with the products selected and return orders
    orders <- cbind(orders, product.id)
    orders

}

--------------------------------------------------------------------------------
/scripts/orderScript.R:
--------------------------------------------------------------------------------
# ORDER SIMULATOR SCRIPT
###################################################

# This script can be used as a template to generate orders.
# The inputs to the order simulation script are the following data sets:
#   1. bikeshops.xlsx: An Excel file of 30 customers including the customer.id,
#      customer.name, customer.city, and customer.state
#   2. bikes.xlsx: An Excel file of 97 bike models and various product data
#   3. customer_product_interactions.xlsx: An Excel file with a matrix of
#      probabilities for the likelihood of a customer.id selecting the bike.id.
#      See the Excel file for more information on how to create an interaction
#      matrix.

source("./scripts/createOrdersAndLines.R")
source("./scripts/createDatesFromOrders.R")
source("./scripts/assignCustomersToOrders.R")
source("./scripts/assignProductsToCustomerOrders.R")
source("./scripts/createProductQuantities.R")

require(xlsx)

# Read customer, product, and customer-product interaction data
####################################################

customers <- read.xlsx("./data/bikeshops.xlsx", sheetIndex = 1)
products <- read.xlsx("./data/bikes.xlsx", sheetIndex = 1)
customerProductProbs <- read.xlsx("./data/customer_product_interactions.xlsx",
                                  sheetIndex = 1,
                                  startRow = 15)
customerProductProbs <- customerProductProbs[,-(2:11)]  # Remove unnecessary columns


# Create orders
####################################################

# Step 1 - Create orders and lines
orders <- createOrdersAndLines(n = 2000, maxLines = 30, rate = 1)

# Step 2 - Add dates to the orders
orders <- createDatesFromOrders(orders,
                                startYear = 2011,
                                yearlyOrderDist = c(.16, .18, .22, .20, .24),
                                monthlyOrderDist = c(0.045,
                                                     0.075,
                                                     0.100,
                                                     0.110,
                                                     0.120,
                                                     0.125,
                                                     0.100,
                                                     0.085,
                                                     0.075,
                                                     0.060,
                                                     0.060,
                                                     0.045))

# Step 3 - Assign customer id's to order lines
orders <- assignCustomersToOrders(orders, customers, rate = 0.8)

# Step 4 - Assign product id's to orders based on the customer product probabilities
orders <- assignProductsToCustomerOrders(orders, customerProductProbs)

# Step 5 - Create product quantities for order lines
orders <- createProductQuantities(orders, maxQty = 10, rate = 3)


# Export orders
####################################################

# Warning: this step can take a significant amount of time depending on the dataset size
write.xlsx(orders, file = "./data/orders.xlsx")

--------------------------------------------------------------------------------
/scripts/createDatesFromOrders.R:
--------------------------------------------------------------------------------
createDatesFromOrders <- function(orders,
                                  monthlyOrderDist = c(0.045,
                                                       0.075,
                                                       0.100,
                                                       0.110,
                                                       0.120,
                                                       0.125,
                                                       0.100,
                                                       0.085,
                                                       0.075,
                                                       0.060,
                                                       0.060,
                                                       0.045),      # Seasonality of orders
                                  yearlyOrderDist = c(0.25, 0.325, 0.425), # Growth of orders in each year
                                  startYear = 2010
                                  ) {

    # Second step in order creation
    # Computes dates from the orders template.

    # Requires:
    #################################################

    # orders: a data frame of orders after performing Step 1
    # monthlyOrderDist: a vector of distributions (length = 12, sum = 1) that
    #       indicate the distribution of orders received in each month of the year
    # yearlyOrderDist: a vector of distributions (length = optional, sum = 1) that
    #       indicate the fluctuations in orders over successive years. The length is used
    #       to determine the number of years that orders span
    # startYear: the year of the beginning of orders

    require(dplyr)
    require(tidyr)
    require(lubridate)


    # Code:
    #################################################

    # Create distributions for random order dates.
    endYear <- startYear + (length(yearlyOrderDist)-1)
    years <- seq(from = startYear, to = endYear)


    # Get list of orders and count of orders
    orderList <- unique(orders$order.id)
    orderCount <- length(orderList)

    # Create random dates
    set.seed(100) # Needed for reproducible samples of month and years
    randomMonths <- sample(x=seq(1,12), size=orderCount, replace=TRUE, prob=monthlyOrderDist)
    randomYears <- sample(x=years, size=orderCount, replace=TRUE, prob=yearlyOrderDist)
    randomDays <- sample(x=seq(1,31), size=orderCount, replace=TRUE)

    # Combine random days, months and years
    randomDates <- as.data.frame(cbind(randomDays, randomMonths, randomYears))
    names(randomDates) <- c("Day", "Month", "Year")

    # Check if random Day exceeds last day of month; if so, replace with last day of month
    randomDates <- randomDates %>%
        mutate(FirstOfMonth = ymd(paste(Year, Month, "01"))) %>%
        mutate(LastOfMonth = ceiling_date(FirstOfMonth + days(1), "month") - days(1)) %>%
        mutate(MaxDay = day(LastOfMonth)) %>%
        mutate(Day = ifelse(Day > MaxDay, MaxDay, Day))

    # Get random date and day of week
    randomDates <- randomDates %>%
        mutate(Date = ymd(paste(Year, Month, Day))) %>%
        mutate(DOW = wday(Date))

    # Take care of weekends which are non-working days
    # Column 1 is Day and column 8 is DOW (1 = Sunday, 7 = Saturday)
    for (i in 1:orderCount){
        # Sundays
        if (randomDates[[i, 8]] == 1 && randomDates[[i,1]] <= 15){
            randomDates[[i, 1]] <- randomDates[[i, 1]] + (2 + i%%2) # Shift forward 2 or 3 days (Tues or Wed)
        }
        if (randomDates[[i, 8]] == 1 && randomDates[[i,1]] > 15){
            randomDates[[i, 1]] <- randomDates[[i, 1]] - (4 + i%%2) # Shift back 4 or 5 days (Wed or Tues)
        }
        # Saturdays
        if (randomDates[[i, 8]] == 7 && randomDates[[i, 1]] >= 15){
            randomDates[[i, 1]] <- randomDates[[i, 1]] - (3 + i%%2) # Split between Tues and Wed
        }
        if (randomDates[[i, 8]] == 7 && randomDates[[i, 1]] < 15){
            randomDates[[i, 1]] <- randomDates[[i, 1]] + (5 + i%%2) # Split between Thurs and Fri
        }
    }
    randomDates <- randomDates %>%
        mutate(Date2 = ymd(paste(Year, Month, Day))) %>%
        mutate(DOW2 = wday(Date2))

    # Select date column only, and order dates in chronological order
    dates <- randomDates %>%
        select(Date2) %>%
        arrange(Date2) %>%
        rename(Date = Date2)

    # Join dates with unique orders
    order.dates <- as.data.frame(cbind(orderList, dates))
    order.dates <- rename(order.dates, order.id = orderList, order.date = Date)

    # Add dates to template
    ordersAndLines <- orders %>%
        select(order.id, order.line) %>%
        left_join(order.dates, by = "order.id")

    # Return
    ordersAndLines

}

--------------------------------------------------------------------------------
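
The day-clamping step in `createDatesFromOrders` (a sampled day from 1 to 31 is capped at the last day of its month) can be tried in isolation. The following is a minimal base-R sketch of that idea, written without `lubridate`. `lastDayOfMonth` and `clampDay` are hypothetical helper names that do not exist in the repository, and the script itself performs this step with `ceiling_date` inside a `mutate` chain.

```r
# Hypothetical helper (not in the repository): last day of a given month.
lastDayOfMonth <- function(year, month) {
    first <- as.Date(sprintf("%d-%02d-01", as.integer(year), as.integer(month)))
    # First day of the following month, minus one day
    nextFirst <- seq(first, by = "month", length.out = 2)[2]
    as.integer(format(nextFirst - 1, "%d"))
}

# Hypothetical helper: cap a sampled day at the last day of its month.
clampDay <- function(day, year, month) {
    min(day, lastDayOfMonth(year, month))
}

clampDay(31, 2012, 2)   # 29: February 2012 falls in a leap year
clampDay(31, 2011, 4)   # 30: April has 30 days
clampDay(15, 2011, 4)   # 15: days already in range are unchanged
```

These helpers handle one date at a time for readability; the script applies the same cap to a whole column at once.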