Project Overview

Welcome to my Stats 631¹ final project!

This digital portfolio stands as a testament to my proficiency in applied statistical methods and showcases my capacity for ongoing growth and adaptability in the ever-evolving domains of data science and machine learning. During the summer semester of 2023, I not only expanded my statistical knowledge but also cultivated my craft of data storytelling.

At its heart, this project is one hypothetical case study for predictive statistical modeling. Leveraging the R programming language, I conducted model selection and broadened my expertise in data wrangling, visualization, and design aesthetics.

More specifically, this document guides you through a time series inventory prediction— the art of anticipating future consumer demand. Ultimately, it demonstrates how data empowers the decision-making process through forecasting and strategic insights. And remember,

Summary

Objective

evaluate and deploy a predictive model to forecast the inventory volume.

Limitations

the data set only spans 54 weeks, which is less than ideal for time series predictions that typically benefit from multi-year trends. Additionally, my understanding of the data is limited. For this project, I made certain assumptions about the time frame and inventory type. In practical scenarios, collaboration with domain experts are essential components of any predictive analytics project.

Final Results

The Poisson regression model demonstrated reasonable predictive performance, using four predictors chosen through stepwise selection.

this report contains:

Data Processing
- feature engineering
- data splitting into training and testing sets
Model Development
- step-wise model selection methodology
Model Evaluation
- exploratory and descriptive analytics
- interpretation and evaluation of model outcomes

Initial Insights:

The inventory data is comprised of 488 distinct items, sold over a span of 54 weeks (from week 0 to 53), with an impressive sum of approximately 1.3 million units transacted.

Below, I’ve created several visualizations to discern patterns, identify anomalies, and appreciate the overall behavior of the inventory data over time while establishing a baseline familiarity for the predictive analysis.

Code

# import data
sai <-  read.csv("inventory.csv")
# import my custom theme :-)
source("artNov_theme.R")

hist <- sai %>% 
  group_by(item_no) %>%
  summarise(total_sold = sum(sold)) %>% 
 
ggplot(aes(x = total_sold)) +
  geom_histogram(binwidth = 2100, color = "black", fill = "#7D998F", 
                 alpha = .6) +
        scale_x_continuous(labels = scales::comma_format()) +
  
  labs(x = "Number of Items Sold", y = "Frequency", 
       title = "Number of Items Sold by Item Type", 
       caption =  " figure 1.1")+
    theme_art_nouveau()



hist1 <- sai %>% 
  group_by(week) %>%
  summarise(total_sold = sum(sold)) %>% 
 
ggplot(aes(x = total_sold)) +
  geom_histogram(binwidth =  2100, color = "black", fill = "#68576D", 
                 alpha = .6) +
        scale_x_continuous(labels = scales::comma_format()) +
        scale_y_continuous(position = "right") +
  
  labs(x = "Number of Items Sold", 
       title = "Number of Items Sold by Week",
        caption =  " figure 1.2")+
  theme_art_nouveau()

# "ggplot2 - Easy way to mix multiple graphs on the same page", STHDA
# http://www.sthda.com/english/wiki/wiki.php?id_contents=7930#
# display the histograms side by side
gridExtra::grid.arrange(hist, hist1, ncol = 2)

figure 1. The histograms demonstrate the distribution of the dataset by variable. Figure 1.1 plots the frequency of each item sold, with most items ranging between 500 and 1,300 in total. Figure 1.2 conveys the temporal dispersion of weekly sales. Both graphs exhibit a right skew, particularly pronounced in figure 1.1, implying a scarcity of high-sales items or weeks; both appear to have a poisson distribution.

Code

line1 <- sai %>%
  group_by(week) %>%
  summarise(total_sold = sum(sold)) %>%
  ggplot(aes(x = week, y = total_sold)) +
  geom_line(size = 2, alpha = 0.5, color = "#DF5875") +
  geom_point(size = 4, alpha = 0.5, color = "#DF5875")+
  geom_area(fill = "#DF5875", alpha = 0.4) +
  scale_x_continuous(breaks = seq(0, 55, by = 5)) + 
  scale_y_continuous(labels = scales::comma_format()) +
   
  labs(x = "week", y = "Count of Items Sold", 
       title = "Number of Items Sold by Week",
        caption =  " figure 2")+
  theme_art_nouveau()
line1

figure 2. shows annual sales trends on a weekly cadence. Weekly sales fluctuate within a range of 10k to 50k items, with the majority hovering between ~20k and ~30k units sold. A sales peak is observed in week 20, while week 46 registers the slowest sales activity.

Data Processing

From the raw data, I created new features (see data dictionary),that reveal sales dynamics and seasonal patterns. Ideally, this process includes consultation with domain experts to ensure predictor viability and relevance.

data dictionary

Original
- item_no: the unique identifier for each item
- sold: count of items_no sold
- week: the week number.
  - I assume the following dates for the data:
    - starting week 0: December 27, 2021 - January 2, 2022
    - ending week 53: December 26, 2022 - January 1, 2023

Feature Engineered
- year_total: the total number of items sold for each item over the entire year
- date: The specific date corresponding to the first day of the week
- month: the month corresponding to the first day of the week
- lag1: period for capturing seasonality or time-dependent patterns, item’s sold from the previous week
- lag2: item’s sold from two-weeks prior
- rolling_4wk_avg: the average number of items sold over the past four weeks by item_no
- best_seller: a binary variable indicating whether the item is one of the top ten best-selling items based on volume

Code

#####
# create the variables
# total yearly sales for each item
inventory_sum <- sai %>%
  group_by(item_no) %>%
  summarise(year_total = sum(sold), .groups = "drop")

sai <- left_join(sai, inventory_sum, by = c("item_no"))

###
# calculate the top 10% threshold
top_10_threshold <- quantile(sai$year_total, 0.9)

# dummy variable for bestsellers 1= yes 0 = no
sai <- sai %>%
  mutate(best_seller = ifelse(year_total >= top_10_threshold, 1, 0))

## figure out month an year for data, assuming the data is from  2022
# week 0: December 27, 2021 - January 2, 2022
# week 52: December 19, 2022 - December 25, 2022
# week 53: December 26, 2022 - January 1, 2023

sai$year <- 2022

sai <- sai %>%
  mutate(calendar_week = ifelse(week == 0, 52, ifelse(week == 53, 1, week)),
         calendar_year = ifelse(week == 0, year - 1, ifelse(week == 53, year + 1, year)))

#  calendar_year and calendar_week in the MMWRweek2Date function to make a date from week# and year
sai$date <- MMWRweek2Date(sai$calendar_year, sai$calendar_week)



# find the year the week ends in, so we can figure out months
sai <- sai %>%
  mutate(wk_ending_year = ifelse(week %in% c(0, 53), year + 1, year),
         week = ifelse(week == 53, 1, week))



# create the data and extract the month
# wanted month to be quantitative to reduce model complexity

sai$date <- MMWRweek2Date(sai$wk_ending_year, sai$calendar_week)
sai$month <- month(sai$date, label = TRUE)
sai$month <- month(sai$date)


# create lag periods
# "lead & lag R Functions of dplyr Package", StatsGlobe
# https://www.youtube.com/watch?v=Kn3llTjYS5E

sai <- sai %>%
  mutate(lag1 = lag(sold, 1),
         lag2 = lag(sold, 2))

# https://www.rdocumentation.org/packages/zoo/versions/1.8-12/topics/rollmean
sai <- sai %>%
  group_by(item_no) %>%
  mutate(rolling_4wk_avg = zoo::rollmean(sold, k = 4, fill = NA, align = "right")) %>%
  ungroup()

write.csv(sai, "FEinventory.csv")




month_abbreviations <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

sai_average <-  sai %>%
  filter(wk_ending_year == 2022) %>% 
  group_by(month) %>% 
  summarise(month_average = mean(sold))



bar <- sai %>%
    filter(wk_ending_year == 2022) %>% 
    mutate(month = factor(month, levels = 1:12, labels = month_abbreviations)) %>%
  ggplot(aes(x = month, y = sold)) +
  geom_bar(stat = "sum", fill = "#DF5875", alpha = 0.75) +
  labs(x = "Month", y = "Total Sold", title = "Total Sold by Month",
       caption =  " figure 3.1") +
  scale_y_continuous(labels = scales::comma_format()) +
  theme_art_nouveau() +
  theme(legend.position = "none")

dot <-  sai_average %>%
  mutate(month = factor(month, levels = 1:12, labels = month_abbreviations)) %>%
  ggplot(aes(x = month, y = month_average)) +
  geom_point(size = 3) +
  geom_point(size = 4, shape = 8, alpha = 0.75) +
  geom_text(aes(label = round(month_average, 2)), nudge_y = 1.5) +  #closer annotations
  scale_y_continuous(position = "right") +
  labs(x = "Month", y = "", title = "Average Items Sold by Month", caption = "figure 3.2") +
  theme_art_nouveau() +
  theme(axis.text.y = element_blank())  
 
  

gridExtra::grid.arrange(bar, dot, ncol = 2)

figure 3. The bar graph displays the total sales volume by month. July stands out as the peak season, whereas November and December show lower inventory movement. However, the average number of items sold in a month is relatively low compared to the totals. This suggests the presence of numerous zeros, which are affecting the overall average.

Exploring Relationships Between Sales Indicators: A Correlation Matrix

Code

#https://www.geeksforgeeks.org/visualization-of-a-correlation-matrix-using-ggplot2-in-r/

sai %>%
  select(week, month, lag1, lag2, rolling_4wk_avg, best_seller, sold) %>%
  ggpairs(
    upper = list(continuous = wrap("cor", size = 5)),
    lower = list(continuous = "points"),
    diag = list(continuous = "densityDiag")
  ) +
  labs(caption = "Figure 4") +
  theme_art_nouveau()

figure 4. The correlation matrix reveals linear relationships between sets of variables. The data shows an expected high correlation between ‘Month’ and ‘Week’ (0.899), moderate correlation between the ‘Rolling 4-week Average’ and ‘Lag1’ (0.589), and a relatively low correlation between ‘Lag2’ and ‘Lag1’ (0.147). The absence of strong correlation between predictors is essential for avoiding multicollinearity, a factor that reduces model accuracy.

Splitting the Data

With the variables now set, it’s time to divide the data for training and testing the predictive model. In an ideal scenario, the previous year’s data would serve as a training ground, forming a foundation for model testing. However, due to the necessity of maintaining the temporal order, I segmented the data by inventory type, 75% into training and remaining 25% into testing.

Code

# create a list of all items to randomly select

# removing the NAs
sai <- sai %>%
  filter(!is.na(rolling_4wk_avg))

set.seed(1985)
items_sample_list <-  unique(sai$item_no) %>% 
  sample(366)
# 75% of the data


# splitting the inventory by item type to keep the temporal order

is_train <-  sai %>% 
  filter(item_no %in% items_sample_list) %>% 
  dplyr::select(
    item_no,
    week,
    month,
    lag1,
    lag2,
    rolling_4wk_avg,
    best_seller, 
    sold) #not sure why i had to include the dplyr package


is_test<-  sai %>% 
  filter(!(item_no %in% items_sample_list)) %>% 
  dplyr::select(
    item_no,
    week,
    month,
    lag1,
    lag2,
    rolling_4wk_avg,
    best_seller, 
    sold) #not sure why i had to include the dplyr package


## story telling purposes only
# https://r-graph-gallery.com/piechart-ggplot2.html
piecha <- data.frame(
  split = c("train", "test"),
  percentage = c(0.75, 0.25)
)

piecha <- piecha %>%
  arrange(desc(split)) %>%
  mutate(cumulative_percentage = cumsum(percentage) - percentage / 2,
         label = paste(split, ": ", scales::percent(percentage)))

# do you like fancy pie?
ggplot(piecha , aes(x="", y=percentage, fill=split, label = label)) +
  geom_bar(stat="identity", width=2, color="white", alpha=0.6) +
  coord_polar("y", start=0) +
  scale_fill_manual(values=c("#68576D", "#DF5875"), guide = "none") +
  theme_art_nouveau() +
  geom_text(aes(y = cumulative_percentage), color = "white", size = 7, face = "bold") +
  theme(axis.title = element_blank(),
        axis.text = element_blank()) +
  labs(title = "The Data Split",
              caption = "figure 5") +
  theme(plot.title = element_text(hjust = 0.5))

Model Development

Feature Selection

It’s time to use stepwise selection to identify the best factors for the model. This iterative process adds or removes predictors based on statistical significance, preventing unnecessary complexity or overfitting (when the model excels on training data but falters on testing data). It fine-tunes the model to include only relevant predictors.

Code

# "Variable selection in R", Amelia McNamara
# https://youtu.be/IScjygOnO0w


# forward Stepwise Selection
# nvmax = 8 specifies the maximum number of predictors to incorporate in the model.
is_mod <- regsubsets(sold ~ week + month + lag1 + lag2 + rolling_4wk_avg + best_seller,
                       data = is_test, nbest = 2, method = "exhaustive")
is_mod_summary_mx <- with(summary(is_mod), data.frame(rsq,adjr2, cp, rss, outmat))

is_mod_summary_mx$predictors <- c(1,1,2,2,3,3,4,4,5,5,6) # for graphing
is_mod_summary_mx$model <- c('a','b','a','b','a','b','a','b','a', 'b', 'a') # for graphing


# "Great Looking Tables: gt", Posit
#https://posit.co/blog/great-looking-tables-gt-0-2/
# turn results into a pretty table
# is_mod_summary_mx

is_mod_summary_mx %>% 
  mutate(rsq = round(rsq, 1), 
         adjr2 = round(adjr2, 1), 
         cp = scales::comma(round(cp, 1)),
         rss = scales::comma(rss)) %>%
  rename(`4WK ROLLING AVG` = rolling_4wk_avg, `BEST SELLER` = best_seller) %>% 
  select(predictors, model, everything()) %>%
  gt(rowname_col = "predictors") %>%
  tab_header(title = md("Model Comparison, part one"),
             subtitle = "Stepwise Function Output") %>% 
  cols_align(align = "center") %>%   
  tab_options(
    grand_summary_row.background.color = "#D35C9E",
    heading.background.color = "#EFFBFC",
    column_labels.background.color = "#EFFBFC",
    stub.background.color = "#EFFBFC",
    table.font.color = "#323232",
    stub.border.style = "dashed",
    stub.border.width = "1px",
    table.width = "100%") %>%
    tab_spanner(
    label = "predictors ",
    columns = c(week, month, lag1, lag2, `4WK ROLLING AVG`, `BEST SELLER`)) %>%
  opt_all_caps() %>% 
  opt_table_font(
    font = google_font(name = "EB Garamond")) %>% 
    tab_style(
    style = cell_text(size = "24px"), 
    locations = cells_body()) %>%
  tab_source_note(source_note = md("table 1"))

Model Comparison, part one
Stepwise Function Output
	model	rsq	adjr2	cp	rss	predictors
	model	rsq	adjr2	cp	rss	week	month	lag1	lag2	4WK ROLLING AVG	BEST SELLER
1	a	0.4	0.4	3,641.2	347,410,480					*
1	b	0.1	0.1	7,962.8	499,689,200						*
2	a	0.5	0.5	1,860.3	284,585,840				*	*
2	b	0.4	0.4	2,809.3	318,026,117			*		*
3	a	0.6	0.6	7.2	219,215,369			*	*	*
3	b	0.5	0.5	1,862.2	284,582,007		*		*	*
4	a	0.6	0.6	9.1	219,211,953		*	*	*	*
4	b	0.6	0.6	9.1	219,214,795			*	*	*	*
5	a	0.6	0.6	5.0	218,998,851	*	*	*	*	*
5	b	0.6	0.6	11.0	219,211,372		*	*	*	*	*
6	a	0.6	0.6	7.0	218,998,262	*	*	*	*	*	*
table 1

table 1. provides a summary of “best” models determined by several goodness-of-fit statistics (R-square, adjusted R-square, Mallows’ Cp, and the residual sum of squares). Each row represents a model configuration, specified by the model type (a or b) and the predictor variables (week, month, lag1, lag2, 4-week rolling average, best seller). An asterisk () in the predictors’ columns indicates the predictor was included in the model.*

interpretations: RSQ, ADJR2, CP, RSS are measures for evaluating and comparing the goodness of fit or how well the model fits the inventory data. These metrics should not be used blindly but together in the context with other measures and domain knowledge.

Understanding Model Performance

R-squared values explain the variability in the outcome. Several models (3a, 4a, 4b, 5a, 5b, and 6a) are tied at .60 for RSQ, indicating that these models explain a significant portion (60%) of the variance in the outcome variable.

The adjusted R-squared is a modified version of R-squared, it incorporates the model’s degrees of freedom. The ADJR2 value only increases if the new variable improves the model more than expected by chance, Table 2. shows that models 3a, 4a, 4b, 5a, 5b, and 6a all have the highest adjusted R-squared values of 0.6, showing that these models balance model complexity and goodness-of-fit reasonably well.

The best Cp score is when the value is close to the total number of predictors. Model 5 on table 2 exhibits this with lowest Cp score of 5.0 (equal to the number of predictors present in the model).

A high RSS means the predictions are far off from the actual values, which results in large residuals. Model 6a has the lowest RSS value of 218,998,262. This indicates that among all the models, model 6a’s predictions are closest to the actual values, leading to smaller residuals.

Model Evaluation

Code

gcolors <- c("#68576D", "#DF5875")

is_mod_summary_mx %>%
  pivot_longer(c(adjr2, cp), names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = predictors, y = value, color = model)) +
  geom_line(show.legend = FALSE, size = 1, alpha = .5 ) +
    geom_point(size = 4, alpha = 0.5)+
  facet_wrap(~metric, scales = "free") +
  scale_x_continuous(breaks = 1:6) +
  scale_color_manual(values = gcolors) +
  labs(x = "predictors", y = "metric", title = " Models by Number of Predictors, ADJR2 and CP",
       caption =  " figure 6") +
  theme_art_nouveau() +
  theme(
    strip.text = element_text(size = 20, face = "bold"),
    axis.title = element_text(size = 20, face = "bold")
  )

Figure 6. plots the relationship between the number of predictors, ADJR2, and Cp. As expected, as the number of predictors increase the ADJR2 increases and the Cp decreases. However, the benefits appear to taper off after the inclusion of four predictors. This potentially indicates the onset of overfitting beyond this point, suggesting that a model with three predictors could strike a balance between complexity and predictive power.

Maximum Likelihood

Model Selection Given that the response variable “sold” represents count data and has a rightward skew, a generalized linear model (GLM) with a Poisson distribution is one relevant model option. To avoid overfitting, these models will be subjected to an additional level of scrutiny before selecting the best predictors.

Code

# ISLR tidymodels labs
#https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html

# tidy model way


# model 3a
lm_3 <- glm(sold ~ lag1 + lag2 + rolling_4wk_avg,
            data = is_train,
            family = poisson)

# model 4a
lm_4 <- glm(sold ~ month + lag1 + lag2 + rolling_4wk_avg,
            data = is_train,
            family = poisson)

# model 5a
lm_5 <- glm(sold ~ month + week + lag1 + lag2 + rolling_4wk_avg,
            data = is_train,
            family = poisson)

# create a dataframe from the results from glance into a data frame
mod_compare <- rbind(broom::glance(lm_3), broom::glance(lm_4), broom::glance(lm_5))
model_names <- c("3a", "4a", "5a")
mod_compare$model_name <- model_names

# table 2

mod_compare %>%
  select(-null.deviance, -nobs, - df.null) %>%
  rename(`df residual` = df.residual) %>%
 gt(rowname_col = "model_name") %>%
  tab_header(title = "Model Comparison, part two") %>%
  cols_align(align = "center") %>%
  fmt_number(columns = everything(), decimals = 1) %>%
  tab_options(
    grand_summary_row.background.color = "#D35C9E",
    heading.background.color = "#EFFBFC",
    column_labels.background.color = "#EFFBFC",
    stub.background.color = "#EFFBFC",
    table.font.color = "#323232",
    stub.border.style = "dashed",
    table.width = "100%",
    stub.border.width = "1px") %>%
  opt_all_caps() %>%
    opt_table_font(
    font = google_font(name = "EB Garamond")) %>%
    tab_style(
    style = cell_text(size = "20px"),
    locations = cells_body()) %>%
  tab_source_note(source_note = md("table 2"))

Model Comparison, part two
	logLik	AIC	BIC	deviance	df residual
3a	−1,434,660.1	2,869,328.2	2,869,359.5	2,837,960.4	18,662.0
4a	−1,434,057.2	2,868,124.4	2,868,163.5	2,836,754.6	18,661.0
5a	−1,431,700.8	2,863,413.7	2,863,460.7	2,832,041.9	18,660.0
table 2

table 2. reveals that the LOGLIK (log-likelihood) values are very close between the three models, suggesting that all three models fit the data similarly well. Additionally the AIC values are very close supporting the trend seen in the LOGLIK, with model 3a showing the lowest overall others.

After comparing models, the predictors from 4a. will be applied to the testing set. model 4a:

avoids using the highly correlated variables of week and month concurrently (see figure 4.)
excludes unnecessary variables, where not much is gained (see figure 6.)
based on the summary output below, model 4a seems to have significant relationships (p values of less than 2e-16), between the response and predictors.

Here’s the summary of model 4a on training data:

Code

summary(lm_4)


Call:
glm(formula = sold ~ month + lag1 + lag2 + rolling_4wk_avg, family = poisson, 
    data = is_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-249.075    -8.961    -8.717    -5.442    98.437  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      3.715e+00  2.340e-03 1587.56   <2e-16 ***
month           -1.101e-02  3.171e-04  -34.72   <2e-16 ***
lag1            -1.122e-04  1.380e-06  -81.30   <2e-16 ***
lag2            -1.269e-04  1.416e-06  -89.60   <2e-16 ***
rolling_4wk_avg  2.504e-03  1.418e-06 1765.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3962544  on 18665  degrees of freedom
Residual deviance: 2836755  on 18661  degrees of freedom
AIC: 2868124

Number of Fisher Scoring iterations: 7

Analytics and Evaluation

With tidymodels functions, the formula is fitted and then applied to the testing data, the 25% of the inventory separated earlier. The results below visualize the differences between the Poisson model’s predictions and the actual weekly inventory demands.

Code

lm4_wf <-  workflow(sold ~ month + lag1 + lag2 + rolling_4wk_avg, 
                    poisson_reg())

lm4_fit <- fit(lm4_wf, data = is_train)

# Make predictions on the testing data
predictions <- predict(lm4_fit, new_data = is_test)



results <- bind_cols(predictions, is_test) %>%
  select(week, month, item_no, sold, .pred) %>%
  mutate(residulals =  .pred - sold) %>%
  pivot_longer(cols = c(sold, .pred), names_to = "type", values_to = "total_sold") %>%
  mutate(type = ifelse(type == "sold", "Actual", "Predicted"))


line2 <- results %>%
  group_by(week, type) %>%
  summarise(total_sold = sum(total_sold)) %>%
  ggplot(aes(x = week, y = total_sold, color = type)) +
  geom_line(size = 2, alpha = 0.5) +
  geom_point(size = 4, alpha = 0.5)+
  scale_x_continuous(breaks = seq(0, 55, by = 5)) + 
  scale_y_continuous(labels = scales::comma_format()) +
   
  labs(x = "week", y = "Count of Items Sold", 
       title = "Predicted vs Actual Items Sold by Week",
       subtitle = "test subset data",
        caption =  " figure 7")+
  scale_color_manual(values = c("Predicted" = "#DF5875", "Actual" = "#68576D"))+
  theme_art_nouveau()

line2

Figure 7. illustrates the comparison between the actual and predicted counts of items sold by week. The line plot demonstrates that the model’s predictions generally align with the actual sales trends, capturing the underlying pattern of the training data. However,there are areas of concern, most notably, during weeks 38 and 39, there was a significant over-prediction for item A510004 (the top item by volume) by ~6k units for both weeks. Such deviations may indicate areas where the model’s assumptions might not fully capture the underlying dynamics.

Conclusion

The Poisson regression model is more than a statistical exercise – it’s a powerful tool for strategic planning in a world with tightening external factors, where nimble, effective decision-making is no longer a competitive edge but the norm. Predictive analytics offers a sophisticated approach to understanding and anticipating consumer patterns.

What does the model reveal? In this instance, it’s moderately accurate, but those occasional misses aren’t mere mistakes; they are invitations to dig deeper. Each discrepancy in weekly sales predictions points to new paths for refinement. The missteps guide us to search for unseen variables. Is it the rhythm of seasons, the pulse of the market, or the effects of a new promotion?

Much like my leap into the domain of data science, the modeling process must be adaptive.

Sources

References

“Rollmean: Rolling Means/Maximums/Medians/Sums,” Data Camp, ND.
“Multiple Regression, AIC, AICc, and BIC Basics,” Youtube, uploaded by Foltz, Brandon, 31 Aug 2021.
“Visualization of a correlation matrix using ggplot2 in R,” GeeksforGeeks, ND.
Holtz, Yan “ggplot2 Piechart,” R Graph Gallery, ND.
Hvitfeldt, Emil, ISLR Tidymodels Labs, 6 Aug 2022.
McCormick, Keith, “Predictive Analytics Essential Training for Executives,” LinkedIn Learning, 25 Feb, 2020.
McNamara, Amelia, “Variable selection in R,” YouTube, 26 Oct, 2020.
Iannone, Richard, “Great Looking Tables: gt,” Posit, 08 April 2020
“How to create your own custom ggplot2 theme,” YouTube, uploaded by Riffomonas Project, 7 Oct 2021.
“Lead & lag R Functions of dplyr Package” YouTube, uploaded by Statistics Globe, 2 Dec 2019.
OpenAI | AI Disclosure Statement: With prior approval from my course professor, I utilized OpenAI’s tools for idea generation, proofreading, and code debugging in my work. As of August 6, 2023, Grand Valley State University (GVSU) has no official policy on AI use in academics. In lieu of direct guidance, I followed recommendations from Duke University and Scribbr’s overview of “University Policies on AI Writing Tools.” My final code was derived from both my work, the cited sources, and Stat 631 course materials. Generative AI served as one more learning resource, in addition to YouTube, human review, Posit forums, Linked-In Learning, and educational blogs. With this disclosure, I aim to contribute to the responsible and intellectually honest use of AI technology in scholarly work.

Image Credits

Packages Used:

broom (Robinson, Hayes, and Couch 2023)
GGally (Schloerke et al. 2021)
gt (Iannone et al. 2023)
leaps (Miller 2020)
MMWRweek (Niemi 2020)
pacman(Rinker and Kurkiewicz 2018)
poissonreg (Kuhn and Frick 2022)
tidymodels (Kuhn and Wickham 2020)
tidyverse (Wickham et al. 2019)

References

Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2023. “Gt: Easily Create Presentation-Ready Display Tables.” https://CRAN.R-project.org/package=gt.

Kuhn, Max, and Hannah Frick. 2022. “Poissonreg: Model Wrappers for Poisson Regression.” https://CRAN.R-project.org/package=poissonreg.

Kuhn, Max, and Hadley Wickham. 2020. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.” https://www.tidymodels.org.

Miller, Thomas Lumley based on Fortran code by Alan. 2020. “Leaps: Regression Subset Selection.” https://CRAN.R-project.org/package=leaps.

Niemi, Jarad. 2020. “MMWRweek: Convert Dates to MMWR Day, Week, and Year.” https://CRAN.R-project.org/package=MMWRweek.

Rinker, Tyler W., and Dason Kurkiewicz. 2018. “Pacman: Package Management for r.” http://github.com/trinker/pacman.

Robinson, David, Alex Hayes, and Simon Couch. 2023. “Broom: Convert Statistical Objects into Tidy Tibbles.” https://CRAN.R-project.org/package=broom.

Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2021. “GGally: Extension to ’Ggplot2’.” https://CRAN.R-project.org/package=GGally.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse” 4: 1686. https://doi.org/10.21105/joss.01686.

Footnotes

STA 631 - STATISTICAL MODELING AND REGRESSION

Traditional and modern computationally intensive statistical modeling techniques. Basics of probability theory, including conditional probability, Bayes’ Theorem, and univariate probability models. Regression modeling and prediction including simple linear, multiple, logistic, Poisson, and nonlinear and nonparametric regression. Methods for model selection and shrinkage. Emphasis is on application and interpretation using statistical software. ~GVSU Course Catalog ↩︎

Project Overview

Summary

Objective

Limitations

Final Results

Initial Insights:

figure 2. shows annual sales trends on a weekly cadence. Weekly sales fluctuate within a range of 10k to 50k items, with the majority hovering between ~20k and ~30k units sold. A sales peak is observed in week 20, while week 46 registers the slowest sales activity.

Data Processing

data dictionary

Splitting the Data

Model Development

Feature Selection

Understanding Model Performance

Model Evaluation

Maximum Likelihood

table 2. reveals that the LOGLIK (log-likelihood) values are very close between the three models, suggesting that all three models fit the data similarly well. Additionally the AIC values are very close supporting the trend seen in the LOGLIK, with model 3a showing the lowest overall others.

Analytics and Evaluation

Conclusion

Sources

References

Image Credits

Packages Used:

References

Footnotes

STA 631 - STATISTICAL MODELING AND REGRESSION