Take-home Exercise 6

The take-home exercise provides students the opportunity to revise and practice the R packages and programming skills we learnt in-class at home. This time,the exercise requires students to be innovative and creative by applying appropriate R packages to design enlightening and yet functional data visualization for analytics purposes. Students are encouraged to create multiple data visualization and compare our pros and cons before finalizing the best design.

2. Getting Started

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock','patchwork',
             'tidyverse', 'graphlayouts','stringr')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Importing Data

The code chunk below import SocialNetwork.csv from the data folder into R by using read_csv() of readr and save it as a tibble data frame called social_node.

social <- read_csv("data/SocialNetwork.csv")

p <- read_csv("data/Participants.csv")

4. Data Preparation

Data wrangling for social edge

As original data set is too big to run successfully, I choose a whole month and participant 0 to 50 for further insights exploration.

Mar_social_edge <- social %>%    # Extract 50 Participant
  select(timestamp = timestamp, source = participantIdFrom, target = participantIdTo) %>%
  mutate(Weekday = wday(timestamp, label = TRUE,abbr = FALSE)) %>%
  mutate(YearMon = format(as.Date(timestamp), "%Y-%m")) %>%
  filter(YearMon == "2022-03")%>%
  filter(source < 50)

Mar50_social_edge<- Mar_social_edge %>%
  filter(source<50) %>%
  mutate(dayType = case_when(Weekday %in% c("Saturday","Sunday")~"Weekend")) %>%
  mutate(dayType = coalesce(dayType, "Weekday"))

write_rds(Mar50_social_edge,"data/Mar50_social_edge.rds")

Mar50_social_edge <- read_rds("data/Mar50_social_edge.rds")

March 2022 Social_edge extraction

MAR50_social_calculation <- Mar50_social_edge %>%  
  group_by(source, target) %>%
    summarise(Weight = n())

median(MAR50_social_calculation$Weight)

MAR50_social_edge_aggregate <- Mar50_social_edge %>%  
  group_by(source, target) %>%
    summarise(Weight = n()) %>%
  filter(source!=target) %>%
  filter(Weight > 11) %>%
  ungroup()

Nodes extraction for March 2022

social_MAR50_node <- MAR50_social_edge_aggregate %>%
  pivot_longer(cols = 1:2,
               names_to = "target",
               values_to = "source")%>%
  distinct(source) %>%
  left_join(y = p, by = c("source" = "participantId")) %>%
  rename(Id = source) %>%
  mutate(householdSize = as.character(factor(householdSize)))%>%
  mutate(P_ID = Id)

Build a graph data

MAR_50_graph <- graph_from_data_frame(MAR50_social_edge_aggregate,
                                 vertices = social_MAR50_node) %>%
  as_tbl_graph()

March Overall interaction by Participant ID

ggraph(MAR_50_graph, layout = "fr") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) + 
  geom_node_point(aes(size = joviality, color = factor(interestGroup))) + 
  geom_node_text(aes(label = P_ID, size = 0.3), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none") +
  ggtitle("March 2022 overall social connection",subtitle = "nodes sorting by joviality") +
  theme(legend.position = 'bottom')+
  theme(plot.title = element_text(size = 17, face = "bold"))

Specific cluster selection

As above social interaction plot shows, the node size is based joviality. The higher joviality it is, the bigger size the node is. A cluster at the bottom with central points are all with lower joviality score which attracting curiosity. So we will step in and to explore further insights.

Cluster member participant Id as following : 980, 844, 1, 59, 38, 951, 633, 584, 49, 765, 931

cluster_node <- social_MAR50_node %>%
  filter(Id %in% c(980,844,1,59,38,951,633,584,49,765,931))

cluster_edge_aggregate <- MAR50_social_edge_aggregate %>%
  filter(source %in% c(1,49,38))

cluster_graph <- graph_from_data_frame(cluster_edge_aggregate,
                                 vertices = cluster_node) %>%
  as_tbl_graph()

Overall social connection visualization for the cluster

From an overall view, we could observe that this cluster connect with each other frequently. However, in a way, participant 38 and 59, participant 1 and 951 seems contact less relatively. And this cluster seems could be treated as 2 portions, one portion is left side, start from participant 38. The other portion is right side, start from participant 59.

On the other hand, it seems the frequent connection not that strictly along with the age. As you may refer to following plot. On contrary, participant 38 and 59 have similar ages (34 and 35 respectively), they didn’t contact with each other more frequent as assumption.

g_ori<- ggraph(cluster_graph, layout = "stress") +
  geom_edge_link(aes(width = Weight, alpha = 0.4)) + 
  geom_node_point(aes(size = joviality*1.2, color = factor(interestGroup))) + 
  geom_node_text(aes(label = P_ID, size = 0.7), repel = TRUE) +
  theme_void() +
  ggtitle("Overall social connection of cluster by ID",subtitle = "by Participant ID") +
  theme(legend.position = 'bottom')+
  theme(plot.title = element_text(size = 17, face = "bold"))


g_age<- ggraph(cluster_graph, layout = "stress") +
  geom_edge_link(aes(width = Weight, alpha = 0.4)) + 
  geom_node_point(aes(size = joviality*1.2, color = factor(interestGroup))) + 
  geom_node_text(aes(label = age, size = 0.7), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none") +
  ggtitle("Is frequent connection related to Age?", subtitle = "Display node by Age") +
  theme(legend.position = 'bottom')+
  theme(plot.title = element_text(size = 17, face = "bold"))

(g_ori)/(g_age)

Social connection observation by weekday for the cluster

cluster_byDay_aggregate <- Mar50_social_edge %>%
   filter(source %in% c(1,49,38)) %>%
  filter (target %in% c(980,844,59,951,633,584,765,931)) %>%
  group_by(source, target,Weekday) %>%
    summarise(Weight = n()) %>%
  filter(source!=target) %>%
  ungroup()

cluster_byDay_graph <- graph_from_data_frame(cluster_byDay_aggregate,
                                 vertices = cluster_node) %>%
  as_tbl_graph()

Accordingly to specific weekdays we may notice on both continuous 2 days, Saturday and Sunday participant 49 contact less with participant 584. Suppose they may usually work together,so get in touch more on workdays.

set_graph_style()
byDay <- ggraph(cluster_byDay_graph, 
            layout = "fr") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(size = joviality,color = factor(interestGroup))) +
  geom_node_text(aes(label = P_ID, size = 0.2), repel = TRUE)+
  ggtitle("Cluster social connection differs by day within weeks") +
  theme(plot.title = element_text(size = 15, face = "bold"))


byDay + facet_edges(~Weekday,scales="free")+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Further Data Enhancement

Then we need additional data to explore potential reason. The additional information is imported from the ParticipantStatusLog in order to study the attributes associated with the participants and potential intersections of the period.

log <- read_csv("data/ParticipantStatusLogs1.csv") %>%
  select(participantId, apartmentId, jobId) %>%
  group_by(participantId, apartmentId, jobId)%>%
  summarise(count = n())

write_rds(log,"data/log_modified.rds")

log_modified <- read_rds("data/log_modified.rds")

jobs <- read_csv("data/Jobs.csv") %>%
  select(jobId, employerId)

modified_node <- social_MAR50_node %>%
  left_join(y = log_modified, by = c("Id" = "participantId")) %>%
  filter(P_ID %in% c(980,844,1,59,38,951,633,584,49,765,931)) %>%
  left_join(y= jobs, by = "jobId") %>%
  select(-count)

Work relationship exploration

Based on below colored social connection plot, we could notice that participant 38, 49, 584, 633 are employed by same company (all nodes are in green). In this condition, we may assume part of daily frequent interaction is caused by work related requirement.

MAR_further_graph <- graph_from_data_frame(cluster_edge_aggregate,
                                 vertices = modified_node) %>%
  as_tbl_graph()

ggraph(MAR_further_graph, layout = "stress") +
  geom_edge_link(aes(width = Weight, alpha = 0.4)) + 
  geom_node_point(aes(size = 1, color = factor(employerId))) + 
  geom_node_text(aes(label = P_ID), repel = TRUE) +
  theme_void() +
  ggtitle("Cluster connection by ID",subtitle = "Colored by Employer") +
  theme(legend.position = 'bottom')

Entertainment relationship exploration

Based on above exploration, it seems only working interaction won’t contribute the most frequent connection with each other. Besides working activity, it seems other daily action contribute to frequent social connection as well. Let’s dive deeper into participant daily check-in log of participant 49.

checkin <- read_csv("data/CheckinJournal.csv")

checkin <- read_csv("data/CheckinJournal.csv") %>%
  mutate(YearMon = format(as.Date(timestamp), "%Y-%m")) %>%
  filter(venueType %in% c("Pub","Restaurant")) %>%
  filter(YearMon == "2022-03")

write_rds(checkin,"data/checkinMar.rds")

checkinMar <- read_rds("data/checkinMar.rds")

check_selected <- checkinMar %>%
  filter(participantId %in% c(38,633,584,49,931,765)) %>%
  mutate(venueId = as.character(factor(venueId))) %>%
  mutate( participantId= as.character(factor(participantId)))

check_pub <- check_selected %>%
  filter(venueType == "Pub")

check_rest <- check_selected %>%
  filter(venueType == "Restaurant")

We select his/ her the most frequent showing up site from the log to cross check with other linked participants. The most popular entertainment space to 49 is Restaurant 1805 and Pub 1799. Then following bar charts mention that, the two people 931 and 765 who contact the most frequent with participant 49, appearing at 49 favorite Pub and Restaurant the most correspondingly as well.

ggplot(data = check_pub,
aes(x = venueId)) +
geom_bar(fill = "steelblue4") +
ylim(0,100) +
geom_text(stat="count", 
      aes(label=paste0(..count.. )),
      vjust=-0.8) +
  xlab("Pub Id") +
  ylab("No.of Check in March ") +
  ggtitle("Situation of the Cluster check-in Pub 1799")+
  theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
        axis.title.y= element_text(angle=0))+
  facet_grid( participantId~.)+
  theme(plot.title = element_text(size = 15, face = "bold"))

ggplot(data = check_rest,
aes(x = venueId)) +
geom_bar(fill = "steelblue4") +
ylim(0,80) +
geom_text(stat="count", 
      aes(label=paste0(..count.. )),
      vjust=-0.8) +
  xlab("Restaurant Id") +
  ylab("No.of Check in March ") +
  facet_grid( participantId~.)+
  ggtitle("Situation of the Cluster check-in Restaurant 1805")+
  theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
        axis.title.y= element_text(angle=0))+
  theme(plot.title = element_text(size = 15, face = "bold"))

Take-home Exercise 6

Author

Affiliation

Published

DOI

1. Overview