This is the 6th take-home exercise in a series of take-home exercises for the Visual Analytics module.In this exercise, we are required to reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.
The take-home exercise provides students the opportunity to revise and practice the R packages and programming skills we learnt in-class at home. This time,the exercise requires students to be innovative and creative by applying appropriate R packages to design enlightening and yet functional data visualization for analytics purposes. Students are encouraged to create multiple data visualization and compare our pros and cons before finalizing the best design.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock','patchwork',
'tidyverse', 'graphlayouts','stringr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below import SocialNetwork.csv from the data
folder into R by using read_csv()
of readr
and save it as a tibble data frame called social_node.
social <- read_csv("data/SocialNetwork.csv")
p <- read_csv("data/Participants.csv")
As original data set is too big to run successfully, I choose a whole month and participant 0 to 50 for further insights exploration.
Mar_social_edge <- social %>% # Extract 50 Participant
select(timestamp = timestamp, source = participantIdFrom, target = participantIdTo) %>%
mutate(Weekday = wday(timestamp, label = TRUE,abbr = FALSE)) %>%
mutate(YearMon = format(as.Date(timestamp), "%Y-%m")) %>%
filter(YearMon == "2022-03")%>%
filter(source < 50)
Mar50_social_edge<- Mar_social_edge %>%
filter(source<50) %>%
mutate(dayType = case_when(Weekday %in% c("Saturday","Sunday")~"Weekend")) %>%
mutate(dayType = coalesce(dayType, "Weekday"))
write_rds(Mar50_social_edge,"data/Mar50_social_edge.rds")
Mar50_social_edge <- read_rds("data/Mar50_social_edge.rds")
MAR50_social_calculation <- Mar50_social_edge %>%
group_by(source, target) %>%
summarise(Weight = n())
median(MAR50_social_calculation$Weight)
[1] 12
MAR50_social_edge_aggregate <- Mar50_social_edge %>%
group_by(source, target) %>%
summarise(Weight = n()) %>%
filter(source!=target) %>%
filter(Weight > 11) %>%
ungroup()
social_MAR50_node <- MAR50_social_edge_aggregate %>%
pivot_longer(cols = 1:2,
names_to = "target",
values_to = "source")%>%
distinct(source) %>%
left_join(y = p, by = c("source" = "participantId")) %>%
rename(Id = source) %>%
mutate(householdSize = as.character(factor(householdSize)))%>%
mutate(P_ID = Id)
MAR_50_graph <- graph_from_data_frame(MAR50_social_edge_aggregate,
vertices = social_MAR50_node) %>%
as_tbl_graph()
ggraph(MAR_50_graph, layout = "fr") +
geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
geom_node_point(aes(size = joviality, color = factor(interestGroup))) +
geom_node_text(aes(label = P_ID, size = 0.3), repel = TRUE) +
theme_void() +
theme(legend.position = "none") +
ggtitle("March 2022 overall social connection",subtitle = "nodes sorting by joviality") +
theme(legend.position = 'bottom')+
theme(plot.title = element_text(size = 17, face = "bold"))
As above social interaction plot shows, the node size is based joviality. The higher joviality it is, the bigger size the node is. A cluster at the bottom with central points are all with lower joviality score which attracting curiosity. So we will step in and to explore further insights.
Cluster member participant Id as following : 980, 844, 1, 59, 38, 951, 633, 584, 49, 765, 931
cluster_graph <- graph_from_data_frame(cluster_edge_aggregate,
vertices = cluster_node) %>%
as_tbl_graph()
From an overall view, we could observe that this cluster connect with each other frequently. However, in a way, participant 38 and 59, participant 1 and 951 seems contact less relatively. And this cluster seems could be treated as 2 portions, one portion is left side, start from participant 38. The other portion is right side, start from participant 59.
On the other hand, it seems the frequent connection not that strictly along with the age. As you may refer to following plot. On contrary, participant 38 and 59 have similar ages (34 and 35 respectively), they didn’t contact with each other more frequent as assumption.
g_ori<- ggraph(cluster_graph, layout = "stress") +
geom_edge_link(aes(width = Weight, alpha = 0.4)) +
geom_node_point(aes(size = joviality*1.2, color = factor(interestGroup))) +
geom_node_text(aes(label = P_ID, size = 0.7), repel = TRUE) +
theme_void() +
ggtitle("Overall social connection of cluster by ID",subtitle = "by Participant ID") +
theme(legend.position = 'bottom')+
theme(plot.title = element_text(size = 17, face = "bold"))
g_age<- ggraph(cluster_graph, layout = "stress") +
geom_edge_link(aes(width = Weight, alpha = 0.4)) +
geom_node_point(aes(size = joviality*1.2, color = factor(interestGroup))) +
geom_node_text(aes(label = age, size = 0.7), repel = TRUE) +
theme_void() +
theme(legend.position = "none") +
ggtitle("Is frequent connection related to Age?", subtitle = "Display node by Age") +
theme(legend.position = 'bottom')+
theme(plot.title = element_text(size = 17, face = "bold"))
(g_ori)/(g_age)
cluster_byDay_aggregate <- Mar50_social_edge %>%
filter(source %in% c(1,49,38)) %>%
filter (target %in% c(980,844,59,951,633,584,765,931)) %>%
group_by(source, target,Weekday) %>%
summarise(Weight = n()) %>%
filter(source!=target) %>%
ungroup()
cluster_byDay_graph <- graph_from_data_frame(cluster_byDay_aggregate,
vertices = cluster_node) %>%
as_tbl_graph()
Accordingly to specific weekdays we may notice on both continuous 2 days, Saturday and Sunday participant 49 contact less with participant 584. Suppose they may usually work together,so get in touch more on workdays.
set_graph_style()
byDay <- ggraph(cluster_byDay_graph,
layout = "fr") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(size = joviality,color = factor(interestGroup))) +
geom_node_text(aes(label = P_ID, size = 0.2), repel = TRUE)+
ggtitle("Cluster social connection differs by day within weeks") +
theme(plot.title = element_text(size = 15, face = "bold"))
byDay + facet_edges(~Weekday,scales="free")+
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
Then we need additional data to explore potential reason. The additional information is imported from the ParticipantStatusLog in order to study the attributes associated with the participants and potential intersections of the period.
log <- read_csv("data/ParticipantStatusLogs1.csv") %>%
select(participantId, apartmentId, jobId) %>%
group_by(participantId, apartmentId, jobId)%>%
summarise(count = n())
write_rds(log,"data/log_modified.rds")
log_modified <- read_rds("data/log_modified.rds")
jobs <- read_csv("data/Jobs.csv") %>%
select(jobId, employerId)
modified_node <- social_MAR50_node %>%
left_join(y = log_modified, by = c("Id" = "participantId")) %>%
filter(P_ID %in% c(980,844,1,59,38,951,633,584,49,765,931)) %>%
left_join(y= jobs, by = "jobId") %>%
select(-count)
Based on below colored social connection plot, we could notice that participant 38, 49, 584, 633 are employed by same company (all nodes are in green). In this condition, we may assume part of daily frequent interaction is caused by work related requirement.
MAR_further_graph <- graph_from_data_frame(cluster_edge_aggregate,
vertices = modified_node) %>%
as_tbl_graph()
ggraph(MAR_further_graph, layout = "stress") +
geom_edge_link(aes(width = Weight, alpha = 0.4)) +
geom_node_point(aes(size = 1, color = factor(employerId))) +
geom_node_text(aes(label = P_ID), repel = TRUE) +
theme_void() +
ggtitle("Cluster connection by ID",subtitle = "Colored by Employer") +
theme(legend.position = 'bottom')
Based on above exploration, it seems only working interaction won’t contribute the most frequent connection with each other. Besides working activity, it seems other daily action contribute to frequent social connection as well. Let’s dive deeper into participant daily check-in log of participant 49.
checkin <- read_csv("data/CheckinJournal.csv")
checkin <- read_csv("data/CheckinJournal.csv") %>%
mutate(YearMon = format(as.Date(timestamp), "%Y-%m")) %>%
filter(venueType %in% c("Pub","Restaurant")) %>%
filter(YearMon == "2022-03")
write_rds(checkin,"data/checkinMar.rds")
checkinMar <- read_rds("data/checkinMar.rds")
check_selected <- checkinMar %>%
filter(participantId %in% c(38,633,584,49,931,765)) %>%
mutate(venueId = as.character(factor(venueId))) %>%
mutate( participantId= as.character(factor(participantId)))
check_pub <- check_selected %>%
filter(venueType == "Pub")
check_rest <- check_selected %>%
filter(venueType == "Restaurant")
We select his/ her the most frequent showing up site from the log to cross check with other linked participants. The most popular entertainment space to 49 is Restaurant 1805 and Pub 1799. Then following bar charts mention that, the two people 931 and 765 who contact the most frequent with participant 49, appearing at 49 favorite Pub and Restaurant the most correspondingly as well.
ggplot(data = check_pub,
aes(x = venueId)) +
geom_bar(fill = "steelblue4") +
ylim(0,100) +
geom_text(stat="count",
aes(label=paste0(..count.. )),
vjust=-0.8) +
xlab("Pub Id") +
ylab("No.of Check in March ") +
ggtitle("Situation of the Cluster check-in Pub 1799")+
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
axis.title.y= element_text(angle=0))+
facet_grid( participantId~.)+
theme(plot.title = element_text(size = 15, face = "bold"))
ggplot(data = check_rest,
aes(x = venueId)) +
geom_bar(fill = "steelblue4") +
ylim(0,80) +
geom_text(stat="count",
aes(label=paste0(..count.. )),
vjust=-0.8) +
xlab("Restaurant Id") +
ylab("No.of Check in March ") +
facet_grid( participantId~.)+
ggtitle("Situation of the Cluster check-in Restaurant 1805")+
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
axis.title.y= element_text(angle=0))+
theme(plot.title = element_text(size = 15, face = "bold"))
Participants who work together may get a higher chance to connect more frequently.
Participants showing off in same entertainment location, will get a higher chance to social more frequent than others.