This is the second take-home exercise in a series of take-home exercises for the Visual Analytics module.In this exercise, we will learn from peer and remake graphs in a more readable way of city’s demographic data.
The take-home exercise provides students the opportunity to revise and practice the R packages and programming skills we learnt in-class at home. This time,the exercise requires students to be innovative and creative by applying appropriate R packages to design enlightening and yet functional data visualization for analytics purposes. Students are encouraged to create multiple data visualization and compare our pros and cons before finalizing the best design.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
The chunk code below will do the trick.
packages = c('tidyverse','rvest','reshape2','ggtern','ggrepel','ggpubr','stringr','lubridate','ggthemes','ggplot2','webr','ggdist','hrbrthemes')
for(p in packages){
if(!require(p, character.only =T)){
install.packages(p)
}
library(p, character.only =T)
}
The code chunk below import Participants.csv from the data
folder into R by using read_csv()
of readr
and save it as a tibble data frame called participants.
participants <- read_csv("data/Participants.csv")
glimpse(participants)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2…
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",…
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", …
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380…
The original version shows percentage of household size by bar
chart.To zoom in, the main idea of this graph is to show up the
different percentages of having kids and not having kids among the
participants. And the fractions of not having kids is around 30%.
Missing ratio
However, the percentage information was not displayed in group just
describing in title.
Unclear summarization
In addition, actually the size of households having kids is all 3,
meanwhile the sizes of households without kids are 1 and 2
respectively.
To solve the problem of original design, using webr package
to build a combination of pie chart and donut plot.
Convert householdSive from numerical to character for further visualization, and make is more readable.
participants$householdSize[which(participants$householdSize == 1)]<- "Size 1"
participants$householdSize[which(participants$householdSize == 2)]<- "Size 2"
participants$householdSize[which(participants$householdSize == 3)]<- "Size 3"
participants$participantId <- as.numeric(participants$participantId)
hh_kid <- participants %>% group_by(householdSize, haveKids) %>% count(haveKids)
print(hh_kid)
# A tibble: 3 × 3
# Groups: householdSize, haveKids [3]
householdSize haveKids n
<chr> <lgl> <int>
1 Size 1 FALSE 337
2 Size 2 FALSE 373
3 Size 3 TRUE 301
PieDonut chart using webr package
PieDonut(hh_kid, aes(haveKids, householdSize, count=n),r0 = 0.7, r1=1.5, r2 = 1.9,explode = 2,
titlesize = 4,
title = "No. of People Who Have/Not Have Kids
in Different Household Size")
The main idea of original bar chart is to show basic information of
household size by count and highlight the proportion of size 2 is the
largest one. The observed fuzziness are as below:
Descending bar chart using webr package and geom_bar as below:
ggplot(data = participants,
aes(x=reorder(householdSize, householdSize, function(x)-length(x)))) +
geom_bar(fill = "steelblue4") +
ylim(0,400)+
geom_text(stat="count",
aes(label=paste0(..count.., " (",
round(..count../sum(..count..)*100,
1), "%)")),
vjust=-1) +
xlab("Household Size") +
ylab("No. of\nParticipants") +
ggtitle("Household size distribution")+
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
axis.title.y= element_text(angle=0))
The original design is trying to show up and compare joviality
distributions by different age ranges.
Show up statistical evidence
Display mean point and a red dashed line to display overall average
joviality score for easier measurement and comparison between age
ranges.
Remain distribution using better chart
Combining boxplot and violin charts to remain original purpose.
According to violin chart we still could identify same distribution
characters as histogram chart shows.
Distribution by age range using geom_violin and geom_boxplot as below:
ggplot(data=participants,
aes(y = joviality*100, x= Age_range)) +
ggtitle("Joviality Distribution by Age range ",subtitle = "Elder people is much unhappy") +
geom_violin(fill="skyblue",alpha=0.6) +
geom_boxplot(notch=TRUE, alpha=0.4) +
xlab("Age Range") +
ylab("Jovility Score") +
stat_summary(geom = "point",fun="mean",colour ="black",size=2.7) +
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey')) +
geom_hline(aes(yintercept=mean(joviality*100)),color="red",linetype="dashed",size = 1) +
geom_text(aes(x=4.3, label="Average", y=55), colour="red",text=element_text(size=5))
Lack of confidential proof
Although the there is a vertical line in this whole raincloud graph,
there isn’t any sufficient evidence for reference and indicate that
high-educated participants are more happier.
Plus, the distribution of raincloud is not easily to read, the
supportive key point such as average score is missing.
Redundant chart
The dots below raincloud graph didn’t provide useful information for
further analysis. And it is kind of distraction for eyes and make
confusion.
Show up statistical evidence
Adding average position with dashed line to display overall average
joviality score for easier understanding and comparison between
education levels.
And the average amount is more meaningful than just picking up value
50.
Remove meaningless chart
Remove helpless dots part then add more meaningful boxplot to remain
joviality distribution.
Resort by descending order Resort by logical
order is more readable for eyes, and more clearly to clarify the
insights.
In this condition, sort by education level descending order is much
easier to get the conclusion, which is high-educated participant is
happier.
Joviality distribution by education using stat_halfeye and geom_boxplot as below:
participants$educationLevel2 <- factor(participants$educationLevel,levels = c("Low", "HighSchoolOrCollege","Bachelors","Graduate" ))
ggplot(participants,
aes(x= educationLevel2, y= joviality*100)) +
stat_halfeye(adjust = .35,
width = .8,
color = 'dodgerblue3',
justification = -.15,
position = position_nudge(x = .12)) +
geom_boxplot(width = .12, outlier.color = NA ) +
geom_hline(aes(yintercept = mean(participants$joviality*100)),
linetype= 'dashed',color= 'red',size= 0.8) +
geom_text(aes(x=4.9, label="Average", y=50),
colour="red",text=element_text(size=2)) +
coord_flip() +
labs(y = 'Joviality Score',x= 'Education Leve',
title = 'Joviality Distribution in by Education Level') +
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
axis.ticks.y = element_blank(),
panel.grid.major = element_line(size= 0.2, color = "grey"))
This makeover take-home exercise provides an excellent opportunity to criticize submission in terms of clarity and aesthetics meanwhile it’s a good chance to learn from peer work. My key takeaways are:
Choose proper chart can tell a better
story
Choose a proper visualization chart could make our insight clearer and
understandability.Selecting a better chart requires us to be familiar
with data characters and the insights we want to declare. Most of time,
given choices using others charts and making comparison could help us to
figure it out.
Expose supportive information
It is intuitive showing up reference value for displaying insights. And
it’s a kind of supporting for reviewer to better understanding
conclusion. Plus, this information would provide clues for further
analysis and judgement.
Exclude meaningless chart
Remove confused and redundant charts would make canvas clear and tidy so
that our eyes would not distract by interference information. Also it
would make reviewer easily get the points we try to expose.
Use appropriate theme and color The theme, color,label,ticks these supportive portions are important to our visualization.To decrease distraction and highlight insights, using proper theme, color, ticks, labels, titles could make the canvas more readable.