Sankey Diagram
Let’s start with what the Sankey diagram is for. A Sankey diagram is a flow diagram in which the width of each link is proportional to the quantity that flows through it. Anyone working in an IT department comes across a lot of transaction data. It might not mean much to them, but someone working in the Marketing or Process department can use it to see the whole picture.
Now let’s talk about the kind of data involved. Generally, it consists of customer events or transactions (clicks, pages, waiting durations), like the sample shown below.
As for how these departments use the Sankey diagram: Marketing Analytics has a basic need to follow the customer journey after a campaign communication. The team tries to gain new insights from the data, especially on the number of communication packages purchased, the prediction of customer churn, and the purchase funnel, so that the business stays sustainable over the years. There are many Mar-Tech tools that can draw this type of diagram, such as Dengage, Power BI, and Tableau. These diagrams also help with business process mining, where KPIs such as the average duration between events and bottlenecks play a key role.
The most famous Sankey diagram in history is Charles Minard’s Map of Napoleon’s 1812 Russian Campaign. I have shared an example below.
So, if we want to draw this diagram ourselves, how hard might it be, and what kind of problems await us? The Plotly library, which can draw Sankey diagrams and is frequently used in Python, expects its input in a certain format, and wrangling the data into that format can be challenging. Below I share basic code that handles it.
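As a rough illustration of the input format in question: Plotly's Sankey trace takes a list of node labels plus three parallel lists for the links (source index, target index, value), where the indices point into the label list. The sketch below converts labelled edges into those lists. The helper name `to_sankey_lists` and the sample edges are made up for this example.

```python
def to_sankey_lists(edges):
    """Turn (source_label, target_label, count) triples into index-based lists.

    Hypothetical helper for illustration; not part of Plotly.
    """
    labels = []
    index_of = {}
    # Give every label a number in order of first appearance.
    for source, target, _ in edges:
        for name in (source, target):
            if name not in index_of:
                index_of[name] = len(labels)
                labels.append(name)
    sources = [index_of[s] for s, _, _ in edges]
    targets = [index_of[t] for _, t, _ in edges]
    values = [v for _, _, v in edges]
    return labels, sources, targets, values


# Made-up edges: label of the source, label of the target, number of customers.
edges = [("home", "search", 3), ("search", "purchase", 1), ("home", "purchase", 2)]
labels, sources, targets, values = to_sankey_lists(edges)
print(labels, sources, targets, values)
```

These four lists would then be passed as `node=dict(label=labels)` and `link=dict(source=sources, target=targets, value=values)` to `go.Sankey`.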
Example data sample:
Subscriber: Customer ID
Event: Customer’s click path
Date: Timestamp
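For readers who want something to experiment with, here is a small made-up dataframe with the three fields above. The column names `SUBSCRIBER` and `EVENT` match the defaults used later in the function; the column name `DATE` and all the values are assumptions for this sketch.

```python
import pandas as pd

# Made-up sample rows, deliberately out of chronological order.
df = pd.DataFrame({
    "SUBSCRIBER": [101, 101, 102, 101, 102],
    "EVENT": ["search", "home", "home", "purchase", "search"],
    "DATE": pd.to_datetime([
        "2021-01-01 10:02:00",
        "2021-01-01 10:00:00",
        "2021-01-02 09:00:00",
        "2021-01-01 10:10:00",
        "2021-01-02 09:04:00",
    ]),
})

# Sort so that each subscriber's events appear in the order they happened.
df = df.sort_values(["SUBSCRIBER", "DATE"]).reset_index(drop=True)
print(df)
```

Sorting by date first matters because a click path is only meaningful when the events of each customer are in chronological order.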
The function below transforms the dataframe into a format suitable for drawing the diagram.
— — — — — — — — — — — — — — — — — — — — — — — — — —
import pandas as pd
import plotly.graph_objects as go


def sankey(Dataframe, X="SUBSCRIBER", Y="EVENT", n=8, head="Sankey Diagram"):
    # Collect each subscriber's events into one list, in row order.
    # Rows are assumed to be sorted by date already.
    paths = Dataframe.groupby(X, sort=False)[Y].apply(list)

    # Grouping of 2 elements: pair every event with the one that follows it.
    b = []
    for events in paths:
        for source, target in zip(events[:-1], events[1:]):
            b.append([source, target])

    df_new = pd.DataFrame(b, columns=["Source", "Target"])
    # Adding a new column as Source-Target (used only for counting).
    df_new["Values"] = df_new["Source"] + "-" + df_new["Target"]

    # Grouping by Source:
    df_new_source = df_new.groupby(["Source"])["Values"].count().reset_index()
    # Grouping by Target:
    df_new_target = df_new.groupby(["Target"])["Values"].count().reset_index()
    # Renaming so that both tables can be stacked (UNION):
    df_new_target.rename(columns={"Target": "Source"}, inplace=True)
    df_new4 = pd.concat([df_new_source, df_new_target], ignore_index=True)

    # After concatenating, regrouping and sorting by total count:
    df_new5 = (df_new4.groupby(["Source"])["Values"].sum().reset_index()
               .sort_values(ascending=False, by=["Values"]))
    # Sorting column, so the event with the highest count gets number 0.
    df_new5["Number"] = range(len(df_new5))
    # Keep only the n most frequent events as nodes.
    df_new5 = df_new5[df_new5["Number"] < n]

    # Dictionary from event name to node number.
    dicti = dict(zip(df_new5["Source"], df_new5["Number"]))

    # Counting each Source-Target pair:
    df_new2 = df_new.groupby(["Source", "Target"])["Values"].count().reset_index()
    # Keep only links whose two ends are both among the n nodes;
    # otherwise unmapped event names would be passed to Plotly as indices.
    keep = df_new2["Source"].isin(df_new5["Source"]) & df_new2["Target"].isin(df_new5["Source"])
    df_new2 = df_new2[keep].copy()
    df_new2["Source"] = df_new2["Source"].map(dicti)
    df_new2["Target"] = df_new2["Target"].map(dicti)

    # Drawing the Sankey diagram with Plotly:
    fig = go.Figure(data=[go.Sankey(
        arrangement="snap",
        node=dict(
            thickness=5,
            line=dict(color="green", width=0.1),
            label=list(df_new5["Source"]),
            color="blue",
        ),
        link=dict(
            # indices correspond to labels
            source=list(df_new2["Source"]),
            target=list(df_new2["Target"]),
            value=list(df_new2["Values"]),
        ),
    )])
    fig.update_layout(title_text=head, font_size=10)
    fig.show()
— — — — — — — — — — — — — — — — — — — — — — — — —
Testing:
Let’s check the function
sankey(Dataframe=df, X="SUBSCRIBER", Y="EVENT", n=5, head="Sankey Diagram")
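One way to sanity-check the picture, sketched here with made-up data, is to count the transitions directly and compare them with the link widths in the diagram. This uses `shift(-1)` within each subscriber, which is an alternative to the pairing loop in the function, not what the function itself does. The names `df_check` and `NEXT_EVENT` are hypothetical.

```python
import pandas as pd

# Made-up events, already in chronological order per subscriber.
df_check = pd.DataFrame({
    "SUBSCRIBER": [1, 1, 1, 2, 2],
    "EVENT": ["home", "search", "purchase", "home", "search"],
})

# Next event of the same subscriber; the last event of each subscriber gets NaN.
df_check["NEXT_EVENT"] = df_check.groupby("SUBSCRIBER")["EVENT"].shift(-1)

# Count how many times each event is followed by each next event.
transitions = (
    df_check.dropna(subset=["NEXT_EVENT"])
    .groupby(["EVENT", "NEXT_EVENT"])
    .size()
)
print(transitions)
```

Each count should match the value of the corresponding link, as long as both events are among the `n` nodes kept by the function.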