SCATTER DIAGRAM
A scatter diagram (also called a scatter plot or scatter graph) is a type of graph that shows the relationship between two numerical variables. It consists of dots that represent individual data points, where:
- The x-axis represents one variable.
- The y-axis represents another variable.
Each dot represents a data point with its x and y values.
Purpose of a Scatter Diagram
Identifies relationships – Shows correlation between two variables (positive, negative, or no correlation).
Detects patterns or trends – Helps in understanding how one variable changes with another.
Highlights outliers – Unusual data points that don’t follow the trend.
Types of Correlation in Scatter Plots
Positive correlation (upward trend) – As one variable increases, the other increases.
Negative correlation (downward trend) – As one variable increases, the other decreases.
No correlation – No clear pattern; the points show no linear relationship between the variables.
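The three cases above can be put into numbers with the Pearson correlation coefficient, which lies between -1 and 1. The following is a minimal sketch on synthetic data; the data, the seed, and the helper name pearson_r are assumptions made for illustration and are not part of the Iris walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the sketch is repeatable
x = np.linspace(0, 10, 200)

y_pos = 2 * x + rng.normal(0, 1, x.size)     # rises with x
y_neg = -2 * x + rng.normal(0, 1, x.size)    # falls as x rises
y_none = rng.normal(0, 1, x.size)            # generated independently of x


def pearson_r(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)
print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, none: {r_none:.2f}")
```

Values near 1 or -1 correspond to tight upward or downward clouds of points, while a value near 0 corresponds to a cloud with no linear trend.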
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
To plot a scatter diagram, we first need to load our desired dataset.
iris_data=load_iris()
iris=pd.DataFrame(data=iris_data.data,columns=iris_data.feature_names)
The line iris = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names) converts the Iris dataset into a Pandas DataFrame for easier manipulation and analysis. It passes the raw measurements in iris_data.data to pd.DataFrame() and takes the column names from iris_data.feature_names. The result is a table stored in the variable iris, which is convenient for data handling and visualization.
Next, we add a new column named 'species' to the iris DataFrame.
iris['species']=pd.Categorical.from_codes(iris_data.target,iris_data.target_names)
The code uses pd.Categorical.from_codes to create a categorical variable representing the species of each iris flower. The integer codes in iris_data.target identify the species of each row, and iris_data.target_names supplies the name that corresponds to each code. This categorical column gives later analysis and plots a readable species label for every row.
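The two steps above can be tried on a tiny hand-made table. This is a sketch only: the arrays below are hypothetical stand-ins for iris_data.data, iris_data.feature_names, iris_data.target, and iris_data.target_names, and the variable name toy is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the fields of the loaded dataset
data = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]])
feature_names = ["sepal length (cm)", "sepal width (cm)"]
target = np.array([0, 1, 2, 0])
target_names = np.array(["setosa", "versicolor", "virginica"])

# Step 1: wrap the raw array in a DataFrame with named columns
toy = pd.DataFrame(data=data, columns=feature_names)

# Step 2: each integer code indexes into target_names
# (0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica')
toy["species"] = pd.Categorical.from_codes(target, target_names)

print(toy)
print(toy["species"].cat.categories.tolist())
```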
iris.describe() #To get summary
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
The code below uses Matplotlib to draw a scatter plot of sepal length against sepal width in the Iris dataset. plt.scatter plots the data points, with colors determined by iris_data.target (one integer code per species). Axis labels and a title are added for clarity, and plt.show() displays the plot.
plt.scatter(iris['sepal length (cm)'], iris['sepal width (cm)'], c=iris_data.target)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Scatter Plot')
plt.show()
Using the code below, we define a list of colors and take the species names to use as legend labels in our scatter plot.
species_names = iris_data.target_names
colors=['red','green','blue']
The code below loops over the species in the Iris dataset, filters the rows for each one, and plots sepal length against sepal width in the custom color defined earlier. Each species gets a label, and plt.legend() displays a legend to distinguish them. Axis labels and a title are added for clarity, and plt.show() displays the plot. The result makes the differences between species in sepal dimensions easy to see.
for i, species in enumerate(species_names):
    species_data = iris[iris['species'] == species]  # Filter data for current species
    plt.scatter(species_data['sepal length (cm)'],
                species_data['sepal width (cm)'],
                c=colors[i], label=species)  # Use custom colors and label
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Scatter Plot with Custom Colors')
plt.legend()
plt.show()
In the next step, we customize the appearance and placement of the legend.
species_names = iris_data.target_names
colors = ['red', 'green', 'blue']
for i, species in enumerate(species_names):
    species_data = iris[iris['species'] == species]
    plt.scatter(species_data['sepal length (cm)'],
                species_data['sepal width (cm)'],
                c=colors[i], label=species)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Scatter Plot with Legend Below')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=len(species_names))
plt.show()
#fig.update_layout(title={'text':'Iris Scatter Plot with Legend Below','y':0.95,'x':0.5,'xanchor':'center','yanchor':'top'})
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=len(species_names)) customizes the placement and layout of the legend. It positions the legend below the plot, centered horizontally, and arranges the legend items in a single row. This keeps the legend from overlapping the plotted data.
loc='upper center' specifies which point of the legend box is pinned to the anchor position given by the next argument: here, the middle of its top edge.
bbox_to_anchor=(0.5, -0.1) gives the anchor position in axes coordinates, where (0, 0) is the bottom-left corner of the plotting area and (1, 1) is the top-right corner, so (0.5, 0) is the middle of the bottom edge. A negative y-value moves the legend below the plot.
ncol=len(species_names) sets the number of columns in the legend. By default, Matplotlib arranges items in a single column. Using len(species_names) gives one column per species, so all legend items sit in a single row.
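One practical consequence of placing the legend outside the axes is that it can be cropped when the figure is written to a file. The sketch below shows one way to handle that, under stated assumptions: it uses hypothetical group names and points, the non-interactive Agg backend, and an in-memory buffer instead of a file on disk.

```python
import io

import matplotlib
matplotlib.use("Agg")   # assumption: render without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
labels = ["group A", "group B", "group C"]   # hypothetical group names
for offset, label in enumerate(labels):
    ax.scatter([1, 2, 3], [1 + offset, 2 + offset, 3 + offset], label=label)

# Pin the legend's top-center point at x=0.5 (middle), y=-0.15 (just below the axes).
legend = ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15), ncol=len(labels))

# bbox_inches="tight" enlarges the saved area so the outside legend is not cropped.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", bbox_inches="tight")

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```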
Next, we calculate the coefficients of a linear regression (a polynomial of degree 1) of sepal width on sepal length using ordinary least squares (OLS). NumPy's np.polyfit function performs the fit, and the final argument, 1, specifies the degree of the polynomial.
ls_fit = np.polyfit(iris['sepal length (cm)'], iris['sepal width (cm)'], 1)
coefficients = ls_fit
print(coefficients)
[-0.0618848 3.41894684]
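np.polyfit returns the coefficients with the highest power first, so in the printed array the first value is the slope and the second is the intercept of the fitted line. The sketch below illustrates this on a small hypothetical dataset (the numbers are made up for illustration) and compares the result with the closed-form least-squares formulas.

```python
import numpy as np

# Hypothetical data lying close to a downward-sloping line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.6, 1.9, 1.6, 0.9, 0.6])

# Degree-1 fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Closed-form least-squares estimates for comparison
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()

# Fitted values on the regression line
y_hat = np.polyval([slope, intercept], x)

print(slope, intercept)
print(slope_cf, intercept_cf)
```

Both pairs of numbers should agree up to floating-point rounding, since np.polyfit solves the same least-squares problem.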
import warnings
warnings.filterwarnings("ignore", message="The default of observed=False is deprecated", category=FutureWarning)
iris['count'] = iris.groupby('species')['species'].transform('count')
This line of code calculates the number of occurrences of each species in the iris DataFrame and adds this information as a new column called 'count'.
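The difference between transform('count') and an ordinary aggregation is that transform returns one value per original row rather than one per group. A minimal sketch on a hypothetical five-row table (the name toy and its contents are assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {"species": ["setosa", "setosa", "virginica", "setosa", "virginica"]}
)

# transform('count') broadcasts each group's size back to every row of that group,
# so the result lines up with the original index and can be assigned as a column.
toy["count"] = toy.groupby("species")["species"].transform("count")

print(toy)
```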
import plotly.express as px loads Plotly Express, a high-level interface for creating interactive visualizations, under the conventional alias px. The remaining code creates an interactive scatter plot of sepal length against sepal width in the Iris dataset, with customizations for color, symbols, trendline, hover data, and title, and then displays the plot. Note that trendline="ols" requires the statsmodels package to be installed.
import plotly.express as px
fig = px.scatter(iris, x='sepal length (cm)',y='sepal width (cm)',
color='species',
symbol='species',
trendline="ols",
trendline_color_override="black",
hover_data=['species', 'count'],
title="Iris Data")
fig.show()
symbol='species' differentiates data points visually by species using different shapes. trendline_color_override="black" sets the trendline color to black. hover_data=['species', 'count'] displays species and count information when hovering over data points.
To color the regression line to match the species, we omit the code trendline_color_override="black".
import plotly.express as px
fig = px.scatter(iris, x='sepal length (cm)', y='sepal width (cm)', color='species', symbol='species',
trendline="ols",hover_data=['species', 'count'],title="Iris Data")
fig.show()
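When color is set, Plotly Express fits a separate trendline for each group instead of one line through all the points. The sketch below reproduces that idea with np.polyfit on hypothetical two-group data (the table toy and its values are assumptions, not the Iris measurements). It also shows that a fit over the pooled data can have a different slope from the fits within each group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: within each group y rises with x,
# but group "b" sits lower and further to the right than group "a".
toy = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [5.0, 5.5, 6.0, 6.5, 2.0, 2.5, 3.0, 3.5],
})

# One line through all points, ignoring the grouping
pooled_slope, pooled_intercept = np.polyfit(toy["x"].to_numpy(), toy["y"].to_numpy(), 1)

# One line per group, analogous to one trendline per color
group_fits = {}
for name, rows in toy.groupby("group"):
    slope, intercept = np.polyfit(rows["x"].to_numpy(), rows["y"].to_numpy(), 1)
    group_fits[name] = (slope, intercept)

print("pooled slope:", pooled_slope)
print("per-group fits:", group_fits)
```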
Next, we customize the colors of the regression lines in the Plotly Express figure according to the species they represent, using a color_map dictionary that defines the desired color for each species.
import plotly.express as px
fig = px.scatter(
iris,
x='sepal length (cm)',
y='sepal width (cm)',
color='species',
symbol='species',
trendline="ols",
hover_data=['species', 'count'],
title="Iris Data"
)
color_map = {'setosa': 'blue', 'versicolor': 'red', 'virginica': 'green'}
for species, color in color_map.items():
    fig.update_traces(
        selector={'name': species}, line_color=color)
fig.show()
1. for species, color in color_map.items(): Starts a loop over the key-value pairs in the color_map dictionary.
2. fig.update_traces(selector={'name': species}, ...): Selects the traces whose name matches the current species in the loop; only those traces are updated.
3. line_color=color: Sets the line_color property of the selected traces to the color from the color_map dictionary. The markers keep their existing colors, because marker color is a separate property.
Then, apply the update_layout method to customize the plot's layout, specifically the legend and title.
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=-0.2,
xanchor="center",
x=0.5 ),
title={'text': "Iris Data with Custom Regression Line Colors",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'
}
)
fig.show()
1. legend=dict(...): Customizes the legend.
orientation="h": Sets the legend orientation to horizontal.
yanchor="bottom": Uses the bottom edge of the legend box as the reference for its y position.
y=-0.2: Places that reference edge below the plotting area (0 is the bottom of the plotting area and 1 is the top).
xanchor="center": Uses the horizontal center of the legend box as the reference for its x position.
x=0.5: Places that reference point halfway across the plotting area, which centers the legend.
2. title={'text': ..., ...}: Customizes the title.
'text': "Iris Data with Custom Regression Line Colors": Sets the title text.
'y':0.9: Sets the vertical position of the title as a fraction of the figure height (0 is the bottom and 1 is the top).
'x':0.5: Sets the horizontal position of the title as a fraction of the figure width (0.5 is the middle).
'xanchor': 'center': Uses the horizontal center of the title text as the reference for its x position.
'yanchor': 'top': Uses the top of the title text as the reference for its y position.