Introduction
Gender identification based on names is an important problem for many applications, including online customer support, marketing, and finance. Given the number of gender options and the variability of languages, it can be difficult to build a name-based gender classification system that is accurate across all languages. This article discusses how NLP and Python can solve this problem. Here we will deal with identifying gender based on Indian names.
By the end of this article, you will have learned how to:
Use Python NLP tooling to convert raw text into vector representations.
Build a name-based gender identification model using an NLP pipeline.
Use various ML, NLP, and deep learning algorithms and identify the best-performing one.

This article was published as a part of the Data Science Blogathon.
Table of Contents
Problem Statement
Prerequisites
Proposed Solution
Description of the Dataset
The Intuition of the Algorithms
Methodology
Code Implementation
LSTM
Conclusion
Problem Statement
The business problem that we are going to solve, following the NLP pipeline steps, is:
“Given the name, identify the gender of the person”
Prerequisites
This is a beginner-level NLP project and requires an understanding of the following concepts:
Python
Pandas library for data handling
Matplotlib or Seaborn for data visualizations
Basics of machine learning and deep learning algorithms
Proposed Solution
The proposed solution to this problem is to build a name-based gender identification system that combines deep learning and machine learning. Although we know there are more than two genders, we will only consider ‘Male’ and ‘Female,’ so this becomes a binary classification model.
Description of the Dataset
For this project, we will use the Gender_Data dataset available on Kaggle.
This dataset contains a total of 53,982 Indian names (53,925 of them unique), of which 29,014 are male and the remaining are female. The ‘Gender’ attribute contains the values 0 and 1: 0 corresponds to a male name, while 1 represents a female name.
The Intuition of the Algorithms
In this section, we will look at the NLP concepts and other topics that we will use in building this project.
Label Encoding: This refers to the process of converting categorical labels into numeric labels. Here each categorical label is given a specific value based on its alphabetical ordering.
Count Vectorization: Count vectorization is the process where all the words in the corpus are converted into numerical data based on their frequency in the corpus. It converts textual data into a sparse matrix. Let us vectorize the given example:
text = ['this is an example', 'An ant ate the apple']
      this  is  an  example  ant  ate  the  apple
0      1    1   1     1       0    0    0     0
1      0    0   1     0       1    1    1     1
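As a minimal sketch (assuming scikit-learn is available), the same kind of matrix can be produced with CountVectorizer; note that scikit-learn orders the columns alphabetically rather than by first appearance:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is an example', 'An ant ate the apple']

# CountVectorizer lowercases by default, so 'An' and 'an' share a column
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['an' 'ant' 'apple' 'ate' 'example' 'is' 'the' 'this']
print(matrix.toarray())
# [[1 0 0 0 1 1 0 1]
#  [1 1 1 1 0 0 1 0]]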
Logistic Regression
Logistic regression is one of the most commonly used machine learning algorithms for solving classification problems. It is used to predict the probability of a data point belonging to a certain class, for example class 0 or class 1. It works based on the sigmoid function: logistic regression passes the linear regression output through the sigmoid function, producing an “S”-shaped curve. A threshold (typically 0.5) is then used to distinguish the classes.
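As an illustrative sketch of this decision rule (the score value below is made up, not taken from the article), the sigmoid and thresholding step look like this:
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

z = 1.2                          # hypothetical linear score w.x + b
probability = sigmoid(z)         # ~0.77
label = int(probability >= 0.5)  # threshold at 0.5 -> class 1
print(probability, label)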

Naïve Bayes
Naïve Bayes is a supervised learning algorithm widely used to classify texts and high-dimensional training data. It is capable of making very quick decisions and hence takes minimal training and testing time. It is named ‘Naïve’ because it assumes that the occurrence of one value is entirely independent of the other values in the dataset. It works based on Bayes’ theorem, which is as follows:
P(A|B) = [P(A) * P(B|A)] / P(B)
where P(A|B) is the posterior probability, P(B) is the marginal probability, P(A) is the prior probability, and P(B|A) is the likelihood.

Naive Bayes is a fast and simple algorithm that can be used for both binary and multiclass classification problems. However, it presumes that the dataset’s features are uncorrelated, which makes it hard for the model to capture relationships between variables.
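As a made-up numeric illustration of Bayes’ theorem (the figures below are invented for the example, not taken from the dataset): suppose 54% of names are male, 30% of male names end in ‘n’, and 18% of all names end in ‘n’. Then
P(male | ends in ‘n’) = [P(male) * P(ends in ‘n’ | male)] / P(ends in ‘n’) = (0.54 * 0.30) / 0.18 = 0.90
so a name ending in ‘n’ would be classified as male with 90% posterior probability.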
XGBoost
XGBoost is one of the strongest machine learning algorithms in use today. It stands for eXtreme Gradient Boosting. It builds an ensemble of gradient-boosted decision trees, where each new tree corrects the errors of the previous ones. XGBoost is fast, efficient, and scalable, making it a popular choice for those who need to train large models quickly.
LSTM
In machine learning, a Long Short-Term Memory (LSTM) network is a recurrent neural network that is very useful for sequence tasks such as machine translation and speech recognition. Unlike a plain recurrent network, an LSTM uses gated memory cells that let it selectively remember or forget information, so it can carry context across long sequences.
LSTMs are a specialized architecture for tasks where something from one step (context) must be remembered and used in a later step. For example, you may want to model how someone speaks by remembering past sentences and using that information to generate the next sentence.
LSTMs are especially useful when you have many similar sequential inputs, such as the characters in a name or the words in a text. With enough training data, an LSTM can learn to produce accurate outputs for such sequences, which is why they are so popular for tasks like machine translation and text classification.
Methodology
The work pipeline involved in this project is as follows:
Import libraries.
Load the dataset.
Exploratory data analysis.
Encoding the labels.
Count vectorization of predictor text values.
Splitting the dataset into training and testing sets.
Building models using logistic regression, naive Bayes, and XGBoost.
Comparing the results of the above models.
Building an LSTM model.
Saving the model for further use.
Code Implementation
Step 1. Import Libraries
We first have to import the necessary libraries to work with the data and build a solution. Our project will use NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, and Keras.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS  # STOPWORDS is used by the word cloud later
Step 2. Load the Dataset
dataset = pd.read_csv(r"C:\Users\admin\Desktop\Python_anaconda\Projects\Name Gender\Gender_Data.csv")
Step 3. Exploratory Data Analysis
Now that we have our data ready, let us look into it to better understand what we will be working with.
Sample of the Dataset
dataset.head()

Column Names and Data Types of the Attributes
Knowing the data type of each attribute or column in the dataset helps decide what kind of pre-processing should be done.
print(dataset.columns)
print(dataset.dtypes)

We see that there are two attributes in the dataset. The ‘Name’ attribute corresponds to the name of the person, and the ‘Gender’ column represents whether they are male or female.
Replacing Column Values
Here, 0 and 1 in the ‘Gender’ column refer to male and female, respectively. However, for convenience, we will replace them with ‘M’ and ‘F.’
dataset['Gender'] = dataset['Gender'].replace({0: "M", 1: "F"})
The Shape of the Data
print(dataset.shape)
Running the above code snippet shows us that there are a total of 53,982 rows and 2 columns. That is, there are 53,982 names.
No. of Unique Names and Checking for Class Imbalance
print(len(dataset['Name'].unique()))
Among the 53,982 Indian names, there are 53,925 unique names, implying that 57 values are repeated. These are names that are used for both boys and girls and have hence been labeled multiple times.
Let us create a plot to see how many male and female names are present in the dataset.
sns.countplot(x='Gender', data=dataset)
plt.title('No. of male and female names in the dataset')
plt.xticks([0, 1], ('Female', 'Male'))

It is evident from the above graph that there is no major class imbalance.
Analyzing the Starting Letter of Names
Usually, a few letters are far more common than others as the first letter of a name. Our dataset lets us see the distribution of starting letters across the English alphabet.
alphabets = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
             'Q','R','S','T','U','V','W','X','Y','Z']
startletter_count = {}
for i in alphabets:
    startletter_count[i] = len(dataset[dataset['Name'].str.startswith(i)])
print(startletter_count)
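As an aside, the same counts can be obtained more compactly with the pandas string accessor (an alternative sketch, not the approach used in this article):
# First-letter frequencies in one line using the .str accessor
startletter_counts = dataset['Name'].str[0].value_counts()
print(startletter_counts.head())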

Visualizing the above information using a bar chart shows that around 6,000 names start with the letter “A”.
plt.figure(figsize=(16, 8))
plt.bar(startletter_count.keys(), startletter_count.values())
plt.xlabel('Starting alphabet')
plt.ylabel('No. of names')
plt.title('Number of names starting with each letter')

Let us see which letters most names start with.
print('The 5 most common name starting letters are: ',
      *sorted(startletter_count.items(), key=lambda item: item[1])[-5:][::-1])

Most Indian names start with the letters A, S, K, V, and M.
Analyzing the Ending Letter of Names
Similarly, let us now see which ending letters are common and how they are distributed across the names in the dataset.
small_alphabets = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
                   'n','o','p','q','r','s','t','u','v','w','x','y','z']
endletter_count = {}
for i in small_alphabets:
    endletter_count[i] = len(dataset[dataset['Name'].str.endswith(i)])
print(endletter_count)

plt.figure(figsize=(16, 8))
plt.bar(endletter_count.keys(), endletter_count.values())
plt.xlabel('Ending alphabet')
plt.ylabel('No. of names')
plt.title('Number of names ending with each letter')

The above bar graph depicts that roughly 16,000 and 14,000 names end with the letters “a” and “n,” respectively.
print('The 5 most common name ending letters are: ', *sorted(endletter_count.items(),
      key=lambda item: item[1])[-5:][::-1])
Executing the above code gives us the following output:

Hence, most of the names end with the letters “a,” “n,” “i,” “h,” and “r.”
Word Cloud
Word clouds help us visualize textual data. We will build a word cloud representing the names in the dataset; the size of each name depends on its frequency in the dataset.
# building a word cloud
text = " ".join(i for i in dataset.Name)
word_cloud = WordCloud(
    width=3000,
    height=2000,
    random_state=1,
    background_color="white",
    colormap="BuPu",
    collocations=False,
    stopwords=STOPWORDS,
).generate(text)
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

We can see that names starting with the letter ‘A’ are prominent in the word cloud. This supports our earlier analysis that most of the names in the dataset start with the letter ‘A’.
Step 4. Building the Models
First, let us define the predictor variable ‘X’ and the target variable ‘Y.’ In our binary classification problem, ‘Name’ is the predictor, while ‘Gender’ is the target attribute: we need to determine the gender based on the name.
X = list(dataset['Name'])
Y = list(dataset['Gender'])
Encode the Labels
Now, we use the LabelEncoder class from scikit-learn to convert the ‘F’ and ‘M’ labels into a machine-readable format.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)
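To verify which integer was assigned to each label, we can inspect the fitted encoder (a quick sanity check, not part of the original pipeline):
# classes_ is sorted alphabetically; a label's index is its encoded integer
print(encoder.classes_)               # ['F' 'M'] -> F=0, M=1
print(encoder.transform(['F', 'M']))  # [0 1]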
Count Vectorization
We vectorize the names to make the modeling process easier. With analyzer='char', each name in ‘X’ is transformed into a vector of character counts rather than word counts.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='char')
X = cv.fit_transform(X).toarray()
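Because the analyzer works at the character level, every column of X corresponds to a single character seen in the corpus. We can confirm this with an optional check:
# Each feature is one character observed across all names
print(cv.get_feature_names_out())  # e.g. [' ' 'a' 'b' ... 'z']
print(X.shape)                     # (number of names, number of distinct characters)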
Splitting the Dataset
Now that our target and predictor variables are ready for modeling, we split the dataset into training and testing sets. We split the data so that 33% of it is allotted for testing while the rest is used for training the models.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
Logistic Regression
Here, we first build and test all the models and then evaluate their performance. The first algorithm we will use is logistic regression. First, we import the LogisticRegression class from Scikit-learn and create a model with it. Next, we fit x_train and y_train to the model for training. Finally, we test the model on the test set we created earlier.
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression()
LR_model.fit(x_train, y_train)
LR_y_pred = LR_model.predict(x_test)
Naive Bayes
The pipeline for building the models remains the same.
from sklearn.naive_bayes import MultinomialNB
NB_model = MultinomialNB()
NB_model.fit(x_train, y_train)
NB_y_pred = NB_model.predict(x_test)
XGBoost
from xgboost import XGBClassifier
XGB_model = XGBClassifier(use_label_encoder=False)
XGB_model.fit(x_train, y_train)
XGB_y_pred = XGB_model.predict(x_test)
Comparison of Performance
To compare the models’ performance, we will use accuracy as the evaluation measure and also build a confusion matrix for each model to see how many right and wrong predictions it made.
# function for plotting a confusion matrix
from sklearn.metrics import confusion_matrix
def cmatrix(model):
    y_pred = model.predict(x_test)
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    sns.heatmap(cm, fmt='d', cmap='BuPu', annot=True)
    plt.xlabel('Predicted Values')
    plt.ylabel('Actual Values')
    plt.title('Confusion Matrix')
import sklearn.metrics as metrics
# for logistic regression
print(metrics.accuracy_score(y_test, LR_y_pred))
print(metrics.classification_report(y_test, LR_y_pred))
cmatrix(LR_model)


# for naive bayes
print(metrics.accuracy_score(y_test, NB_y_pred))
print(metrics.classification_report(y_test, NB_y_pred))
cmatrix(NB_model)


# for XGBoost
print(metrics.accuracy_score(y_test, XGB_y_pred))
print(metrics.classification_report(y_test, XGB_y_pred))
cmatrix(XGB_model)


Looking at the above outputs, the accuracy of logistic regression is 71%. It classified around 3,000 female names as male and 2,300 male names as female. Naive Bayes performed considerably worse than logistic regression, with only 65% testing accuracy.
Of the three algorithms, XGBoost performed best. It had a fairly good accuracy of 77%, with 4,343 wrong predictions out of 17,815 testing samples.
LSTM
Although we obtained good accuracy using XGBoost, we can further improve the classification using deep learning models. LSTM is one of the most widely used neural networks for text classification. We will build an LSTM network for gender classification and test its performance on our data.
Import Necessary Libraries
Building an LSTM network requires deep learning libraries such as TensorFlow and Keras.
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, Dense, Dropout, LSTM
Defining the LSTM Layers
max_words = 1000  # assumed vocabulary size for the embedding layer
max_len = 26      # input vector length; should match the number of features in X (X.shape[1])

LSTM_model = Sequential()
LSTM_model.add(Embedding(max_words, 40, input_length=max_len))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(LSTM(100))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(Dense(64, activation='relu'))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(Dense(1, activation='sigmoid'))
LSTM_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(LSTM_model.summary())

Training
Now that we have built the network, we will train it using x_train and y_train. We will use 100 epochs to give the model enough passes over the data to generalize well.
LSTM_model.fit(x_train, y_train, epochs=100, batch_size=64)
This step will take some time to run.
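Training for a fixed 100 epochs can overfit; as a hedged variant (not part of the original article), one could hold out a validation split and stop early when the validation loss stops improving:
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
LSTM_model.fit(x_train, y_train, epochs=100, batch_size=64,
               validation_split=0.1, callbacks=[early_stop])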

The above picture shows only the last part of the output. We can see that the LSTM has achieved an accuracy of 85%, which is 8% higher than XGBoost. Let us define a function that takes any name as input and classifies it using this LSTM model.
def predict(name):
    # vectorize the input name with the same CountVectorizer used for training
    name_samplevector = cv.transform([name]).toarray()
    prediction = LSTM_model.predict(name_samplevector)
    if prediction >= 0.5:
        out = 'Male ♂'
    else:
        out = 'Female ♀'
    print(name + ' is a ' + out)
Sample Test
predict('Yamini Ane')

We can see that the model has predicted the name ‘Yamini Ane’ as female. However, there could be some cases where the model makes wrong predictions. This could be because only Indian names were used for training the model.
Finally, we will save this LSTM model for further use.
import pickle
with open('LSTM_model.pkl', 'wb') as f:
    pickle.dump(LSTM_model, f)
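Pickling a Keras model can be fragile across library versions; Keras’ native save/load API is a safer alternative (a sketch, with a hypothetical filename):
# Save and reload using Keras' own serialization
LSTM_model.save('LSTM_model.h5')

from tensorflow.keras.models import load_model
restored_model = load_model('LSTM_model.h5')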
Conclusion
This brings us to the end of the name gender classification project. Let us review our work. First, we defined our problem statement and looked into the algorithms we were going to use and the NLP implementation pipeline. Then we moved on to practically implementing the identification and classification of gender based on names using logistic regression, naïve Bayes, and XGBoost. Next, we compared the performance of these models. Finally, we built an LSTM network and showed that it works best for name-based gender identification problems.
The key takeaways from this NLP project are:
Identification of gender using names is important for many businesses.
XGBoost gives better accuracy than logistic regression and naïve Bayes when used for gender classification problems.
LSTM is a recurrent neural network that works well for text classification.
The LSTM provides an accuracy of 85%, giving the most accurate results of the models tried.
I hope you like my article on “Name Gender Classification Using NLP and Python.” The entire code can be found in my GitHub repository. You can connect with me on LinkedIn.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.