Predicting an adult’s gender from arm circumference or upper leg length won’t solve any modern-day problem; should the need arise, it is more accurate simply to ask whether the person is male or female. However, using machine learning to predict gender is an opportunity to explore binary classification, a technique suitable for countless yes/no, heads/tails, pass/fail problems. Because these methods extrapolate to other problems, the measurement data are publicly available, and Python has mature machine learning libraries, building this model has indirect benefits.
The data come from the 2017–2020 pre-pandemic National Health and Nutrition Examination Survey (NHANES). I selected six measurements: height, weight, hip circumference, waist circumference, upper leg length, and arm circumference. Although the model performed well using only weight and hip circumference, the score improved with each additional measurement. If the model were based on only two variables, the decision boundary would be a simple line dividing male from female; with three variables it becomes a plane, and with more than three it is a hyperplane. In my opinion, the ability to include a large number of variables is one of the most appealing aspects of machine learning.
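Assembling the feature matrix might look like the following sketch. The NHANES column names (SEQN, RIAGENDR, BMXHT, and so on) are assumptions based on the survey's published codebooks, and synthetic rows stand in here for the real downloaded files, which would be merged the same way:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 8  # a handful of stand-in respondents

# Demographics file: respondent ID and gender (1 = male, 2 = female).
demo = pd.DataFrame({
    "SEQN": range(n),
    "RIAGENDR": rng.integers(1, 3, size=n),
})

# Body-measures file: the six predictors, all in survey units (cm/kg).
body = pd.DataFrame({
    "SEQN": range(n),
    "BMXHT": rng.normal(165, 10, n),     # height
    "BMXWT": rng.normal(75, 15, n),      # weight
    "BMXHIP": rng.normal(100, 10, n),    # hip circumference
    "BMXWAIST": rng.normal(95, 12, n),   # waist circumference
    "BMXLEG": rng.normal(40, 4, n),      # upper leg length
    "BMXARMC": rng.normal(32, 4, n),     # arm circumference
})

# Join demographics to measurements on the respondent ID, then keep
# the six predictor columns and the gender label.
df = demo.merge(body, on="SEQN")
features = ["BMXHT", "BMXWT", "BMXHIP", "BMXWAIST", "BMXLEG", "BMXARMC"]
X = df[features]
y = df["RIAGENDR"]
```

With the real files, `pd.read_sas` on the demographics and body-measures XPT exports would replace the synthetic DataFrames; the merge and column selection are unchanged.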
The scatter matrix above contains measurement data from approximately 9,700 respondents aged 12 years and older. Along the diagonal of the scatter matrix are histograms. Each measurement follows a roughly normal distribution: most persons, male and female, are near the average, with fewer persons symmetrically distributed above and below it in a bell-shaped curve. The pairwise scatter plots show some clustering between male and female data, most pronounced in the Hip–Weight plot and least pronounced, perhaps absent, in the Waist–Arm plot.
To build the machine learning model, the 9,700 respondents are randomly divided into two groups, a training set and a test set, using a 75:25 split. The training set includes gender labels along with the measurements, while the test set contains only the measurements. A model is fit to the labeled training data; this is referred to as supervised learning because the model learns from correctly labeled examples. The fitted model is then applied to the test data to predict gender, and the predictions are scored against the actual gender labels that were withheld from the test set. The score is tuned by adjusting the parameters of the fit against the training data, iterating to balance what is referred to as overfitting and underfitting. The K-nearest neighbors classifier achieved 82% accuracy on the test set; as the name implies, this algorithm assigns each point the label held by the majority of its nearest labeled neighbors. The highest score achieved was 92%, using a Support Vector Machine classifier, which works by constructing a decision boundary between the male and female data.
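The workflow above can be sketched with scikit-learn. This version runs on synthetic stand-in data rather than the NHANES rows, and the cluster means, feature count, and hyperparameters are illustrative assumptions, so the scores will not match the 82% and 92% reported above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Two Gaussian clusters stand in for male and female measurement rows
# (columns play the role of weight, hip, and height; values are made up).
rng = np.random.default_rng(42)
n = 500
X_male = rng.normal([85, 100, 178], [12, 8, 7], size=(n, 3))
X_female = rng.normal([70, 106, 164], [13, 10, 7], size=(n, 3))
X = np.vstack([X_male, X_female])
y = np.array(["Male"] * n + ["Female"] * n)

# 75:25 split; the test labels are held out until scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale features so distance-based methods weigh each measurement equally.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# K-nearest neighbors: majority vote among the nearest labeled neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
knn_score = knn.score(X_test_s, y_test)

# Support vector machine: fit a decision boundary between the classes.
svm = SVC(kernel="rbf", C=1.0).fit(X_train_s, y_train)
svm_score = svm.score(X_test_s, y_test)
```

Tuning, in this sketch, would mean varying `n_neighbors` for KNN or `C` and `kernel` for the SVM and re-scoring, trading off overfitting against underfitting.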
Python makes it easy to explore, query, analyze, and visualize data sets, and experimenting with machine learning yields deeper observations and insight into the data. The tools applied in this example carry over directly to other binary classification problems. A similar model could be used in product development or manufacturing, where the rows would be different part numbers. Input variables could include part complexity based on the number of subcomponents, specific part dimensions and tolerances, manufacturing processes involved, material types, vendors, subcontracted processes, order quantities, order dates, etc. The binary target variable could be any indicator of healthy operations, such as on-time delivery, profit margin, scrap, customer feedback, or defects.
Python libraries used in this exercise: pandas, numpy, matplotlib, seaborn, scikit-learn.
Glenn DiCostanzo
October 10, 2021