Temperature Data Visualization in Python
Daily high/low temperatures for 2020 compared with the preceding 20 years for central Connecticut. The solid red and blue circles on the chart mark days in 2020 that set new record highs or lows relative to that 20-year span.
The chart’s layout is from a University of Michigan course assignment on data visualization using Python, part of the Applied Data Science with Python Specialization offered on Coursera. The data are from NOAA’s Climate Data website (https://www.ncdc.noaa.gov/cdo-web/) and represent measurements from 5 separate weather stations in Connecticut. The data include station location, date, and daily minimum/maximum temperatures. The values are provided in chronological order, one station at a time, from January 1, 2000 to December 25, 2020.
With an interest in Data Science, Machine Learning, and Artificial Intelligence, creating the chart was a useful exercise. Although I haven’t tried to re-create it in Excel, doing so would be significantly more cumbersome than in Python: the volume of data and the manipulation required to identify and plot selected points would be burdensome. Python has a steeper learning curve than Excel, but once learned it simplifies data manipulation and opens up possibilities for cleaning, organizing, and analyzing large data sets, including building machine learning models.
When moving from Excel to Python, it takes time to get used to not seeing the rows of numbers and not aimlessly scrolling through them with a mouse. During my many years of habitual Excel use, scrolling through the numbers must have become an unconscious habit, and in Python I still fight the urge to scroll in search of specific data points. The temperature file plotted above has over 31,000 rows; scrolling through it manually in search of any one point is useless. Instead, a specific point is located through inquiry. Much like a sci-fi movie where the protagonist engages a computer in dialogue, asking a series of questions that lead to the solution, in Python you pose your questions as code.
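As a minimal sketch of what "asking a question" looks like in practice, the snippet below queries a tiny pandas DataFrame instead of scrolling for the answer. The column names (`Date`, `Element`, `Data_Value`) follow NOAA's common CSV convention, but treat them as assumptions; your download may differ.

```python
import pandas as pd

# Tiny stand-in for a 31,000-row NOAA file; values are in tenths of a
# degree Celsius, as NOAA reports them.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-15", "2020-07-04", "2020-12-25"]),
    "Element": ["TMAX", "TMAX", "TMIN"],
    "Data_Value": [72, 356, -110],
})

# "What was the hottest TMAX reading, and when?"
tmax = df[df["Element"] == "TMAX"]
hottest = tmax.loc[tmax["Data_Value"].idxmax()]
print(hottest["Date"].date(), hottest["Data_Value"] / 10, "°C")
```

The same one-line query works identically whether the frame has three rows or thirty thousand, which is the point: the question scales, scrolling does not.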
To create the chart above, the data are grouped by calendar day across all 5 stations and 20 years. That yields 365 groups of daily high and low readings, from which the highest high and the lowest low are identified for each day. These form the upper and lower solid black lines on the chart. The 2020 daily temperatures are then compared against those upper and lower curves to find readings that exceed the previous records.
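The grouping step can be sketched in pandas as below. The column names (`Date`, `TMAX`, `TMIN`) and the synthetic random data are assumptions for illustration; NOAA's actual export uses a longer per-station format, so this is not the assignment's exact layout.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the merged station file;
# temperatures are in tenths of a degree Celsius.
dates = pd.date_range("2000-01-01", "2020-12-25", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Date": dates,
    "TMAX": rng.normal(150, 100, len(dates)).round(),
    "TMIN": rng.normal(50, 100, len(dates)).round(),
})

# Drop Feb 29 so every year contributes the same 365 calendar days.
df = df[~((df["Date"].dt.month == 2) & (df["Date"].dt.day == 29))].copy()
df["day"] = df["Date"].dt.strftime("%m-%d")

past = df[df["Date"].dt.year < 2020]
curr = df[df["Date"].dt.year == 2020]

# 365 groups: the record high and record low for each calendar day, 2000-2019.
# These are the two solid black curves.
records = (past.groupby("day")
               .agg(rec_hi=("TMAX", "max"), rec_lo=("TMIN", "min"))
               .reset_index())

# 2020 days that broke a record become the solid red/blue markers.
curr = curr.merge(records, on="day")
new_highs = curr[curr["TMAX"] > curr["rec_hi"]]
new_lows = curr[curr["TMIN"] < curr["rec_lo"]]
```

Grouping on the `"%m-%d"` string rather than on day-of-year number keeps leap years from shifting every date after February 28 by one group.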
The data from NOAA are structured: each value sits in a labeled column, identified by its header. Despite being structured, the data still require cleaning. Over a 20-year period, stations accumulate days with missing data or erroneous readings. Missing data is harmless, but erroneous readings produce unrealistic spikes, drops, or days where the minimum temperature exceeds the maximum. Calculating and querying the daily temperature range helps identify these issues, and a suspect outlier can be compared with readings from a different weather station on the same day. Cleaning, also known as data wrangling or munging, is an unavoidable task, and one that Python simplifies.
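A minimal sketch of that range check, assuming made-up station labels, column names, and values: compute the daily range, flag days where it goes negative, and pull the other stations' readings for the same dates for comparison.

```python
import pandas as pd

# Toy data: station A's 2010-06-02 reading is erroneous
# (TMIN exceeds TMAX). Values are tenths of a degree Celsius.
df = pd.DataFrame({
    "Station": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2010-06-01", "2010-06-02",
                            "2010-06-01", "2010-06-02"]),
    "TMAX": [250, 40, 255, 245],
    "TMIN": [140, 155, 145, 150],
})

# A negative daily range means the recorded minimum exceeded the maximum.
df["range"] = df["TMAX"] - df["TMIN"]
suspect = df[df["range"] < 0]

# Cross-check suspect days against other stations' readings for the same
# dates before deciding whether to drop or correct them.
same_day = df[df["Date"].isin(suspect["Date"]) & ~df.index.isin(suspect.index)]
```

In practice one would also query for implausibly large positive ranges, which catch spikes and drops that the sign test alone misses.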
Below: Chart of California high/low temperatures from January 1, 2000 through Christmas 2020, using data from 4 weather stations in the LA area, as shown on the map.
Below: Chart of Michigan high/low temperatures from January 1, 2000 through Christmas 2020, using data from 4 weather stations in the Ann Arbor area, as shown on the map.