Tuesday, December 1, 2020

How to Load and Explore Household Electricity Usage Data

Given the rise of smart electricity meters and the wide adoption of electricity generation technology such as solar panels, a large amount of electricity usage data is available.

This data represents a multivariate time series of power-related variables, which in turn could be used to model and even forecast future electricity consumption.

In this tutorial, you will discover a household power consumption dataset for multi-step time series forecasting and how to better understand the raw data using exploratory analysis.

After completing this tutorial, you will know:

  • The household power consumption dataset that describes electricity usage for a single house over four years.
  • How to explore and understand the dataset using a series of line plots for the series data and histograms for the data distributions.
  • How to use the new understanding of the problem to consider different framings of the prediction problem, ways the data may be prepared, and modeling methods that may be used.

Let's begin.

How to Load and Explore Household Electricity Usage Data
Photo by Sheila Sund, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Household Power Consumption Dataset
  2. Load Dataset
  3. Patterns in Observations Over Time
  4. Time Series Data Distributions
  5. Ideas on Modeling

Household Power Consumption Dataset

The Household Power Consumption dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010, and observations of power consumption within the household were collected every minute.

It is a multivariate series consisting of seven variables (in addition to the date and time); they are:

  • global_active_power: The total active power consumed by the household (kilowatts).
  • global_reactive_power: The total reactive power consumed by the household (kilowatts).
  • voltage: Average voltage (volts).
  • global_intensity: Average current intensity (amps).
  • sub_metering_1: Active energy for the kitchen (watt-hours of active energy).
  • sub_metering_2: Active energy for the laundry (watt-hours of active energy).
  • sub_metering_3: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

In general terms, active energy is the real power consumed by the household, whereas reactive energy is the unused power in the lines.

We can see that the dataset provides the active power as well as a breakdown of part of the active power by main circuit in the house, specifically the kitchen, laundry, and climate control. These are not all of the circuits in the household.

The remaining watt-hours can be calculated from the active power by first converting it to watt-hours (multiply the kilowatt value by 1000 and divide by 60, since each observation covers one minute), then subtracting the three sub-metered values, which are already in watt-hours.
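The calculation described above can be sketched in Python as follows. The function name is a hypothetical choice for illustration, and the input values are one example reading rather than anything prescribed by the dataset.

```python
def sub_metering_remainder(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    """Active energy (watt-hours) used in one minute outside the three sub-metered circuits."""
    # A one-minute average in kilowatts becomes watt-hours via x1000 (kW -> W) and /60 (minute -> hour).
    total_wh = global_active_power_kw * 1000 / 60
    return total_wh - (sub_1_wh + sub_2_wh + sub_3_wh)


# Example reading: 4.216 kW of active power with sub-meter values of 0, 1 and 17 watt-hours.
remainder = sub_metering_remainder(4.216, 0.0, 1.0, 17.0)
print(round(remainder, 3))  # 52.267
```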

The dataset appears to have been provided without a seminal reference paper.

Nevertheless, this dataset has become a standard for evaluating time series forecasting and machine learning methods for multi-step forecasting, specifically for forecasting active power. Further, it is not clear whether the other features in the dataset may benefit a model in forecasting active power.

Load Dataset

The dataset can be downloaded from the UCI Machine Learning repository as a single 20-megabyte .zip file.

Download the dataset and unzip it into your current working directory. You will now have the file "household_power_consumption.txt". It is about 127 megabytes in size and contains all of the observations.

Inspect the data file.

The following are the first five lines of data (and the header) from the raw data file.

We can see that the data columns are separated by semicolons (';').

The data is reported to have one row for each minute in the time period.

The data have missing values; for example, we can see 2-3 days of missing data around 28/4/2007.

We can start by loading the data file as a Pandas DataFrame and summarizing the loaded data.

We can use the read_csv() function to load the data.

It is easy to load the data with this function, but a little tricky to load it correctly.

Specifically, we need to do a few custom things:

  • Specify the separator between columns as a semicolon (sep=';')
  • Specify that line 0 has the names for the columns (header=0)
  • Specify that we have lots of RAM to avoid a warning that we are loading the data as an array of objects instead of an array of numbers, because of the '?' values for missing data (low_memory=False).
  • Specify that it is okay for Pandas to try to infer the date-time format when parsing dates, which is much faster (infer_datetime_format=True)
  • Specify that we would like to parse the date and time columns together as a new column called 'datetime' (parse_dates={'datetime':[0,1]})
  • Specify that we would like our new 'datetime' column to be the index for the DataFrame (index_col=['datetime']).

Putting all of this together, we can now load the data and summarize the loaded shape and first few rows.
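A minimal sketch of this loading step is shown below. It is not necessarily the original listing: recent pandas releases deprecate infer_datetime_format and the dictionary form of parse_dates, so this version combines the two columns explicitly instead. The function name load_raw is hypothetical, and the column names 'Date' and 'Time' and the day/month/year format are assumptions about the raw file's header and layout.

```python
import pandas as pd


def load_raw(path_or_buffer):
    """Load the semicolon-separated raw file and index it by a combined datetime."""
    df = pd.read_csv(
        path_or_buffer,
        sep=";",           # columns are separated by semicolons
        header=0,          # the first line holds the column names
        low_memory=False,  # read in one pass; '?' markers make columns mixed-type
    )
    # Assumed column names and format: 'Date' as dd/mm/yyyy and 'Time' as hh:mm:ss.
    stamps = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S")
    df = df.drop(columns=["Date", "Time"])
    df.index = pd.DatetimeIndex(stamps, name="datetime")
    return df


# Usage, assuming the unzipped file is in the working directory:
# dataset = load_raw("household_power_consumption.txt")
# print(dataset.shape)
# print(dataset.head())
```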

Next, we can mark all missing values indicated with a '?' character with a NaN value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which is less efficient.
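One way to sketch this cleaning step is below. The function name mark_missing is a hypothetical choice, and float32 is an assumption made here to keep memory use down; float64 would work equally well.

```python
import numpy as np
import pandas as pd


def mark_missing(df):
    """Replace '?' placeholders with NaN and convert every column to float32."""
    cleaned = df.replace("?", np.nan)
    # Columns that held '?' were read as text, so cast everything to a numeric type.
    return cleaned.astype("float32")


# Usage, assuming `dataset` is the DataFrame loaded from the raw file:
# dataset = mark_missing(dataset)
```

An alternative design is to pass na_values="?" to read_csv() so that the markers are converted while loading; the two-step version above follows the order of the text.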

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.
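A short sketch of adding this column follows. The column name 'sub_metering_4' matches the name used later in this tutorial; the function name and the capitalized source column names are assumptions about the loaded DataFrame.

```python
import pandas as pd


def add_sub_metering_4(df):
    """Return a copy of df with the un-metered remainder (watt-hours) as 'sub_metering_4'."""
    out = df.copy()
    metered = out["Sub_metering_1"] + out["Sub_metering_2"] + out["Sub_metering_3"]
    # Convert one-minute kilowatt averages to watt-hours, then subtract what was sub-metered.
    out["sub_metering_4"] = (out["Global_active_power"] * 1000 / 60) - metered
    return out


# Usage, assuming `dataset` has already been cleaned to floats:
# dataset = add_sub_metering_4(dataset)
```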

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as 'household_power_consumption.csv'.

To confirm that we have not messed anything up, we can re-load the dataset and summarize the first five rows.
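The save and re-load round trip might look like the sketch below. Both function names are hypothetical, and the code assumes the DataFrame index is named 'datetime' as described earlier.

```python
import pandas as pd


def save_clean(df, path):
    """Write the cleaned DataFrame, including its datetime index, to a CSV file."""
    df.to_csv(path)


def load_clean(path):
    """Re-load the cleaned CSV with 'datetime' parsed and used as the index."""
    return pd.read_csv(path, header=0, index_col="datetime", parse_dates=["datetime"])


# Usage, assuming `dataset` is the cleaned DataFrame:
# save_clean(dataset, "household_power_consumption.csv")
# dataset = load_clean("household_power_consumption.csv")
# print(dataset.head())
```

Missing values are written as empty fields by to_csv(), and read_csv() reads empty numeric fields back as NaN.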

Tying all of this together gives a complete example of loading, cleaning, and saving the dataset.

Running the example first loads the raw data and summarizes the shape and first five rows of the loaded data.

The dataset is then cleaned up and saved to a new file.

We load this new file and print the first five rows again, showing the removal of the date and time columns and the addition of the new sub-metered column.

We can peek inside the new 'household_power_consumption.csv' file and check that the missing observations are marked with an empty column, which Pandas will correctly read as NaN, for example around row 190,499.

Now that we have a cleaned-up version of the dataset, we can investigate it further using visualizations.

Patterns in Observations Over Time

The data is a multivariate time series, and the best way to understand a time series is to create line plots.

We can start by creating a separate line chart for each of the eight variables.

The complete example is listed below.
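As a rough sketch of such a plot, the function below draws one subplot per column. The function name and figure size are illustrative choices, and the commented usage assumes the cleaned file saved in the previous section.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_all_variables(dataset):
    """Draw one line plot per column, stacked vertically on a shared time axis."""
    n = len(dataset.columns)
    fig, axes = plt.subplots(n, 1, figsize=(10, 2 * n), sharex=True, squeeze=False)
    for ax, name in zip(axes[:, 0], dataset.columns):
        ax.plot(dataset.index, dataset[name].values)
        ax.set_title(name, y=0, loc="left")  # label each subplot with its variable name
    return fig


# Usage, assuming the cleaned CSV exists in the working directory:
# dataset = pd.read_csv("household_power_consumption.csv", header=0,
#                       index_col="datetime", parse_dates=["datetime"])
# plot_all_variables(dataset)
# plt.show()
```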

Running the example creates a single image with eight subplots, one for each variable.

This gives us a really high-level view of the four years of one-minute observations. We can see that something interesting was going on in 'Sub_metering_3' (environmental control) that may not directly map to hot or cold years. Perhaps new systems were installed.

Interestingly, the contribution of 'sub_metering_4' seems to decrease with time, or show a downward trend, perhaps matching up with the sharp increase seen towards the end of the series for 'Sub_metering_3'.

These observations reinforce the need to respect the temporal ordering of subsequences of this data when fitting and evaluating any model.

We might be able to see the wave of a seasonal effect in 'Global_active_power' and some of the other variables.

There is some spiky usage that may match up with a specific period, such as weekends.

Line plot of each variable in the Energy consumption data set

Let's zoom in and focus on 'Global_active_power', or 'active power' for short.

We can create a new plot of the active power for each year to see if there are any common patterns across the years. The first year, 2006, has less than one month of data, so we will leave it out of the plot.

The complete example is listed below.
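A sketch of the per-year plot is given below. It relies on selecting rows from a DatetimeIndex with a year string; the function name and default arguments are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_active_power_by_year(dataset, years=(2007, 2008, 2009, 2010),
                              column="Global_active_power"):
    """Draw one line plot of the chosen column for each year."""
    fig, axes = plt.subplots(len(years), 1, figsize=(10, 2.5 * len(years)), squeeze=False)
    for ax, year in zip(axes[:, 0], years):
        one_year = dataset[column].loc[str(year)]  # all rows whose timestamp falls in this year
        ax.plot(one_year.index, one_year.values)
        ax.set_title(str(year), y=0, loc="left")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# plot_active_power_by_year(dataset)
# plt.show()
```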

Running the example creates a single image with four line plots, one for each full year (or mostly full year) of data in the dataset.

We can see some common gross patterns across the years, such as around February-March and around August-September, where we see a marked decrease in consumption.

We also seem to see a downward trend over the summer months (the middle of the year in the northern hemisphere) and perhaps more consumption in the winter months towards the edges of the plots. These may show an annual seasonal pattern in consumption.

We can also see some patches of missing data in at least the first, third and fourth graphs.

Line Plots of Active Power for most of the years

We can continue to zoom in on consumption and look at active power for each of the 12 months of 2007.

This might help tease out gross structures across the months, such as daily and weekly patterns.

The complete example is listed below.
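The monthly view can be sketched in the same way, this time selecting rows with a year-month string such as "2007-01". The function name and defaults are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_active_power_by_month(dataset, year=2007, column="Global_active_power"):
    """Draw twelve line plots of the chosen column, one per month of the given year."""
    fig, axes = plt.subplots(12, 1, figsize=(10, 24), squeeze=False)
    for ax, month in zip(axes[:, 0], range(1, 13)):
        label = f"{year}-{month:02d}"            # e.g. "2007-01"
        one_month = dataset[column].loc[label]   # all rows within that month
        ax.plot(one_month.index, one_month.values)
        ax.set_title(label, y=0, loc="left")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# plot_active_power_by_month(dataset, year=2007)
# plt.show()
```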

Running the example creates a single image with 12 line plots, one for each month in 2007.

We can see the wave-like rise and fall of power consumption across the days within each month. This is good, as we would expect some kind of daily pattern in power consumption.

We can see that there are stretches of days with minimum consumption, as in August and April. These may represent vacation periods when the house was not occupied and where energy consumption was minimal.

Line Plots for active power for all months in a year

Finally, we can zoom in one more level and take a closer look at power consumption at the daily level.

We would expect there to be some consumption patterns every day and maybe differences in days in the span of a week.

The complete example is listed below.
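A sketch of the daily view follows. It assumes minute-level observations, so that selecting with a full date string returns all of the rows for that day; the function name and defaults are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_active_power_by_day(dataset, year=2007, month=1, n_days=20,
                             column="Global_active_power"):
    """Draw one line plot of the chosen column for each of the first n_days of a month."""
    fig, axes = plt.subplots(n_days, 1, figsize=(10, 1.5 * n_days), squeeze=False)
    for ax, day in zip(axes[:, 0], range(1, n_days + 1)):
        label = f"{year}-{month:02d}-{day:02d}"  # e.g. "2007-01-01"
        one_day = dataset[column].loc[label]     # all minute-level rows within that day
        ax.plot(one_day.index, one_day.values)
        ax.set_title(label, y=0, loc="left")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# plot_active_power_by_day(dataset, year=2007, month=1, n_days=20)
# plt.show()
```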

Running the example creates a single image with 20 line plots, one for each of the first 20 days in January 2007.

There is commonality across the days; for example, on many days consumption starts early in the morning, around 6-7am.

Some days show a drop in consumption in the middle of the day, which might make sense if most of the occupants were out of the house.

We do see some strong overnight consumption on some days, which in the northern hemisphere may match up with a heating system being used in January.

Time of year, specifically the season and the weather that it brings, will be an important factor in modeling this data, as would be expected.

Line plot for active power for 20 days in a month

Time Series Data Distributions

Another important area to consider is the distribution of variables.

For example, it might be interesting to know if the distributions of observations are Gaussian or some other distribution.

We can examine data distributions by examining the histograms.

We can start by creating a histogram for each time series variable.

The complete example is listed below.
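A minimal sketch of the per-variable histograms is below. The function name and the bin count of 100 are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(dataset, bins=100):
    """Draw one histogram per column, stacked vertically."""
    n = len(dataset.columns)
    fig, axes = plt.subplots(n, 1, figsize=(8, 2 * n), squeeze=False)
    for ax, name in zip(axes[:, 0], dataset.columns):
        # Drop NaN values first; the histogram range cannot be computed with them present.
        ax.hist(dataset[name].dropna(), bins=bins)
        ax.set_title(name, y=0, loc="left")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame:
# plot_histograms(dataset)
# plt.show()
```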

Running the example creates a single figure with a separate histogram for each of the 8 variables.

We can see that active and reactive power, intensity, and the sub-metered power all have distributions skewed down towards small watt-hour or kilowatt values.

We can also see that the distribution of voltage data is strongly Gaussian.

Histogram charts for each variable in the Energy consumption data set

The distribution of active power appears to be bimodal, meaning the observations seem to cluster around two distinct central values.

We can investigate this further by looking at the distribution of active power consumption for each of the four full years of data.

The complete example is listed below.
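The per-year histograms could be sketched as follows. Sharing the x-axis across subplots is a design choice made here so that the peaks can be compared directly between years; the function name and defaults are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_active_power_hist_by_year(dataset, years=(2007, 2008, 2009, 2010),
                                   column="Global_active_power", bins=100):
    """Draw one histogram of the chosen column for each year, on a shared x-axis."""
    fig, axes = plt.subplots(len(years), 1, figsize=(8, 2.5 * len(years)),
                             sharex=True, squeeze=False)
    for ax, year in zip(axes[:, 0], years):
        values = dataset[column].loc[str(year)].dropna()  # one year, missing values removed
        ax.hist(values, bins=bins)
        ax.set_title(str(year), y=0, loc="right")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# plot_active_power_hist_by_year(dataset)
# plt.show()
```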

Running the example creates a single figure with four subplots, one for each of the years between 2007 and 2010.

We can see that the distribution of active power consumption across those years looks very similar. The distribution is indeed bimodal, with one peak around 0.3 kW and perhaps another around 1.3 kW.

There is a long tail on the distribution towards higher kilowatt values. It might open the door to notions of discretizing the data and separating it into peak 1, peak 2, or long tail. These groups or clusters of usage for a day or an hour may be helpful in developing a predictive model.

Plots of Active Power histograms for most of the years

It is possible that the identified groups may vary during the seasons of the year.

We can investigate this by looking at the active power distribution for each month in a year.

The complete example is listed below.
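A sketch of the per-month histograms is shown below. The shared x-axis is again an illustrative design choice that makes shifts in the peaks visible from month to month; the function name and defaults are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_active_power_hist_by_month(dataset, year=2007,
                                    column="Global_active_power", bins=100):
    """Draw twelve histograms of the chosen column, one per month, on a shared x-axis."""
    fig, axes = plt.subplots(12, 1, figsize=(8, 24), sharex=True, squeeze=False)
    for ax, month in zip(axes[:, 0], range(1, 13)):
        label = f"{year}-{month:02d}"
        values = dataset[column].loc[label].dropna()  # one month, missing values removed
        ax.hist(values, bins=bins)
        ax.set_title(label, y=0, loc="right")
    return fig


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# plot_active_power_hist_by_month(dataset, year=2007)
# plt.show()
```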

Running the example creates an image with 12 plots, one for each month in 2007.

We can see generally the same data distribution each month. The axes for the plots appear to align (given the similar scales), and we can see that the peaks are shifted down in the warmer northern hemisphere months and shifted up for the colder months.

We can also see a thicker or more prominent tail toward larger kilowatt values for the cooler months of December through to March.

Histogram Plots for Active Power for All Months in One Year

Ideas on Modeling

Now that we know how to load and explore the dataset, we can pose some ideas on how to model the dataset.

In this section, we will take a closer look at three main areas when working with the data; they are:

  • Problem Framing
  • Data Preparation
  • Modeling Methods

Problem Framing

There does not appear to be a seminal publication for the dataset to demonstrate the intended way to frame the data in a predictive modeling problem.

We are therefore left to guess at possibly useful ways that this data may be used.

The data is only for a single household, but perhaps effective modeling approaches could be generalized across to similar households.

Perhaps the most useful framing of the dataset is to forecast an interval of future active power consumption.

Four examples include:

  • Forecast hourly consumption for the next day.
  • Forecast daily consumption for the next week.
  • Forecast daily consumption for the next month.
  • Forecast monthly consumption for the next year.

Generally, these types of forecasting problems are referred to as multi-step forecasting. Models that make use of all of the variables might be referred to as multivariate multi-step forecasting models.

Each of these models is not limited to forecasting the minutely data, but instead could model the problem at or below the chosen forecast resolution.
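As one illustration of modeling at a coarser resolution, the sketch below downsamples the one-minute observations to daily totals with resample(). The function name is hypothetical, and summing is only one possible aggregation.

```python
import pandas as pd


def to_daily_totals(dataset):
    """Sum the one-minute observations into one row per calendar day."""
    return dataset.resample("D").sum()


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# daily = to_daily_totals(dataset)
# print(daily.shape)
# print(daily.head())
```

Because each observation is a one-minute average in kilowatts, dividing the daily sum of 'Global_active_power' by 60 gives kilowatt-hours for the day. Note that sum() skips NaN values by default, so days with missing observations will be under-counted unless the gaps are filled first.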

Forecasting consumption in turn, at scale, could aid in a utility company forecasting demand, which is a widely studied and important problem.

Data Preparation

There is a lot of flexibility in preparing this data for modeling.

The specific data preparation methods and their benefit really depend on the chosen framing of the problem and the modeling methods. Nevertheless, below is a list of general data preparation methods that may be useful:

  • Daily differencing may be useful to adjust for the daily cycle in the data.
  • Annual differencing may be useful to adjust for any yearly cycle in the data.
  • Normalization may aid in reducing the variables with differing units to the same scale.
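The preparation methods listed above can be sketched with a few lines of pandas. Both function names are hypothetical, and min-max scaling is just one of several possible normalization schemes.

```python
import pandas as pd


def difference(series, lag):
    """Subtract the value observed `lag` steps earlier from each observation."""
    return series.diff(lag).dropna()


def normalize(frame):
    """Rescale each column to the range [0, 1] using its own minimum and maximum."""
    return (frame - frame.min()) / (frame.max() - frame.min())


# Usage, assuming `dataset` holds one-minute data and `daily` holds daily totals:
# daily_cycle_removed = difference(dataset["Global_active_power"], lag=1440)  # 1440 minutes per day
# yearly_cycle_removed = difference(daily["Global_active_power"], lag=365)    # 365 days per year
# scaled = normalize(daily)
```

In practice the minimum and maximum used for normalization would be estimated on the training portion only, so that no information from the test period leaks into the model.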

There are many simple human factors that may be helpful in engineering features from the data, that in turn may make specific days easier to forecast.

Some examples include:

  • Indicating the time of day, to account for the likelihood of people being home or not.
  • Indicating whether a day is a weekday or weekend.
  • Indicating whether a day is a public holiday or not.

These factors may be significantly less important for forecasting monthly data, and perhaps to a degree for weekly data.

More general features may include:

  • Indicating the season, which may influence the type or amount of environmental control systems being used.
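Several of these indicators can be derived directly from the datetime index, as in the sketch below. The function and feature names are hypothetical, the month is used as a simple stand-in for season, and a public holiday flag is left out because it would require an external calendar.

```python
import pandas as pd


def add_calendar_features(dataset):
    """Return a copy of dataset with hour-of-day, weekend, and month columns added."""
    out = dataset.copy()
    out["hour"] = out.index.hour                                # time of day
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)  # Saturday=5, Sunday=6
    out["month"] = out.index.month                              # rough proxy for season
    return out


# Usage, assuming `dataset` is the cleaned DataFrame indexed by datetime:
# dataset = add_calendar_features(dataset)
```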

Modeling Methods

There are perhaps four classes of methods that might be interesting to explore on this problem; they are:

  • Naive Methods.
  • Classical Linear Methods.
  • Machine Learning Methods.
  • Deep Learning Methods.

Naive Methods

Naive methods would include methods that make very simple, but often very effective assumptions.

Some examples include:

  • Tomorrow will be the same as today.
  • Tomorrow will be the same as this day last year.
  • Tomorrow will be an average of the last few days.
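The three naive strategies above can be sketched as one-line functions over a daily series. All names are hypothetical, and the sketch assumes a Series with one value per day and a complete daily DatetimeIndex.

```python
import pandas as pd


def persistence_forecast(daily):
    """Tomorrow will be the same as today: repeat the last observed value."""
    return daily.iloc[-1]


def same_day_last_year_forecast(daily, target_day):
    """Tomorrow will be the same as this day last year."""
    return daily.loc[pd.Timestamp(target_day) - pd.DateOffset(years=1)]


def recent_average_forecast(daily, n_days=7):
    """Tomorrow will be the average of the last n_days observations."""
    return daily.iloc[-n_days:].mean()


# Usage, assuming `daily` is a Series of daily totals ending on 2010-11-25:
# print(persistence_forecast(daily))
# print(same_day_last_year_forecast(daily, "2010-11-26"))
# print(recent_average_forecast(daily, n_days=7))
```

Simple baselines like these are useful as a point of comparison: a more elaborate model is only worth its complexity if it beats them.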

Classical Linear Methods

Classical linear methods include techniques that are very effective for univariate time series forecasting.

Important examples include:

  • ETS (triple exponential smoothing)

They would require that the additional variables be discarded and the parameters of the model be configured or tuned to the specific framing of the dataset. Concerns related to adjusting the data for daily and seasonal structures can also be supported directly.

Machine Learning Methods

Machine learning methods require that the problem be framed as a supervised learning problem.

This would require that lag observations for a series be framed as input features, discarding the temporal relationship in the data.
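A minimal sketch of this supervised framing is shown below: each row pairs the previous n_lags observations (inputs) with the current observation (target). The function and column names are hypothetical.

```python
import pandas as pd


def series_to_supervised(series, n_lags):
    """Build a table of lagged inputs 't-n' ... 't-1' and the target 't'."""
    columns = {f"t-{i}": series.shift(i) for i in range(n_lags, 0, -1)}
    frame = pd.DataFrame(columns)
    frame["t"] = series
    # The first n_lags rows have no complete history, so they are dropped.
    return frame.dropna()


# Usage, assuming `daily` is a Series of daily active power totals:
# supervised = series_to_supervised(daily, n_lags=7)
# X, y = supervised.drop(columns="t"), supervised["t"]
```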

A suite of nonlinear and ensemble methods could be explored, including:

  • k-nearest neighbors.
  • Support vector machines
  • Decision trees
  • Random forest
  • Gradient boosting machines

Careful attention is required to ensure that the fitting and evaluation of these models preserves the temporal structure in the data. This is important so that the method is not able to 'cheat' by harnessing observations from the future.
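One simple way to respect the temporal order is to split the data at a date rather than at random, as sketched below. The function name and the example split date are illustrative assumptions.

```python
import pandas as pd


def time_ordered_split(dataset, split_date):
    """Split into train (strictly before split_date) and test (on or after split_date)."""
    cutoff = pd.Timestamp(split_date)
    train = dataset.loc[dataset.index < cutoff]
    test = dataset.loc[dataset.index >= cutoff]
    return train, test


# Usage, assuming `daily` is a DataFrame of daily totals indexed by datetime:
# train, test = time_ordered_split(daily, "2010-01-01")
# print(train.shape, test.shape)
```

Every training observation precedes every test observation, so a model fit on the training portion never sees data from the period it is evaluated on.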

These methods are often agnostic to large numbers of variables and may aid in teasing out whether the additional variables can be harnessed and add value to predictive models.

Deep Learning Methods

Generally, neural networks have not proven very effective at autoregression type problems.

Nevertheless, techniques such as convolutional neural networks are able to automatically learn complex features from raw data, including one-dimensional signal data. And recurrent neural networks, such as the long short-term memory network, are capable of directly learning across multiple parallel sequences of input data.

Further, combinations of these methods, such as CNN LSTM and ConvLSTM, have proven effective on time series classification tasks.

It is possible that these methods may be able to harness the large volume of minute-based data and multiple input variables.

Summary

In this tutorial, you discovered a household power consumption dataset for multi-step time series forecasting and how to better understand the raw data using exploratory analysis.

Specifically, you learned:

  • The household power consumption dataset that describes electricity usage for a single house over four years.
  • How to explore and understand the dataset using a suite of line plots for the series data and histogram for the data distributions.
  • How to use the new understanding of the problem to consider different framings of the prediction problem, ways the data may be prepared, and modeling methods that may be used.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
