Citation: Hien, NLH and Kor, A-L (2022) Analysis and Prediction Model of Fuel Consumption and Carbon Dioxide Emissions of Light-Duty Vehicles. Applied Sciences, 12 (2). p. 803. ISSN 2076-3417 DOI: https://doi.org/10.3390/app12020803
Link to Leeds Beckett Repository record: https://eprints.leedsbeckett.ac.uk/id/eprint/8347/
Document Version: Article (Published Version)
Creative Commons: Attribution 4.0

Article

Analysis and Prediction Model of Fuel Consumption and Carbon Dioxide Emissions of Light-Duty Vehicles

Ngo Le Huy Hien and Ah-Lian Kor *

School of Built Environment, Engineering and Computing, Leeds Beckett University, Leeds LS6 3HF, UK; n.hien2994@student.leedsbeckett.ac.uk
* Correspondence: a.kor@leedsbeckett.ac.uk; Tel.: +44-113-812-3243

Citation: Hien, N.L.H.; Kor, A.-L. Analysis and Prediction Model of Fuel Consumption and Carbon Dioxide Emissions of Light-Duty Vehicles. Appl. Sci. 2022, 12, 803. https://doi.org/10.3390/app12020803

Academic Editor: Juan Francisco De Paz Santana
Received: 1 December 2021
Accepted: 8 January 2022
Published: 13 January 2022

Abstract: Due to the alarming rate of climate change, fuel consumption and emission estimates are critical in determining the effects of materials and stringent emission control strategies. In this research, an analytical and predictive study has been conducted using the Government of Canada dataset, containing 4973 light-duty vehicles observed from 2017 to 2021, delivering a comparative view of different brands and vehicle models by their fuel consumption and carbon dioxide emissions. Based on the findings of the statistical data analysis, this study makes evidence-based recommendations to both vehicle users and producers to reduce their environmental impacts. Additionally, Convolutional Neural Networks (CNN) and various regression models have been built to estimate fuel consumption and carbon dioxide emissions for future vehicle designs. This study reveals that the Univariate Polynomial Regression model is the best model for predictions from one vehicle feature input, with up to 98.6% accuracy. Multiple Linear Regression and Multivariate Polynomial Regression are good models for predictions from multiple vehicle feature inputs, with approximately 75% accuracy. The Convolutional Neural Network is also a promising method for prediction because of its stable and high accuracy of around 70%. The results contribute to the quantifying process of energy cost and air pollution caused by transportation, followed by proposing relevant recommendations for both vehicle users and producers. Future research should aim towards developing higher performance models and larger datasets for building APIs and applications.

Keywords: carbon dioxide emissions; light-duty vehicles; fuel consumption; regression models; machine learning; convolutional neural network; prediction model; estimation model; climate change

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

With the accelerated growth of urbanization, environmental issues caused by transportation have become increasingly challenging because of their significant negative impact on climate change [1]. Although the COVID-19 pandemic (commencing in 2020) temporarily lessened the amount of greenhouse gas emitted into the atmosphere, the temperature of the planet continues to rise due to ever-increasing air pollutants [2]. Moreover, 20 to 30% of global greenhouse gases (GHG) are emitted from passenger and freight transportation [3], and passenger cars account for about 75% of the carbon dioxide emitted by road transportation in the European Union [4]. Despite stringent fuel and greenhouse gas emission standards, the number of vehicles in use has significantly increased, corresponding with the rise in vehicle miles traveled (VMT), leading to their large share of air pollutant emissions and natural resource consumption [5].

Estimating and visualizing fuel consumption and exhaust emissions are critical for quantifying the energy cost and air pollution caused by transportation [6], as well as for detailing emission control strategies [7]. Because climate change has been a pressing concern over the past decade, estimation models of CO2 emissions and fuel consumption from vehicles are of increasing significance. This has prompted global interest among researchers and engineers in applied research (in the areas of data analytics and machine learning) for sustainability [8,9].
Although many studies have introduced various machine learning models and techniques for the estimation of carbon dioxide emissions and fuel consumption, the trend focuses more on optimizing models than on using vehicle metrics to analyze different vehicle types and brands [8,10,11]. Therefore, a comparative study of different types of vehicles and their effect on the environment is significant for the vehicle market, as it provides deep insights into their environmental impacts. This research addresses the identified gap by providing insight into vehicle fuel consumption and carbon dioxide emissions through a series of rigorous data analytics and machine learning techniques. It is worthwhile to note that the data analysis and machine learning techniques applied in this research are transferable to similar datasets. The following research objectives (RO) support the aim of this research.

• RO1: To carry out a thorough systematic literature review of fuel consumption and carbon dioxide emissions for new light-duty vehicles for retail sale (use case: in Canada);
• RO2: To identify suitable datasets for analysis and implement the data preparation process;
• RO3: To utilize appropriate indicators to measure and analyze the sustainable impact of vehicles;
• RO4: To implement the following data analytics methodologies on the final dataset by addressing corresponding research questions (RQ).
  1. Level 1: Descriptive Statistical Analysis
     – RQ1.1 How do light-duty vehicles compare in terms of fuel consumption and CO2 emission?
     – RQ1.2 How have patterns of fuel consumption and emission of each vehicle type changed throughout the selected period?
  2. Level 2: Inferential Statistical Analysis
     – RQ2.1 Is there any particular distribution for fuel consumption in the city and the highway of vehicles in Canada?
     – RQ2.2 Is there a notable difference in the performance of one specific vehicle (or fuel) type in comparison to the rest of the vehicle types in Canada?
     – RQ2.3 How does the brand, model, vehicle class, engine size, cylinder, transmission type, and fuel type correlate with consumption and emissions of various vehicles?
     – RQ2.4 What are the relationships between all features to each other of the entire dataset?
  3. Level 3: Machine Learning
     – RQ3.1 Can fuel consumption and carbon dioxide emission data, and other input metrics be utilized to predict outputs in upcoming years in Canada?
     – RQ3.2 Is it possible to build Machine Learning models that use vehicle specifications data to predict their fuel consumption and carbon dioxide emission?
  4. Level 4: Deep Learning
     – RQ4.1 Is it possible to construct Deep Learning models that use vehicle specifications data to predict their fuel consumption and carbon dioxide emission?
• RO5: To make recommendations and possible regulations and define areas of future research.

To implement and address the listed research objectives, an analytical and predictive study has been conducted on the Government of Canada dataset, containing 4973 light-duty vehicles observed from 2017 to 2021. Using the above-mentioned four levels of data analytics methodology (i.e., Descriptive Statistical Analysis, Inferential Statistical Analysis, Machine Learning, and Deep Learning), the study unravels current trends and provides a comparative analysis of fuel consumption and carbon dioxide emissions across different brands, vehicle models, vehicle class, cylinders, engine size, transmission, fuel type, smog rating, and fuel consumption within a city and on a highway. The research also predicts these features in the upcoming year and builds up a predictive model for fuel consumption and carbon dioxide emission based on relevant car specifications.
The results contribute to the quantifying process of energy cost and air pollution caused by transportation, followed by proposing relevant recommendations for both vehicle users and producers. The prediction results from this study do not account for abrupt factors, such as legislative requirements, unpredictable economic crises, or similar unforeseen interruptions.

2. Literature Review

With the current alarming rate of climate change, due attention ought to be given to the environmental impact of fuel consumption and emissions from light-duty vehicles, particularly passenger cars. Vehicle emissions can be classified into two principal categories: exhaust emissions that are dangerous for air quality and human health, and emissions that contribute towards climate change. The emission that has the most significant effect on climate change is carbon dioxide (CO2), which represents the largest proportion of greenhouse gas (GHG) emissions. Notably, road transportation emits about one-fifth of the total emissions of carbon dioxide in the European Union, 75% of which arises from passenger cars [4]. Moreover, the relation between fuel consumption and CO2 is direct and strong [12]. In the European Union (EU), average fleet emission limits are stated in terms of CO2 emissions, in grams per kilometer. In North America (i.e., the United States (US) and Canada), similar measures have been used, but with limits imposed in terms of fuel economy.

Electric vehicles are a critical step in the transportation sector’s decarbonization. However, the International Energy Agency estimates that, by 2030, at least 20% of all road transport vehicles (approximately 300 million vehicles) need to be powered by electricity in order to keep global warming below 2 °C [13]. Consequently, light-duty vehicles with low carbon intensity will continue to play a significant role during the transition.
Moreover, legislative requirements have been discussed globally; for example, the European Union (EU) has adopted a climate change agenda to reduce GHG emissions by over 55% by 2030 compared to 1990 [14] and to become a net-zero GHG emission economy by 2050 [15]. In addition, the Government of Canada has set the target of reducing its emissions by 40–45% by 2030 and committed to achieving net-zero emissions by 2050 to avert the worst effects of climate change [16]. Therefore, to satisfy those limits on CO2 and achieve such high targets from legislative requirements, many researchers worldwide have proposed different vehicle emissions and consumption models. The systematic process for this literature review is to specify current approaches that have been used by various researchers and to identify which models and methodologies have been used in each approach, before identifying the research gap.

2.1. Vehicle Emissions Estimation Models

A number of vehicle emissions estimation models have been introduced by different researchers in recent decades. Using look-up tables, a micro-scale model called CORSIM is built to estimate emissions based on dynamometer data. To ascertain the total emissions of each link, the CORSIM model applies default emission rates per second to each vehicle that travels on the given link, based on acceleration and speed [17]. EMIT is a model for estimating HC, CO2, CO, and NOx, which is built from dynamometer data of 344 light-duty vehicles and employs a regression equation with acceleration and speed [18]. At the project or regional level, a United States agency proposed a model called MOVES in 2010 for the estimation of greenhouse gas emissions: CO, VOCs, PM, and NOx generated from light-duty vehicles [19]. Features such as vehicle mass, total resistance force, velocity, acceleration, and driveline performance have been employed by Rakha and colleagues to build a model for estimating CO2 emissions using instantaneous vehicle power [20].
A function of acceleration and velocity observed from a dynamometer experiment has been applied to the INTEGRATION model for the estimation of emissions from measured fuel consumption. Additionally, it is further developed for the simulation and optimization of trip-based microscopic traffic [21]. A model named CMEM, which uses 55 parameters, is proposed by a group of researchers for a wide range of light-duty vehicles. For dynamometer testing, this model uses emissions per second data of CO, CO2, NO, and HC, along with physical vehicle features (engine size, vehicle mass, and aerodynamic drag coefficient) and operating features (acceleration and speed) [22]. Another example of using data-intensive parameters is MEASURE, which was developed at the Georgia Institute of Technology. It calculates the emissions of NOx, CO, and VOCs from vehicle operating modes, including acceleration, deceleration, cruise, and idling. However, CO2 estimation is not included in this model, although it takes over 30 features as its inputs [23]. Another well-known framework, called COPERT, has been developed by the European Environment Agency (EEA) and has become one of the standard methodologies for road transport emission inventories in EEA member countries [24]. It estimates primary air pollutants (CO, NOx, PM, VOC, SO2, NH3, heavy metals) and greenhouse gas emissions (CO2, N2O, CH4) using functions of the mean traveling speed throughout a complete driving cycle [25]. However, the framework neglects other characteristics when estimating the emissions of a specific vehicle, such as engine size, cylinders, and engine model.

Furthermore, some recent authors have applied Machine Learning and Deep Learning methodologies to vehicle emission models. Toth-Nagy and colleagues, for instance, have proposed a model using an Artificial Neural Network to predict emissions of NOx and CO from heavy-duty vehicles.
Though the outcome is positive, CO2 is again not included, and the model is appropriate for gasoline vehicles [26]. When testing on the real-world driving conditions of 70 diesel vehicles, a group of researchers implemented a machine learning model to make projections of emissions alongside the performance of vehicles. A look-up table, non-linear regression, and Neural Network Multilayer Perceptron models are consequently applied for instantaneous NOx predictions. Despite the model taking inputs of vehicle acceleration and speed, its outputs focus only on NOx estimation, and CO2 remains excluded [27]. Qing et al. have built a model for estimating vehicle emission rates, including CO, CO2, HC, and NOx from vehicle idling, by using a Portable Emission Measurement System. The dataset is collected from actual driving tests; Boosted and Bagged Decision Trees are introduced as a reliable prediction model for vehicle emissions estimation [28]. It can be seen that the application of Machine Learning and Deep Learning techniques to predicting carbon dioxide emissions remains limited and needs further development, which is therefore the principal goal of this study.

2.2. Vehicle Consumption Estimation Models

On the other hand, some researchers have focused on the fuel consumption of vehicles rather than CO2 emissions, as fuel consumption (and economic costs) seem to be more relevant to consumers in general. Vehicle fuel consumption models are classified into two categories: theoretical fuel consumption models and statistical fuel consumption models [29]. The theoretical fuel consumption model concentrates on the operation features of the vehicle, such as output power and engine parameters, while the statistical fuel consumption model draws on statistical attributes of vehicle activity and fuel consumption data, including acceleration and speed [30].
One of the fuel consumption models is based on a novel macroscopic model that considers trip time and intersection distance for prediction [31]. Using the distribution of Vehicle Specific Power, a fuel consumption prediction model is proposed by Qi et al., which comprises a fuel consumption model and a traffic condition predictor to provide a real-time prediction. From this, an API is developed for fuel consumption estimation, using on-board diagnostic (OBD) data for verification, with a 20% forecasting error. By collecting driving behavior data from consumers’ smartphones, a prediction model of fuel consumption is developed based on a backpropagation (BP) neural network, random forests, and support vector regression with a relative error of less than 10%. It is also found that the average acceleration and deceleration, acceleration time percentage, deceleration time percentage, and cruising time percentage are major indicators for fuel consumption estimation [10]. Furthermore, Tamer et al. have proposed an approach to estimate fuel consumption from the onboard vehicle information system Onboard Diagnoses-II (OBD-II) using a Support Vector Machine and Lagrange interpolation. The model successfully provided precise fuel consumption with a square root mean difference of 2.43 [32]. Applying a Machine Learning model, a neural-network-based fuel prediction model is presented that utilizes seven predictors obtained from road grade and vehicle speed. It could optimize fuel usage over the entire fleet, with a peak-to-peak error rate of less than 4% in both city and highway [11].

Furthermore, vehicle emission and consumption can be predicted with one single model. For example, by using GPS Big Data, an N-Dimensional framework is proposed by a group of researchers for estimating and visualizing fuel consumption and emissions.
They stated that analyzing GPS big data generated from vehicles can deliver practical insight on the quantity and distribution of energy use and emissions in real-world driving conditions (acceleration, idle, cruise, and deceleration). The model reports a prediction accuracy of 88.6% [8]. Additionally, several statistical models of vehicle emissions and fuel consumption, which are published by Alessandra et al., could be integrated to predict the spatial and temporal distribution of traffic emissions and fuel consumption [18].

Overall, it can be seen from the mentioned studies that numerous researchers have proposed different models for estimating carbon dioxide emissions and fuel consumption using micro-scale methodologies, or Machine Learning and Deep Learning. The common vehicle characteristics for building these models are engine size, vehicle mass, and aerodynamic drag coefficient, and the standard operating features used are acceleration and speed. The research trend generally emphasizes improving estimation models rather than analyzing different vehicle types and brands using vehicle measurements, which leaves users and manufacturers with only a limited market analysis. As a result, for a better knowledge of the vehicle market and its environmental effects, a comparative view of different types of vehicles and their influence on the environment is significant. Based on these metric analyses, recommended prediction models should be built using selective vehicle features. This identified gap provides the basis for the aim and objectives of this research.

3. Methodology

3.1. Macro Methodology

In this study, the dataset used to conduct the analytical and predictive study of fuel consumption and carbon dioxide emissions of vehicles is collected by the Government of Canada. A data analytics life cycle has been adopted for this research.
This life cycle is a standard for Data Science and Big Data Analytics purposes, adopted from EMC Education Services [33], and contains six phases, as indicated in Figure 1. The first stage of this process is discovery, where the problem, context, hypothesis, and objectives that the data are used for are determined. The main goals of this study are to deliver a comparative view of fuel consumption and carbon dioxide emissions from different brands and vehicle models, to make evidence-based recommendations, and to construct a model to predict changes in the future consumption and emission rate.

The dataset used in this study is derived from the ‘Fuel consumption rating’ datasets from the Government of Canada, which contain fuel consumption ratings and measured CO2 emissions for 4974 samples of light-duty vehicles in Canada [34]. The data were originally gathered from vehicle manufacturers, who compile the fuel consumption and CO2 rating data using standardized, monitored laboratory testing and analytical procedures. Then, a 5-cycle testing process is used by manufacturers to simulate common driving conditions and styles. The approach also includes testing for city and highway driving, as well as driving in cold weather, using air conditioners, and driving at faster speeds with higher acceleration and braking [35]. Note that the CO2 and smog ratings given in the dataset were generated from the original ratings by manufacturers, not from vehicle testing. Consequently, the collected fuel consumption and CO2 emissions data from newly produced vehicles are used in this study for data analytics purposes.

Figure 1. The data analytics life cycle.

In Phase 2 (Data Preparation), the dataset has then been processed and compressed into one single spreadsheet.
To scope down the research analysis, the data of 4974 light-duty vehicles collected annually from 2017 to 2021 are merged and aggregated, with several categories renamed, including fuel consumption and carbon dioxide emissions from different brands, vehicle models, vehicle class, cylinders, engine size, transmission, fuel type, smog rating, and fuel consumption in a city and on a highway. Next, the dataset has been checked, and no issues or missing values were found. Subsequently, the dataset is cleaned to filter out any data that are not necessary for analysis purposes. For instance, one record is removed from the dataset since it is the only record containing the unique brand named ‘super’ (which can be considered an error record, as there is no brand carrying that name), leading to a final dataset of 4973 records.

In Phases 3 and 4 (Model Planning and Building), the dataset is analyzed and visualized by using four levels of data analytics methodology: Descriptive Statistical Analysis, Inferential Statistical Analysis, Machine Learning, and Deep Learning. Specific categories of all algorithms are discussed in Section 3.2. Finally, in Phases 5 and 6, relevant results on machine learning analytics and predictions are communicated and presented in detail in Sections 4 and 5 on Results and Discussion. Final reports, briefings, and code snippets are also presented in the rest of this paper.

3.2. Micro Methodology

In this paper, the term “micro methodology” refers to the micro-level data analysis methodology. This includes data analysis methods that are critically discussed (supported by embedded citations) in terms of the measurements, approaches, and algorithms that will be employed. In particular, four levels of data analytics are applied, as listed below.

3.2.1. Level 1: Descriptive Statistics

This level comprises basic calculations of central tendency (mean, median, mode) and dispersion statistics (standard deviation, variance, range).
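As a minimal illustration of the Level 1 measures just named, the sketch below computes them with Python's standard `statistics` module. The five fuel consumption values and the `describe` helper are hypothetical and are not drawn from the study's dataset; the sample (n - 1) forms of standard deviation and variance are an assumption, since the text does not state which form was used.

```python
import statistics

# Hypothetical total fuel consumption values (L/100 km); NOT from the study's dataset.
fuel_total = [8.0, 9.0, 12.0, 12.0, 15.0]


def describe(values):
    """Return central-tendency and dispersion statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),     # a list, because ties are possible
        "std": statistics.stdev(values),          # sample standard deviation (assumed form)
        "variance": statistics.variance(values),  # sample variance (assumed form)
        "range": max(values) - min(values),
    }


summary = describe(fuel_total)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Applied column by column, the same helper would yield the kind of summary reported for each numerical feature of a dataset.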
A list of comparative statistics of fuel consumption and CO2 emission has been presented for each brand, model, engine size, vehicle class, transmission and cylinder type, and fuel type, giving a comprehensive outlook of emissions and consumption of various vehicle types and brands. The changes of the patterns through the years are also indicated before progressing to time-series changes of the greenest and the least environmentally friendly vehicle brands.

3.2.2. Level 2: Inferential Statistics

The dataset is examined with different types of statistical tests for various purposes.

• t-test: has been conducted to compare the mean fuel consumption in the city and on the highway for the same vehicle;
• ANOVA: compares the means of total fuel consumption and carbon dioxide emissions for each vehicle class and fuel type over time to determine whether each fuel type (or vehicle class) is significantly different from the rest;
• Correlation: a heat map of correlation coefficients is shown to illustrate the direction and strength of a linear relationship among vehicle features in pairs. Moreover, a comparison of the importance of features for predicting CO2 Emissions and Total Fuel Consumption has been conducted, which is an important test before advancing to Levels 3 and 4;
• Chi-Square: two Chi-Square Goodness of Fit tests have been carried out to investigate whether there is a significant difference between the observed values (data in 2021) and the expected values (data from 2017 to 2020). Additionally, a chain of Chi-Square tests of Independence has been implemented to determine the relationships between all features, with the results presented in a heat map.

3.2.3. Level 3: Machine Learning

In order to answer RQ3.1, input features from the dataset have been used to predict values in upcoming years:

• Time Series Regression: has been used since it can forecast a future response using the historical responses and the transfer of dynamics from related predictors. Different models are applied in this study, including persistence models (using walk-forward validation), autoregression models (using the autoregression function from statsmodels), and an optimized autoregression model (using walk-forward validation over time steps). These models are evaluated by the Root Mean Square Error (RMSE) value, which measures the differences between the predicted values and the observed values.

To determine whether Machine Learning models can use vehicle specifications data to predict fuel consumption and CO2 emission (RQ3.2), different models are built in this study and classified into two groups: models that predict a variable from a single variable, and models that predict a variable from multiple variables.

For building Machine Learning models to estimate a variable from a single variable, data on engine size, number of cylinders, and fuel consumption in a city and on a highway have been used to predict total fuel consumption and CO2 emissions. Moreover, total fuel consumption and CO2 emission data were used to predict each other. This research uses relevant methodologies to model relationships between those variables, which include:

• Linear Regression: using the sklearn model, with the dataset split into training and testing sets with an 80%:20% ratio;
• Univariate Polynomial Regression: using the sklearn model and 5 different degrees (from Degree 1 to Degree 5).

Regarding Machine Learning models used for estimating a variable from multiple variables, groups of data, including group A (model year, engine size, and cylinders) and group B (engine size and cylinders), have been used to predict total fuel consumption and CO2 emissions.
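The single-variable workflow listed above (an 80%:20% split and polynomial degrees 1 to 5 fitted with sklearn) can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the data are synthetic, the quadratic relationship between engine size and CO2 emissions is invented for the example, and scoring with `r2_score` on the held-out set is one reasonable reading of how the models are evaluated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the study's dataset): engine size in litres and
# a noisy, invented quadratic "CO2 emissions" response in g/km.
rng = np.random.default_rng(0)
engine_size = rng.uniform(1.0, 8.4, size=500).reshape(-1, 1)
co2 = (120 + 45 * engine_size[:, 0] - 2 * engine_size[:, 0] ** 2
       + rng.normal(0, 10, size=500))

# 80%:20% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    engine_size, co2, test_size=0.2, random_state=0
)

# Fit one polynomial model per degree (1 to 5) and score it on the held-out set.
r2_by_degree = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    r2_by_degree[degree] = r2_score(y_test, model.predict(X_test))

for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: R^2 = {r2:.3f}")
```

Degree 1 in this loop is ordinary linear regression, so the same pipeline covers both single-variable models; passing a feature matrix with several columns instead of one would give the multivariate polynomial variant.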
Furthermore, data on fuel consumption in cities and on highways were also used to estimate the total fuel consumption of vehicles. The applied models are listed as follows:

• Multiple Linear Regression: using the sklearn model, with the dataset split into training and testing sets with an 80%:20% ratio;
• Logarithmic Regression: using the sklearn model with log transformed predictor values and exponential transformed predictor values;
• Exponential Regression: the dataset is split into training and testing sets with a 75%:25% ratio;
• Transformation of data: the dataset is split into training and testing sets with a 75%:25% ratio;
• Multivariate Polynomial Regression: using the sklearn model and 5 different degrees (from Degree 1 to Degree 5).

These models are chosen because many variables can be used at the same time to examine the statistical significance of each variable and transform them into independent variables. These forms of regression models also support the prediction of the dependent (or target) variables for later analysis [36]. In this paper, the coefficient of determination (R squared) has been used to evaluate the above-mentioned models. The R squared value is a statistical measurement that examines how differences in one variable can be explained by differences in a second variable. It ranges from 0 to 1; the higher the R squared value, the better the model can be used for prediction.

3.2.4. Level 4: Deep Learning

In addition, a Convolutional Neural Network (CNN) is used in this study to predict a variable from multiple variables. Since a CNN is normally used for image classification, this research adapts it to regression problems by reshaping the input data for a one-dimensional convolutional network. This enables the model to simulate numerical input data using learnable weights and biases [37]. The dataset has two dimensions, which are the number of rows and columns (i.e., 4973 rows and 3 columns).
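The shape that a one-dimensional convolution expects can be illustrated in NumPy before any network is built. The sketch below is an illustration only: the array is random rather than the study's data, the kernel weights are arbitrary rather than learned, and the hand-written `conv1d_valid` helper is a hypothetical stand-in for what a single convolutional filter computes without bias or activation.

```python
import numpy as np

# Random stand-in for a table of 4973 rows and 3 numeric input columns.
rng = np.random.default_rng(0)
features_2d = rng.normal(size=(4973, 3))

# Add a trailing axis so each row becomes a length-3 sequence with one channel.
features_3d = features_2d.reshape(features_2d.shape[0], features_2d.shape[1], 1)


def conv1d_valid(sequences: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one kernel along axis 1 of (rows, steps, 1) data, without padding."""
    size = kernel.shape[0]
    steps = sequences.shape[1] - size + 1
    # Collect every window of `size` consecutive steps: (rows, steps, size).
    windows = np.stack(
        [sequences[:, i:i + size, 0] for i in range(steps)], axis=1
    )
    # Weighted sum over each window gives one output per position: (rows, steps).
    return windows @ kernel


kernel = np.array([0.5, -0.5])  # illustrative weights, not learned
feature_map = conv1d_valid(features_3d, kernel)
print(features_2d.shape, features_3d.shape, feature_map.shape)
```

With a kernel of size 2 sliding over 3 features, each row yields 2 outputs; in a trained network these weights would be learned, and several such filters would feed the subsequent layers.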
Therefore, to reshape the data, a third dimension of size one has been added, so that each input row is treated as a sequence of three single-value steps (i.e., the shape becomes [4973, 3, 1]). Subsequently, the data are split into training and testing sets with an 80:20 ratio. Moreover, the Keras Conv1D class is used to add a one-dimensional convolutional layer to the model. Flatten and Dense layers are also added, and the model is compiled with an optimizer. Finally, the trained model is used to predict the test data. This is evaluated by checking the mean squared error (MSE) of the predicted results.

4. Results and Discussion

This section is structured based on the Micro Methodology described in Section 3.2, and is divided into four levels of data analytics.

4.1. Level 1: Descriptive Statistics

The general purpose of Level 1 is to observe 4973 light-duty vehicles from 2017 to 2021 by their fuel consumption and carbon dioxide emissions across different brands, vehicle models, vehicle class, cylinders, engine size, transmission, fuel type, smog rating, and fuel consumption in a city and on a highway. Recall that the CO2 and smog ratings in the dataset were calculated using manufacturer ratings rather than vehicle testing, and were ranked from worst (1) to best (10) with no unit.

Firstly, in order to address RQ1.1 (How do light-duty vehicles compare in terms of fuel consumption and carbon dioxide emission?), descriptive statistics for all numerical columns in the dataset have been computed to provide an evaluation of the data distribution. The purpose of descriptive statistics is to provide a statistical understanding of the dataset quality [36]. It can be seen from Table 1 that the average total fuel consumption is 10.86 L/100 km, of which 57.77% (12.36 L/100 km) is from the city and 42.22% (9.04 L/100 km) is from the highway. Additionally, it is clear from the statistics that the average CO2 emissions of all vehicles are 251.44 g/km, with a standard deviation of 58.85 g/km.
Ranking from worst (1) to best (10), the average CO2 rating is 4.60, and the average smog rating is 4.63. Moreover, the dispersion statistics (standard deviation and variance) also indicate that the spread of values is reliable enough for prediction. Regarding the fuel consumption and carbon dioxide emissions of different brands, their average data are shown in Table 2.

Table 1. Descriptive statistics of numerical columns of the dataset.

| Feature | Mean | Standard Deviation | Min | Max | Variance |
|---|---|---|---|---|---|
| Engine Size (L) | 3.120 | 1.345 | 1.0 | 8.4 | 1.809 |
| Cylinders | 5.599 | 1.882 | 3.0 | 16.0 | 3.542 |
| Fuel Consumption in City (L/100 km) | 12.363 | 3.355 | 4.0 | 30.3 | 11.256 |
| Fuel Consumption in Highway (L/100 km) | 9.036 | 2.086 | 3.9 | 20.9 | 4.351 |
| Total Fuel Consumption (L/100 km) | 10.865 | 2.747 | 4.0 | 26.1 | 7.548 |
| CO2 Emissions (g/km) | 251.436 | 58.851 | 94.0 | 608.0 | 363.459 |
| CO2 Rating | 4.601 | 1.6588 | 1.0 | 10.0 | 2.752 |
| Smog Rating | 4.635 | 1.807 | 1.0 | 8.0 | 3.265 |

In this dataset, Ford accounts for the highest number of vehicles (436), and Bugatti for the lowest (6). After the descriptive statistical analysis, a bar chart is created, as presented in Figure 2, to show the average fuel consumption of different brands. It reveals that Honda consumes the least fuel (8.03 L/100 km), while Bugatti has the highest fuel consumption (22.98 L/100 km). Moreover, from Figures 3 and 4, Honda seems to be the greenest brand, as it emits the least CO2 (187.58 g/km) and attains the highest CO2 rating (6.65), whereas Bugatti again performs poorly, with the highest CO2 emissions (538.83 g/km) and the worst CO2 rating (1.00). Considering smog, Figure 5 shows that Volkswagen has the best smog rating (6.45), and Bugatti seems to be the worst brand in terms of smog rating (1.00), fuel consumption, and CO2 emissions.

Figure 2. Total fuel consumption (L/100 km) of each brand.

Table 2.
Average data of different vehicle brands.

| Brand | Engine Size (L) | Cylinders | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) | CO2 Rating | Smog Rating |
|---|---|---|---|---|---|---|
| Honda | 2.01 | 4.35 | 8.03 | 187.58 | 6.65 | 4.65 |
| Mitsubishi | 1.88 | 3.85 | 8.32 | 193.63 | 6.29 | 5.38 |
| Mazda | 2.30 | 4.00 | 8.36 | 195.92 | 6.23 | 5.80 |
| Hyundai | 2.05 | 4.18 | 8.45 | 199.42 | 6.17 | 5.14 |
| FIAT | 1.51 | 4.00 | 8.47 | 198.37 | 6.11 | 4.69 |
| MINI | 1.81 | 3.62 | 8.61 | 201.56 | 5.86 | 6.13 |
| Kia | 2.25 | 4.43 | 8.80 | 207.89 | 5.94 | 5.09 |
| Volkswagen | 2.00 | 4.17 | 9.02 | 210.97 | 5.67 | 6.45 |
| Toyota | 2.83 | 4.92 | 9.17 | 214.58 | 5.87 | 5.48 |
| Subaru | 2.28 | 4.13 | 9.31 | 217.63 | 5.42 | 4.34 |
| Volvo | 2.00 | 4.00 | 9.54 | 222.70 | 5.14 | 5.44 |
| Acura | 2.96 | 5.21 | 9.72 | 227.62 | 5.06 | 4.40 |
| Buick | 2.34 | 4.57 | 9.74 | 228.64 | 5.05 | 5.30 |
| Alfa Romeo | 2.20 | 4.55 | 9.78 | 229.97 | 5.00 | 3.09 |
| Nissan | 2.92 | 5.10 | 9.90 | 232.59 | 5.17 | 4.99 |
| Lexus | 3.44 | 5.86 | 10.14 | 237.21 | 4.90 | 5.40 |
| Audi | 2.78 | 5.54 | 10.60 | 247.67 | 4.59 | 4.68 |
| Cadillac | 3.15 | 5.38 | 10.86 | 255.29 | 4.32 | 5.18 |
| Jaguar | 3.03 | 5.73 | 10.87 | 256.47 | 4.38 | 6.21 |
| Jeep | 2.93 | 5.05 | 10.90 | 254.74 | 4.36 | 4.67 |
| Infiniti | 3.27 | 5.78 | 10.97 | 257.67 | 4.25 | 4.13 |
| BMW | 3.19 | 6.15 | 11.10 | 260.01 | 4.31 | 4.50 |
| Porsche | 3.09 | 5.80 | 11.17 | 260.98 | 4.19 | 2.84 |
| Land Rover | 3.05 | 5.64 | 11.35 | 272.23 | 3.91 | 5.07 |
| Lincoln | 2.74 | 5.17 | 11.37 | 266.92 | 4.17 | 5.19 |
| Chrysler | 3.79 | 6.14 | 11.52 | 252.12 | 4.40 | 4.65 |
| Mercedes-Benz | 3.36 | 6.51 | 11.60 | 271.25 | 3.99 | 4.66 |
| Chevrolet | 3.73 | 5.98 | 11.77 | 268.15 | 4.19 | 4.47 |
| Genesis | 3.55 | 6.06 | 11.86 | 279.48 | 3.76 | 4.24 |
| Ford | 3.11 | 5.53 | 11.96 | 264.23 | 4.16 | 4.56 |
| Ram | 4.32 | 6.70 | 12.79 | 294.59 | 3.45 | 3.77 |
| GMC | 4.27 | 6.54 | 12.96 | 291.36 | 3.51 | 4.38 |
| Dodge | 4.97 | 7.06 | 13.06 | 295.52 | 3.35 | 2.99 |
| Maserati | 3.35 | 6.65 | 13.55 | 317.29 | 2.77 | 2.04 |
| Aston Martin | 4.98 | 10.46 | 13.63 | 320.50 | 2.96 | 3.58 |
| Bentley | 5.39 | 9.94 | 15.48 | 361.67 | 2.00 | 3.30 |
| Rolls-Royce | 6.65 | 12.00 | 16.72 | 390.95 | 1.03 | 3.62 |
| Lamborghini | 5.64 | 10.67 | 17.65 | 410.79 | 1.54 | 1.77 |
| Bugatti | 8.00 | 16.00 | 22.98 | 538.83 | 1.00 | 1.00 |

Figure 3. CO2 emissions (g/km) of each brand.

Figure 4. CO2 rating of each brand.

Figure 5. Smog rating of each brand.
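The brand-level averages in Table 2 come from grouping the vehicle records by brand and averaging each numeric column. A minimal sketch of that aggregation, using only the Python standard library, might look as follows. The three records and the field names (`brand`, `fuel_l_100km`, `co2_g_km`) are made up for illustration and are not rows or column names from the dataset.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records; values are illustrative, not taken from the dataset.
vehicles = [
    {"brand": "A", "fuel_l_100km": 8.0, "co2_g_km": 186.0},
    {"brand": "A", "fuel_l_100km": 9.0, "co2_g_km": 210.0},
    {"brand": "B", "fuel_l_100km": 22.0, "co2_g_km": 520.0},
]


def brand_averages(rows, fields):
    """Return {brand: {field: mean}} for the requested numeric fields."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["brand"]].append(row)
    return {
        brand: {f: fmean(r[f] for r in members) for f in fields}
        for brand, members in grouped.items()
    }


averages = brand_averages(vehicles, ["fuel_l_100km", "co2_g_km"])
# Rank brands from lowest to highest average fuel consumption.
ranking = sorted(averages, key=lambda b: averages[b]["fuel_l_100km"])
print(ranking, averages)
```

Sorting the resulting dictionary by one of the averaged fields gives the kind of ordering used in Table 2 and Figures 2 to 5.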
Regarding fuel consumption and CO2 emissions of different models, Table 3 shows that the IONIQ BLUE model consumes and emits the least, while the CHIRON PUR SPORT model consumes and emits the most. Similarly, when considering fuel consumption and CO2 emissions, Tables 4–8 show that the Station wagon (Small) class, Engine Size 1.2 L, 3 Cylinders, Transmission Type AV1, and Fuel Type D (Diesel) consume fuel and emit CO2 the least. Conversely, the Van (Passenger) class, Engine Size 8.0 L, 16 Cylinders, Transmission Type A7, and Fuel Type E (Ethanol E85) seem to be the largest consumers and emitters. However, since the Volkswagen emissions scandal emerged, the negative image of diesel has intensified. The actual NOx and PM emissions of diesel vehicles, according to recent research, are significantly greater than those reported. Because of carcinogenic compounds, diesel particle emissions are also a possible health danger [38]. Therefore, the conclusion that Ethanol E85 emits the most among the fuel types holds only within the scope of the data in this research.

Table 3. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each model.

| Model | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| IONIQ BLUE | 4.08 | 95.60 |
| IONIQ | 4.28 | 101.40 |
| PRIUS | 4.48 | 105.40 |
| ... | ... | ... |
| AVENTADOR COUPE SVJ | 22.40 | 520.00 |
| DIVO | 23.00 | 537.00 |
| CHIRON PUR SPORT | 26.10 | 608.00 |

Table 4. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each vehicle class.

| Vehicle Class | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| Station wagon: Small | 8.25 | 193.85 |
| Compact | 9.22 | 215.69 |
| Mid-size | 9.55 | 223.49 |
| SUV: Small | 10.01 | 233.65 |
| Minicompact | 10.35 | 242.16 |
| Subcompact | 10.64 | 248.95 |
| Special purpose vehicle | 10.77 | 236.90 |
| Station wagon: Mid-size | 10.86 | 254.41 |
| Full-size | 11.16 | 256.36 |
| Minivan | 11.30 | 257.98 |
| Pickup truck: Small | 11.66 | 281.61 |
| Two-seater | 12.45 | 291.33 |
| SUV: Standard | 13.25 | 303.00 |
| Pickup truck: Standard | 13.48 | 300.05 |
| Van: Passenger | 16.98 | 362.63 |

Table 5.
CO2 emissions (g/km) and total fuel consumption (L/100 km) of each engine size.

| Engine Size (L) | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| 1.2 | 6.66 | 155.11 |
| 1.6 | 7.38 | 176.19 |
| 1.8 | 7.61 | 178.19 |
| ... | ... | ... |
| 6.8 | 18.62 | 434.40 |
| 6.5 | 20.62 | 478.25 |
| 8.0 | 22.98 | 538.83 |

Table 6. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each cylinder.

| Cylinders | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| 3 | 7.78 | 181.78 |
| 4 | 8.85 | 207.12 |
| 5 | 10.37 | 242.43 |
| 6 | 11.49 | 265.59 |
| 8 | 14.00 | 318.05 |
| 10 | 15.09 | 353.19 |
| 12 | 16.60 | 388.24 |
| 16 | 22.98 | 538.83 |

Table 7. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each transmission type.

| Transmission | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| AV1 | 6.82 | 161.50 |
| AV | 7.13 | 167.14 |
| AM6 | 7.35 | 171.33 |
| AV10 | 7.75 | 181.29 |
| AV6 | 8.02 | 187.15 |
| M5 | 8.23 | 191.55 |
| AV7 | 8.29 | 194.37 |
| A4 | 9.05 | 212.50 |
| AV8 | 9.05 | 211.49 |
| M6 | 9.95 | 233.09 |
| AS6 | 10.39 | 237.62 |
| AS9 | 10.57 | 247.82 |
| A9 | 10.87 | 253.26 |
| AM9 | 11.00 | 259.75 |
| AS8 | 11.13 | 260.67 |
| AM8 | 11.18 | 261.78 |
| M7 | 11.32 | 264.73 |
| AM7 | 11.33 | 265.04 |
| AS7 | 12.08 | 282.10 |
| AS10 | 12.31 | 277.96 |
| A8 | 12.35 | 286.17 |
| A10 | 12.60 | 304.13 |
| A5 | 12.95 | 295.37 |
| AS5 | 13.11 | 305.64 |
| A6 | 13.15 | 288.23 |
| A7 | 13.26 | 310.85 |

Table 8. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each fuel type.

| Fuel Type | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| D (Diesel) | 9.32 | 250.52 |
| X (Regular gasoline) | 9.98 | 234.05 |
| Z (Premium gasoline) | 11.47 | 268.38 |
| E (Ethanol E85) | 16.62 | 275.43 |

Secondly, to answer RQ1.2 (How have patterns of consumption and emission of each vehicle type changed throughout the selected period?), descriptive statistics have been computed for total CO2 emissions and fuel consumption over the period 2017 to 2021, as shown in Table 9. It can be seen from Table 9 that the total fuel consumption gradually increases from 2017 to 2020, before a significant drop in 2021. However, the peak in 2020 does not exist in the CO2 emissions, and the value steadily rises over the entire period.

Table 9.
CO2 emissions (g/km) and total fuel consumption (L/100 km) over time.

| Model (Year) | Total Fuel Consumption (L/100 km) | CO2 Emissions (g/km) |
|---|---|---|
| 2017 | 10.87 | 250.02 |
| 2018 | 10.85 | 250.04 |
| 2019 | 10.86 | 251.17 |
| 2020 | 10.90 | 253.10 |
| 2021 | 10.84 | 253.48 |

From Table 10, a similar pattern of a gradual increase from 2017 to 2020 followed by a significant drop can be seen in engine size, cylinders, fuel consumption in the city, and total fuel consumption. Highway fuel consumption, total fuel consumption (mpg), and CO2 emissions show a continuous rise over the years, which could explain the gradual decrease in CO2 rating during the period. Finally, the smog rating drops sharply in 2018, before growing continuously until 2021.

Table 10. Average feature data over time.

| Model (Year) | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|
| Engine Size (L) | 3.11 | 3.11 | 3.10 | 3.16 | 3.12 |
| Cylinders | 5.54 | 5.60 | 5.59 | 5.67 | 5.60 |
| Fuel Consumption in City (L/100 km) | 12.42 | 12.36 | 12.37 | 12.38 | 12.27 |
| Fuel Consumption in Highway (L/100 km) | 8.98 | 8.99 | 9.03 | 9.10 | 9.10 |
| Total Fuel Consumption (L/100 km) | 10.87 | 10.85 | 10.86 | 10.90 | 10.84 |
| Total Fuel Consumption (mpg) | 27.67 | 27.65 | 27.66 | 27.63 | 27.86 |
| CO2 Emissions (g/km) | 250.02 | 250.04 | 251.17 | 253.10 | 253.48 |
| CO2 Rating | 4.83 | 4.57 | 4.56 | 4.53 | 4.48 |
| Smog Rating | 6.04 | 3.78 | 4.14 | 4.52 | 4.72 |

In this research, Honda emerges as the greenest brand, so it is essential to analyze its pattern of consumption and emissions through the years. From Figure 6, in 2018, Honda seems to have optimized the fuel consumption and carbon dioxide emissions of its products. Although the data for 2019 and 2020 show a slight increase, the values drop sharply again in 2021.

Figure 6. CO2 emissions (g/km) and total fuel consumption (L/100 km) of Honda over time.
Applying the same analysis to Bugatti, the brand with the poorest environmental performance in this dataset, shows no sign that the consumption and emissions of its products have been optimized: total fuel consumption and CO2 emissions grow significantly, as shown in Figure 7.

Figure 7. CO2 emissions (g/km) and total fuel consumption (L/100 km) of Bugatti over time.

Considering the fuel consumption of each fuel type over the years, it can be seen from Figure 8 that Fuel Type E (Ethanol E85) and Z (Premium gasoline) always consume more than Fuel Type X (Regular gasoline) and D (Diesel). Over the period, Fuel Type D (Diesel), E (Ethanol E85), and Z (Premium gasoline) have all increased their consumption, whereas Fuel Type X (Regular gasoline) shows a slight decrease and thus has the lowest fuel usage in 2021.

Figure 8. Total fuel consumption (L/100 km) of each fuel type over time.

4.2. Level 2: Inferential Statistics

4.2.1. t-Test

To address RQ2.1 (Is there any particular distribution for fuel consumption in the city and the highway of vehicles in Canada?), a two-tailed t-test has been conducted to compare the means of fuel consumption in the city and on the highway for the same vehicle, with the following configurations.

• Null Hypothesis (H0): mean of fuel consumption in the city = mean of fuel consumption on the highway;
• Alternative Hypothesis (Ha): mean of fuel consumption in the city ≠ mean of fuel consumption on the highway;
• Chosen confidence level: 99%, which means α = 0.01.

After the test, the result showed that:

• Statistic = 149.8128 (t-value);
• p-value = 0.0.

It is clear that:

$$p\text{-value} = 0.0 < \alpha/2 = 0.005 \quad (1)$$

Therefore, the null hypothesis can be rejected. This means that the mean fuel consumption in the city and the mean fuel consumption on the highway for the same vehicle differ significantly.

4.2.2.
ANOVA

To answer RQ2.2 (Is there a notable difference in the performance of one specific fuel type (or vehicle type) in comparison to the rest of the vehicle types in Canada?), a one-way ANOVA one-tailed test was implemented to compare the means of each vehicle class in terms of total fuel consumption, using the following assumptions.

• The samples are independent;
• Each sample comes from a population that is normally distributed;
• The group population standard deviations are all equal (homoscedasticity).

Firstly, the mean total fuel consumption of each class over the years is calculated using the descriptive statistics method, as shown in Figure 9. The following configurations have been set out.

• Null Hypothesis (H0): the means of all vehicle classes are the same;
• Alternative Hypothesis (Ha): at least one class mean is not equal to the others;
• Chosen confidence level: 99%, which means α = 0.01.

Figure 9. Total fuel consumption distribution of vehicle classes over time.

After the test, the result showed that:

$$p\text{-value} = 2.3552 \times 10^{-27} < \alpha = 0.01 \quad (2)$$

Therefore, the null hypothesis can be rejected, meaning that at least one mean of total fuel consumption for each vehicle class is significantly different from the rest.

Similarly, using the same assumptions, hypotheses, and confidence level, one-way ANOVA one-tailed tests have been conducted on the CO2 emissions of each vehicle class (Figure 10) and on the fuel consumption and CO2 emissions of each fuel type (Figure 11), and each result is presented as follows.

Figure 10. CO2 emissions of each vehicle class over time.

$$p\text{-value} = 6.81894 \times 10^{-27} < \alpha = 0.01 \quad (3)$$

Consequently, the null hypothesis can be rejected, meaning that at least one mean of CO2 emissions for each vehicle class is significantly different from the rest.

Figure 11. Total fuel consumption and emissions of each fuel type over time.
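The F statistic behind a one-way ANOVA compares the variation between group means with the variation inside each group. The following is a minimal sketch of that computation in plain Python; the three samples are hypothetical and are not taken from the dataset, and the function name `one_way_anova_f` is introduced here only for illustration. In practice a library routine such as `scipy.stats.f_oneway` would typically be used, since it also returns the p-value.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical fuel consumption samples (L/100 km) for three vehicle classes.
classes = [[8.0, 8.5, 9.0], [10.0, 10.5, 11.0], [16.0, 17.0, 18.0]]
f_stat, df_b, df_w = one_way_anova_f(classes)
print(f_stat, df_b, df_w)
```

A large F value, as in this toy example, means the group means differ far more than the within-group scatter would explain, which corresponds to a very small p-value and a rejected null hypothesis.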
Total fuel consumption of each fuel type over time:

$$p\text{-value} = 1.3362 \times 10^{-13} < \alpha = 0.01 \quad (4)$$

Therefore, the null hypothesis can be rejected, meaning that at least one mean of total fuel consumption for each fuel type is significantly different from the rest.

Emissions of each fuel type over time:

$$p\text{-value} = 5.5127 \times 10^{-5} < \alpha = 0.01 \quad (5)$$

From that comparison, the null hypothesis can be rejected, meaning that at least one mean of CO2 emissions for each fuel type is significantly different from the rest.

4.2.3. Correlation

To determine the strength of the relationship between two features in the dataset and address RQ2.3 (How do the brand, model, vehicle class, cylinders, engine size, transmission type, and fuel type correlate with the emissions and consumption of various vehicles?), a correlation algorithm has been introduced to generate correlation coefficients. The most commonly used algorithm of this type in statistics is Pearson correlation, which estimates the direction and strength of a linear relationship between two variables [39]. In this study, the objective of this statistic is to determine which parameter has the strongest correlation with total fuel consumption and CO2 emissions. To achieve this, Pearson's correlation coefficients have been computed between all features across all vehicles and presented in the correlation heat map shown in Figure 12.

In the heat map in Figure 12, each cell shows the correlation coefficient between the corresponding parameter on the left and the corresponding parameter at the bottom. The higher the correlation coefficient, the warmer the color. Moreover, Figures 13 and 14 below use bar charts to reveal the importance of all features in estimating total fuel consumption and CO2 emissions.

Figure 12. Heatmap of correlation between all dataset parameters.
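Pearson's coefficient divides the covariance of two variables by the product of their standard deviations, which gives a value between -1 and 1. A minimal sketch of that calculation in plain Python follows; the engine sizes and emission values are hypothetical and only illustrate the direction of the relationship, and `pearson_r` is a helper name introduced for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical engine sizes (L) and CO2 emissions (g/km), not from the dataset.
engine_size = [1.5, 2.0, 3.0, 5.0]
co2 = [180.0, 200.0, 260.0, 380.0]
r = pearson_r(engine_size, co2)
print(round(r, 3))
```

Computing this coefficient for every pair of numeric features yields the matrix that a heat map such as Figure 12 visualizes.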
It is seen from Figures 13 and 14 that, besides fuel consumption on the highway and in the city (the two most important features), engine size gives the highest correlation for estimating total fuel consumption, whereas cylinders, year, and smog rating are nearly half as important as engine size. For estimating carbon dioxide emissions, engine size, year, and smog rating are important features. This finding is an influential factor in building the Machine Learning and Deep Learning models presented in Levels 3 and 4.

Figure 13. Importance of features on predicting total fuel consumption.

Figure 14. Importance of features on predicting CO2 emissions.

4.2.4. Chi-Square

Chi-Square is a non-parametric test, which is divided into two different types: Chi-Square Goodness of Fit and Chi-Square of Independence. The purpose of Chi-Square Goodness of Fit is to compare the observed and expected values of one categorical variable. Meanwhile, Chi-Square of Independence, also known as the Chi-Square Test of Association, determines whether there is an association among categorical variables, that is, whether the variables are related or independent [40].

To implement the Chi-Square Goodness of Fit test, the data from 2017 to 2020 are used as the expected values and tested against the observed 2021 data to check whether there is a significant difference between the observed and expected values. First, the Chi-Square Goodness of Fit Test is applied to compare the Total Fuel Consumption by Vehicle Class between expected (from 2017 to 2020) and observed (2021) values using a confidence level of 98% (α = 0.02), and the results attained are discussed below.

• Chi-Square value: 0.5317;
• p-value: 0.4659.

It can be seen that:

$$p\text{-value} = 0.47 > \alpha = 0.02 \quad (6)$$

Therefore, the null hypothesis cannot be rejected, meaning that there is no significant difference between the observed and expected values.
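The goodness-of-fit statistic sums the squared differences between observed and expected values, each divided by the expected value, and the p-value is the upper-tail probability of that statistic under a chi-square distribution. The sketch below is an illustration only: the text does not state the degrees of freedom, so the p-value helper assumes one degree of freedom, for which the upper tail has the closed form erfc(sqrt(χ²/2)). Under that assumption, the reported statistic of 0.5317 gives a p-value of about 0.466, in line with the reported 0.4659. The counts passed to `chi_square_statistic` are hypothetical.

```python
from math import erfc, sqrt


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Assumption: one degree of freedom (not stated in the text).
    """
    return erfc(sqrt(chi2 / 2.0))


# Hypothetical observed and expected values for two categories.
stat = chi_square_statistic([10, 20], [15, 15])

# Statistic reported in the text for Total Fuel Consumption by Vehicle Class.
p_class = p_value_df1(0.5317)
alpha = 0.02
print(round(stat, 4), round(p_class, 4), p_class > alpha)
```

For tests with more than one degree of freedom, a general routine such as `scipy.stats.chisquare` would be the usual choice, since the closed form above holds only for a single degree of freedom.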
A similar Chi-Square Goodness of Fit Test is conducted to compare Total Fuel Consumption by Fuel Type between expected (from 2017 to 2020) and observed (2021) values, with the following outputs.

• Chi-Square value: 6.3380;
• p-value: 0.0118.

$$p\text{-value} = 0.012 < \alpha = 0.02 \quad (7)$$

Therefore, the null hypothesis can be rejected, meaning that there is a significant difference between the observed and expected values.

Next, to address RQ2.4 (What are the relationships between all features to each other of the entire dataset?), the Chi-Square of Independence Test was conducted to ascertain whether there is a relationship between fuel type and CO2 rating, and the results are as follows.

• Chi-Square value: 765.5951;
• p-value: 6.6296 × 10^{-144};
• Degrees of freedom: 27.

It is seen that:

$$p\text{-value} = 6.63 \times 10^{-144} < \alpha = 0.02 \quad (8)$$

With the chosen confidence level of 98%, the null hypothesis is rejected, and there is a relationship between fuel type and CO2 rating.

A chain of similar Chi-Square of Independence tests has also been implemented to determine the relationships amongst all features, and the results are presented in the heat map shown in Figure 15. In the heat map, each test result is indicated as 1 if there is a relationship between the corresponding parameter on the left and the corresponding parameter at the bottom, and as 0 if there is no relationship between them. It reveals that there is some form of relationship amongst almost all features, except that there is no relationship between year and model, cylinders, and total fuel consumption (mpg). Through this test, it is concluded that all the features from the chosen dataset can be used for the prediction models proposed in Levels 3 and 4, and year can be used as a time index for the estimation.

Figure 15. Heat map for Chi-square of Independence tests between all features.

4.3. Level 3: Machine Learning

4.3.1.
Time Series Regression

This subsection aims to answer RQ3.1 (Can fuel consumption and carbon dioxide emission data and other input metrics be utilized to predict outputs in upcoming years in Canada?). To determine which Machine Learning models can be used for predicting fuel consumption and carbon dioxide emissions, different experiments were conducted, as presented below.

Firstly, all the input features from the dataset are used to calculate their mean values over time, as shown in Figure 16.

Figure 16. All input metrics over time.

Secondly, using the correlation results from Section 4.2.3, this study builds the following models to predict the fuel consumption (in city, highway, and total) and CO2 emissions of an average vehicle in Canada in the four upcoming years.

• Persistence models (using walk-forward validation);
• Autoregression models (using the autoregression function from statsmodels);
• Optimized autoregression model (using walk-forward over time steps).

The prediction results of these models are presented in Figure 17 and Table 11.

Table 11. Root Mean Square Error (RMSE) of different regression models.

| Metric | Persistence Model | Autoregression Model | Optimized Autoregression Model |
|---|---|---|---|
| Total Fuel Consumption | 0.002 | 0.026 | 0.026 |
| Fuel Consumption in City | 0.004 | 0.045 | 0.044 |
| Fuel Consumption in Highway | 0.002 | 0.097 | 0.068 |
| CO2 Emission | 1.287 | 3.412 | 2.178 |

It can be observed from Table 11 that the autoregression model always has the highest RMSE. The optimized autoregression model has lower values, while the persistence model has the lowest values. The persistence model predicts that total fuel consumption and CO2 emissions will increase in the next four years. However, fuel consumption in the city is projected to decline, while highway fuel consumption is expected to grow steadily.
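A persistence model simply forecasts each value as the previous observation, and walk-forward validation reveals one true value at a time before the next forecast. The sketch below illustrates this on the yearly average CO2 emissions from Table 9. It evaluates over the whole five-point series, which is an assumption: the train/test split used in the study is not stated, so the result need not match Table 11. On these rounded values the mean squared error is about 1.287 and its square root about 1.134.

```python
from math import sqrt


def persistence_walk_forward(series):
    """Forecast each value as the previous observation; return (predictions, rmse)."""
    history = [series[0]]
    predictions = []
    squared_errors = []
    for actual in series[1:]:
        forecast = history[-1]  # persistence: the next value equals the last one seen
        predictions.append(forecast)
        squared_errors.append((actual - forecast) ** 2)
        history.append(actual)  # walk forward: reveal the true value
    rmse = sqrt(sum(squared_errors) / len(squared_errors))
    return predictions, rmse


# Yearly average CO2 emissions (g/km) for 2017 to 2021, as listed in Table 9.
co2_by_year = [250.02, 250.04, 251.17, 253.10, 253.48]
predictions, rmse = persistence_walk_forward(co2_by_year)
print(predictions, round(rmse, 3))
```

The same loop can serve as a baseline for any of the yearly metrics, so that more elaborate models are only preferred when they beat it.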
The remaining Machine Learning models have been constructed to answer RQ3.2 (Is it possible to build Machine Learning models that use vehicle specifications data to predict their fuel consumption and carbon dioxide emission?).

Figure 17. Prediction results of different regression models.

4.3.2. Linear Regression and Univariate Polynomial Regression

These methodologies have been applied to build models that predict total CO2 emissions and fuel consumption of vehicles from a single input (engine size, the number of cylinders, etc.), and the results are presented in Table 12 and Figure 18. The coefficient of determination ranges from 0 to 1, from worst to perfect prediction.

Table 12. Coefficient of determination (R squared) values of Linear Regression and Univariate Polynomial Regression models.

| Predictor | Target | Linear Regression | Univariate Polynomial Regression Degree 1 | Degree 2 | Degree 3 | Degree 4 | Degree 5 |
|---|---|---|---|---|---|---|---|
| Engine Size | Total Fuel Consumption (L/100 km) | 0.67694 | 0.67670 | 0.68466 | 0.68611 | 0.69022 | 0.69038 |
| Cylinders | Total Fuel Consumption (L/100 km) | 0.66161 | 0.64166 | 0.65108 | 0.65165 | 0.65595 | 0.65596 |
| Fuel Consumption in City | Total Fuel Consumption (L/100 km) | 0.98443 | 0.98606 | 0.98606 | 0.98624 | 0.98626 | 0.98626 |
| Fuel Consumption in Highway | Total Fuel Consumption (L/100 km) | 0.94780 | 0.94710 | 0.94778 | 0.94783 | 0.94790 | 0.94794 |
| CO2 Emissions | Total Fuel Consumption (L/100 km) | 0.89053 | 0.88828 | 0.88851 | 0.88859 | 0.88894 | 0.88894 |
| Engine Size | CO2 Emissions (g/km) | 0.72950 | 0.70852 | 0.71446 | 0.72162 | 0.72480 | 0.72552 |
| Cylinders | CO2 Emissions (g/km) | 0.67752 | 0.69280 | 0.69839 | 0.69962 | 0.70195 | 0.70195 |
| Fuel Consumption in City | CO2 Emissions (g/km) | 0.88922 | 0.88654 | 0.89650 | 0.89724 | 0.90846 | 0.90886 |
| Fuel Consumption in Highway | CO2 Emissions (g/km) | 0.82471 | 0.82107 | 0.84835 | 0.84839 | 0.85369 | 0.85448 |
| Total Fuel Consumption | CO2 Emissions (g/km) | 0.88753 | 0.88828 | 0.90243 | 0.90289 | 0.91193 | 0.91215 |

It can be seen from Table 12 that the Univariate Polynomial Regression Degree 5 model achieves the highest coefficient of determination (R squared) in 7 out of 10 scenarios. Being
insignificantly different from it, the Linear Regression model attains almost the same R squared value and, at the same time, obtains the highest value in 3 out of 10 scenarios.

4.3.3. Multiple Linear Regression, Logarithmic Regression, Multivariate Polynomial Regression, Transformation of Data, and Exponential Regression

These models are selected to estimate total CO2 emissions and fuel consumption of vehicles from multiple inputs, and the results are presented in Table 13. Table 13 shows that in 3 out of 5 cases, the Multiple Linear Regression model has the largest coefficient of determination (R squared). While not significantly different from it, the Multivariate Polynomial Regression model comes close to attaining the same R squared value and also achieves the best score in 2 out of 5 scenarios (at Degree 2 and 5). On the other hand, the Logarithmic Regression with Log Transformation model receives lower determination scores in all scenarios. Notably, the Logarithmic Regression with Exponential Transformation model generates negative R squared values in all cases, implying that the model fits the data worse than a constant prediction at the mean of the target.

Figure 18. Scatterplot of prediction outputs of Linear Regression model in different scenarios.

In this subsection, different Machine Learning models are applied to estimate fuel consumption and carbon dioxide emissions from vehicle specifications data. Linear Regression, Multiple Linear Regression, Univariate Polynomial Regression, and Multivariate Polynomial Regression are found to be highly promising in this field, which answers research question RQ3.2.

Table 13. Coefficient of determination (R squared) values of Multiple Linear Regression, Logarithmic Regression, Multivariate Polynomial Regression, Transformation of data, and Exponential Regression models.
| Predictor | Target | Multiple Linear Regression | Logarithmic Regression: Log Transformation | Logarithmic Regression: Exponential Transformation | Multivariate Polynomial Regression Degree 1 | Degree 2 | Degree 3 | Degree 4 | Degree 5 |
|---|---|---|---|---|---|---|---|---|---|
| Model (Year) + Engine Size (L) + Cylinders | Total Fuel Consumption (L/100 km) | 0.68184 | 0.61418 | −0.31802 | 0.68658 | 0.69331 | 0.69174 | 0.70389 | 0.67582 |
| Engine Size (L) + Cylinders | Total Fuel Consumption (L/100 km) | 0.71549 | 0.62154 | −0.31802 | 0.68728 | 0.69041 | 0.69018 | 0.70343 | 0.71083 |
| Fuel Consumption in City (L/100 km) + Fuel Consumption in Highway (L/100 km) | Total Fuel Consumption (L/100 km) | 0.99968 | 0.55998 | −0.31802 | 0.99968 | 0.99968 | 0.99968 | 0.99968 | 0.99968 |
| Model (Year) + Engine Size (L) + Cylinders | CO2 Emissions (g/km) | 0.74119 | 0.49410 | −0.04007 | 0.71355 | 0.71902 | 0.72576 | 0.72994 | 0.70450 |
| Engine Size (L) + Cylinders | CO2 Emissions (g/km) | 0.73955 | 0.42943 | −0.04007 | 0.71247 | 0.71506 | 0.72388 | 0.72922 | 0.73300 |

4.4. Level 4: Deep Learning

Convolutional Neural Network

To address RQ4.1, a Convolutional Neural Network (CNN) [41,42] has been employed in this study to estimate the total CO2 emissions and fuel consumption of vehicles from multiple inputs. CNN is a form of deep neural network that is often used to explore visual imagery [37,43]. The deep learning model has been built using Google Colab, and the results are presented in Figure 19 and Table 14.

Table 14. Coefficient of determination (R squared) values of Convolutional Neural Network.

| Predictor | Target | Convolutional Neural Network |
|---|---|---|
| Model (Year) + Engine Size (L) + Cylinders | Total Fuel Consumption (L/100 km) | 0.70061 |
| Engine Size (L) + Cylinders | Total Fuel Consumption (L/100 km) | 0.69482 |
| Fuel Consumption in City (L/100 km) + Fuel Consumption in Highway (L/100 km) | Total Fuel Consumption (L/100 km) | 0.99964 |
| Model (Year) + Engine Size (L) + Cylinders | CO2 Emissions (g/km) | 0.68912 |
| Engine Size (L) + Cylinders | CO2 Emissions (g/km) | 0.71746 |

It can be seen from Table 14 that the CNN model delivers stable and high coefficient of determination values in all scenarios. Compared with Table 13, the CNN model does not reach the highest R squared score in any scenario, but it comes close to it while predicting stably.
Moreover, Figure 19 demonstrates that the CNN model could predict with high accuracy.

Figure 19. Scatterplot of prediction outputs of Convolutional Neural Network model in different scenarios.

5. Recommendations

Through a series of rigorous data analyses, the study has presented the current trends and a comparative analysis of fuel consumption and carbon dioxide emissions across different brands and vehicle features. A list of recommendations for customers who currently wish to buy new vehicles is as follows:

• Fuel-saving and environmentally friendly brands: Honda, Mitsubishi, Mazda, FIAT, Hyundai, MINI, Kia, and Volkswagen;
• Lowest smog-emitting brands: Volkswagen, Jaguar, MINI, Mazda, Toyota, Volvo, and Lexus.

Conversely, environmentally conscious customers ought to reconsider the following brands:

• Brands with high fuel consumption and CO2 emissions: Bugatti, Lamborghini, Rolls-Royce, Bentley, Aston Martin, Maserati, and Dodge;
• Brands with high smog emissions: Bugatti, Lamborghini, Maserati, Porsche, Dodge, Alfa Romeo, and Bentley.

Recommendations for both vehicle producers and customers who strive to be green in their products are as follows:

• Engine models: IONIQ Blue, IONIQ, Prius, Corolla Hybrid, and Niro FE;
• Suggested vehicle classes: Station wagon (Small), Compact, Mid-size, and SUV (Small);
• For engine size and cylinders, the smaller, the better for fuel consumption and CO2 emissions;
• Suggested transmission types: AV1, AV, AM6, AV10, and AV6;
• Regarding fuel type, it is recommended to use fuel types D (Diesel) and X (Regular gasoline).

Due reconsideration has to be made regarding the following products in terms of their negative environmental impacts:
• Engine models: Chiron PUR Sport, Divo, Aventador Coupe S, Aventador Coupe SVJ, and Aventador Roadster S;
• Vehicle classes that have high fuel consumption and CO2 emissions: Van (Passenger), Pickup truck (Standard), and SUV (Standard);
• For engine size, the bigger, the worse for fuel consumption and CO2 emissions;
• Transmission types that are not recommended: A7, AS5, A10, A5, A6, and A8;
• Regarding fuel type, it is not recommended to use fuel types Z (Premium gasoline) and E (Ethanol E85).

From the findings of our in-depth statistics and the analysis of different Machine Learning and Deep Learning models, there are several evidence-based recommendations. First, it is possible to use engine size and the number of cylinders to estimate the CO2 emissions and fuel consumption of future vehicle designs, with a relatively high coefficient of determination of around 70%. Moreover, fuel consumption and CO2 emission data can be used to predict each other, with very high accuracy in most cases, up to 91.22%. Secondly, different Machine Learning models, including Linear Regression, Multiple Linear Regression, Univariate Polynomial Regression, and Multivariate Polynomial Regression, have the potential to predict the CO2 emissions and fuel consumption of light-duty vehicles. However, it is suggested to apply a Convolutional Neural Network for the prediction, which has been shown to predict stably with a relatively high accuracy of around 70%.

Prediction results from the Machine Learning and Deep Learning models in this paper can be used as an index and a reference for relevant predictors, which different stakeholders can use in their upcoming actions. Moreover, the models can be applied to other air pollutants in vehicle exhaust, including CO, NOx, SO2, PM, etc.

6.
Conclusions and Future Work

In this research, an observational and predictive analysis has been performed using data from the Government of Canada, which includes 4973 light-duty vehicles observed between 2017 and 2021, to provide a comparative view of various brands and vehicle types in terms of fuel consumption and CO2 emissions before making applicable recommendations. While significant efforts have been made in the past [10,19,27], this research analyzes different vehicle types and brands using vehicle measurements, providing a deeper understanding of the vehicle market and its environmental effects. The proposed vehicle features and recommended prediction models in this study can further be used as a reference for vehicle manufacturers and users to take relevant actions for reducing their environmental impacts.

By using descriptive and inferential statistics methodologies, it is observed that the average total fuel consumption of light-duty vehicles is 10.86 L/100 km, and the average CO2 emission is 251.44 g/km. Different brands and vehicle features have been included in a rigorous and comprehensive analysis. Based on the findings, relevant recommendations have been made. Over the study period, some vehicle brands have been working towards optimizing their products with environmental awareness (such as Honda), while others have been doing the opposite (including Bugatti).

Moreover, different machine learning and deep learning models have been built throughout this study for fuel consumption and CO2 emission prediction. Firstly, this study reveals that the Persistence model has outperformed the autoregression and optimized autoregression models for predictions from one input variable with vector autoregression. Additionally, the Univariate Polynomial Regression model (degree 5) attains a higher coefficient of determination than the same model with lower degrees and the Linear Regression model.
Secondly, for estimating the total fuel consumption and CO2 emissions of vehicles from multiple inputs, Multiple Linear Regression and Multivariate Polynomial Regression have been demonstrated to be the best models, compared to Logarithmic Regression (with Log and Exponential Transformation). Finally, it should be noted that the Convolutional Neural Network is also promising for prediction in this field, with stable and high coverage of correctly predicted values.

Future research may be geared towards developing higher performance models for predicting fuel consumption and CO2 emissions. Moreover, a larger dataset with more vehicle features should be studied for building a predictive model in vehicle design. Based on that, APIs and applications can be designed and constructed for predictions. Finally, vehicle consumers and producers can adopt the recommendations from the findings of this study to design and implement appropriate action plans for reducing their environmental impacts.

Author Contributions: N.L.H.H. and A.-L.K. contributed to conceptualization, software, validation, resources, and methodology; N.L.H.H. contributed to formal analysis, investigation, data curation, writing (original draft preparation), and visualization; A.-L.K. contributed to writing (review and editing), project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding: This research and the APC were funded by European Commission grant numbers 612462-EPP-1-2019-1-SK-EPPKA2-KA and 610619-EPP-1-2019-1-FR-EPPKA1-JMD-MOB.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data analyzed in this paper can be found at https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64 (accessed on 30 November 2021).

Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations

The following abbreviations are used in this manuscript:

ANOVA          Analysis of variance
BP             Backpropagation
CMEM           Comprehensive Modal Emissions Model
CNN            Convolutional Neural Network
CO             Carbon Monoxide
CO2            Carbon Dioxide
EMIT           Emissions from Traffic
EU             European Union
Fuel Type D    Diesel
Fuel Type E    Ethanol (E85)
Fuel Type N    Natural gas
Fuel Type Z    Premium gasoline
Fuel Type X    Regular gasoline
GHG            Greenhouse Gases
H0             Null Hypothesis
Ha             Alternative Hypothesis
HC             Hydrocarbon
MEASURE        Mobile Emission Assessment System for Urban and Regional Evaluation
MOVES          Motor Vehicle Emission Simulator
NOx            Nitrogen Oxides
OBD            On-Board Diagnostic
RMSE           Root Mean Square Error
RO             Research Objective
RQ             Research Question
SVR            Support Vector Regression
US             United States

References

1. De Vos, J.; Cheng, L.; Kamruzzaman, M.; Witlox, F. The indirect effect of the built environment on travel mode choice: A focus on recent movers. J. Transp. Geogr. 2021, 91, 102983. [CrossRef]
2. Straka, W.; Kondragunta, S.; Wei, Z.; Zhang, H.; Miller, S.D.; Watts, A. Examining the economic and environmental impacts of covid-19 using earth observation data. Remote Sens. 2021, 13, 5. [CrossRef]
3. Intergovernmental Panel on Climate Change. The Fifth Assessment Report of IPCC; IPCC: Geneva, Switzerland, 2019.
4. European Environment Agency. Final Energy Consumption by Sector and Fuel; European Environment Agency: Brussels, Belgium, 2015.
5. Yang, Z.; Bandivadekar, A. Light-Duty Vehicle Greenhouse Gas and Fuel Economy Standards; International Council on Clean Transportation: Washington, DC, USA, 2017; p. 16.
6. Guensler, R. Data Needs for Evolving Motor Vehicle Emission Modeling Approaches; The University of California Transportation Center: Berkeley, CA, USA, 1993; pp. 167–228.
7. Qi, Y.G.; Teng, H.H.; Yu, L. Microscale emission models incorporating acceleration and deceleration. J. Transp. Eng. 2004, 130, 348–359. [CrossRef]
8. Kan, Z.; Tang, L.; Kwan, M.P.; Zhang, X. Estimating vehicle fuel consumption and emissions using GPS big data. Int. J. Environ. Res. 2018, 15, 566. [CrossRef] [PubMed]
9. Zhao, Q.; Chen, Q.; Wang, L. Real-Time Prediction of Fuel Consumption Based on Digital Map API. Appl. Sci. 2019, 9, 1369. [CrossRef]
10. Yao, Y.; Zhao, X.; Liu, C.; Rong, J.; Zhang, Y.; Dong, Z.; Su, Y. Vehicle fuel consumption prediction method based on driving behavior data collected from smartphones. J. Adv. Transp. 2020, 2020, 9263605. [CrossRef]
11. Schoen, A.; Byerly, A.; Hendrix, B.; Bagwe, R.M.; dos Santos, E.C.; Miled, Z.B. A machine learning model for average fuel consumption in heavy vehicles. IEEE Veh. Technol. Mag. 2019, 68, 6343–6351. [CrossRef]
12. Ntziachristos, L.; Mellios, G.; Tsokolis, D.; Keller, M.; Hausberger, S.; Ligterink, N.; Dilara, P. In-use vs. type-approval fuel consumption of current passenger cars in Europe. Energy Policy 2014, 67, 403–411. [CrossRef]
13. UN Environment, Electric Light Duty Vehicles. UNEP. 2021. Available online: https://www.unep.org/explore-topics/transport/what-we-do/electric-mobility/electric-light-duty-vehicles (accessed on 30 November 2021).
14. European Commission. 2030 Climate and Energy Framework. Climate Action. 2022. Available online: https://ec.europa.eu/clima/eu-action/climate-strategies-targets/2030-climate-energy-framework_en (accessed on 30 November 2021).
15. European Commission. 2050 Long-Term Strategy. Climate Action. 2022. Available online: https://ec.europa.eu/clima/eu-action/climate-strategies-targets/2050-long-term-strategy_en (accessed on 30 November 2021).
16. Government of Canada. Net-Zero Emissions by 2050. 2021. Available online: https://www.canada.ca/en/services/environment/weather/climatechange/climate-plan/net-zero-emissions-2050.html (accessed on 30 November 2021).
17. Lederer, P.R. Analysis and Prediction of Individual Emissions-Producing Vehicle Activity for Light-Duty Vehicles and Light-Duty Trucks on Freeway Entrance Ramps; University of Louisville: Louisville, KY, USA, 2001.
18. Cappiello, A.; Chabini, I.; Nam, E.K.; Lue, A.; Abou Zeid, M. A statistical model of vehicle emissions and fuel consumption. In Proceedings of the IEEE 5th International Conference on Intelligent Transportation Systems, Singapore, 6 September 2002; pp. 801–809.
19. United States Environmental Protection Agency. Latest Version of MOtor Vehicle Emission Simulator (MOVES); Technical Report; EPA: Washington, DC, USA, 2020.
20. Rakha, H.; Ahn, K.; Moran, K.; Saerens, B.; Van den Bulck, E. Simple Comprehensive Fuel Consumption and CO2 Emissions Model Based on Instantaneous Vehicle Power; Technical Report; TRIB: Washington, DC, USA, 2011.
21. So, J.; Motamedidehkordi, N.; Wu, Y.; Busch, F.; Choi, K. Estimating emissions based on the integration of microscopic traffic simulation and vehicle dynamics model. Int. J. Sustain. Transp. 2018, 12, 286–298. [CrossRef]
22. Hung, W.T.; Tong, H.Y.; Cheung, C.S. A modal approach to vehicular emissions and fuel consumption model development. J. Air Waste Manag. Assoc. 2005, 55, 1431–1440. [CrossRef] [PubMed]
23. Fomunung, I.; Washington, S.; Guensler, R. Comparison of MEASURE and MOBILE5a predictions using laboratory measurements of vehicle emission factors. In Transportation Planning and Air Quality IV: Persistent Problems and Promising Solutions; American Society of Civil Engineers: Reston, VA, USA, 2000.
24. Ntziachristos, L.; Gkatzoflias, D.; Kouridis, C.; Samaras, Z. COPERT: A European road transport emission inventory model. In Information Technologies in Environmental Engineering; Springer: Berlin/Heidelberg, Germany, 2009; pp. 491–504.
25. Ntziachristos, L.; Samaras, Z.; Eggleston, S.; Gorissen, N.; Hassel, D.; Hickman, A. Copert iii. In Computer Programme to Calculate Emissions from Road Transport; Methodol. Emiss. Factors (Version 2.1), Eur. Energy Agency (EEA), Cph.; European Energy Agency: Copenhagen, Denmark, 2000.
26. Tóth-Nagy, C.; Conley, J.J.; Jarrett, R.P.; Clark, N.N. Further validation of artificial neural network-based emissions simulation models for conventional and hybrid electric vehicles. J. Air Waste Manag. Assoc. 2006, 56, 898–910. [CrossRef] [PubMed]
27. Le Cornec, C.M.; Molden, N.; van Reeuwijk, M.; Stettler, M.E. Modelling of instantaneous emissions from diesel vehicles with a special focus on NOx: Insights from machine learning techniques. Sci. Total Environ. 2020, 737, 139625. [CrossRef] [PubMed]
28. Li, Q.; Qiao, F.; Yu, L. A machine learning approach for light-duty vehicle idling emission estimation based on real driving and environmental information. Climate 2016, 1, 1–7. [CrossRef]
29. Barth, M. The comprehensive modal emission model (CMEM) for predicting light-duty vehicle emissions. In Transportation Planning and Air Quality IV: Persistent Problems and Promising Solutions; ASCE: Reston, VA, USA, 2010; pp. 126–137.
30. Ben-Chaim, M.; Shmerling, E.; Kuperman, A. Analytic modeling of vehicle fuel consumption. Energies 2013, 6, 117–127. [CrossRef]
31. Xiang, Q.; Wang, W.; Lu, J. A methodology to develop macro-fuel consumption models for the urban transportation system. Civ. Eng. J. 2004, 37, 104–107.
32. Abukhalil, T.; AlMahafzah, H.; Alksasbeh, M.; Alqaralleh, B.A. Fuel consumption using OBD-II and support vector machine model. J. Robot. 2020, 2020. [CrossRef]
33. Services, E.E. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data; Wiley: Hoboken, NJ, USA, 2015.
34. Government of Canada. Fuel Consumption Ratings. 2021. Available online: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64 (accessed on 30 November 2021).
35. Government of Canada. Fuel Consumption Testing. 2021. Available online: https://www.nrcan.gc.ca/energy-efficiency/transportation-alternative-fuels/fuel-consumption-guide/understanding-fuel-consumption-ratings/fuel-consumption-testing/21008 (accessed on 30 November 2021).
36. Pounis, G. Analysis in Nutrition Research: Principles of Statistical Methodology and Interpretation of the Results; Academic Press: Cambridge, MA, USA, 2018.
37. Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6.
38. Quality of Urban Air Review Group. Diesel Vehicle Emissions and Urban Air Quality; University of Birmingham, Institute of Public and Environmental Health, School of Biological Sciences: Birmingham, UK, 1993.
39. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin, Germany, 2009; pp. 1–4.
40. Tallarida, R.; Murray, R. Chi-Square Test. Manual of Pharmacologic Calculations; Springer: New York, NY, USA, 1987.
41. Van Hieu, N.; Hien, N.L.H. Automatic plant image identification of vietnamese species using deep learning models. Int. J. Eng. Trends Technol. 2020, 68, 25–31. [CrossRef]
42. Hien, N.L.H.; Van Huy, L.; Van Hieu, N. Artwork Style Transfer Model using Deep Learning Approach. Cybern. Phys. 2021, 10, 127–137. [CrossRef]
43. Hien, N.L.H.; Tien, T.Q.; Hieu, N.V. Web crawler: Design and implementation for extracting article-like contents. Cybern. Phys. 2020, 9, 144–151. [CrossRef]