Introduction

Fine particulate matter (PM2.5, with an aerodynamic diameter of <2.5 µm) has been identified as a key component of atmospheric pollution in China due to large amounts of precursor emissions and unfavorable meteorological conditions1, especially during cold seasons. Inhaling PM2.5 of high concentration poses a serious threat to public health, leading to increased risks of cardiovascular and respiratory diseases2. Therefore, accurate prediction of PM2.5 concentrations is indispensable both for public health warnings and emission controls.

The main approaches of PM2.5 prediction can be divided in two categories: i.e., numerical models and statistical models. Numerical models, such as the WRF-Chem3, CMAQ4, and GEOS-Chem5, describe the essential atmospheric physical and chemical processes through mathematical formulas, which establish the explicable relationship among atmospheric components, meteorological parameters, and emissions in spatial and temporal dimensions. These numerical models, termed Chemistry Transport Models (CTMs), have been extensively used in PM2.5 prediction6,7, source apportionment8,9,10 and mechanism analysis11,12,13 worldwide. However, large uncertainties arise from emission rates14, meteorological data15, as well as the over-simplified chemical parameterizations16, may result in certain deviations from the observed PM2.5 concentrations. To reduce model uncertainties, data assimilation has been extensively utilized to provide more precise initial chemical conditions by incorporating in-situ pollutant observations and satellite retrievals into CTMs17,18,19,20,21. It showed beneficial effects on PM2.5 forecasts for up to 24 h while the benefit from data assimilation diminished rapidly with forecast range17,21. In addition, low computational efficiency is another disadvantage of the CTMs, which usually take several hours for one simulation day with fine resolution22.

Statistical methods essentially establish the relationship between multiple predictors and historically observed PM2.5 concentrations through regression models or machine learning (ML) algorithms, which do not involve the complex physical and chemical processes and usually have less demand for computing resources compared to the CTMs. Initially, linear regression models were applied to PM2.5 forecast and showed reasonable accuracy23. However, the linear assumption in these models sometimes might not precisely capture the connection between predictors and air pollutant concentrations. In recent years, ML algorithms have gained popularity in air quality forecast due to their strong power in effectively handling nonlinear relationship between predictors and targets24. For example, Ma et al.25 utilized the XGBoost algorithm to predict PM2.5 concentrations in Shanghai and reported a correlation coefficient (R) of 0.77 and Root Mean Square Error (RMSE) of about 12–18 μg m−3 for 24 h ahead prediction. Bi et al.26 developed a PM2.5 forecast system by combing the random forest algorithm with CTM model results in central China, which had R2 of 0.76 and 0.64 for the next 2 days. These ML methods show good capacities in capturing non-linear relationships between features and aerosol concentrations at single sites. However, air pollutant forecasting is usually a regional issue related to both spatial relations and time sequences. The traditional ML methods are still difficult to resolve the complex spatiotemporal correlations27.

Nowadays, deep learning (DL) networks show remarkable capabilities in forecasting air pollutant variations27,28 due to their successful applications in dealing with nonlinear spatiotemporal correlations and advances in computing resources. DL is based on multiple-layer neural networks. A kind of the representative models is Convolutional Neural Network (CNN) - Recurrent Neural Network (RNN) architectures in which CNNs extract features and spatial relations29,30 while RNNs handle the temporal dependencies31,32. Long Short-Term Memory (LSTM) is a type of RNN specially designed for overcoming gradient vanishing and exploding when managing long-term dependencies33, which has been extensively employed in spatial-temporal forecasting of air pollutant30,34,35,36,37. For example, Yan et al.30 built a hybrid DL network (CNN-LSTM) to predict PM2.5 concentrations in next 6 h in Beijing based on historical PM2.5 data, meteorological indicators, and spatiotemporal data. Pak et al.35 combined the CNN-LSTM model with a spatiotemporal feature vector to reflect the relations among parameters to forecast the next day’s daily PM2.5 concentrations in Beijing and outperformed the traditional CNN-LSTM method. Yeo et al.36 integrated the gated recurrent unit (GRU) algorithm, a similar architecture as LSTM but has fewer parameters, with a CNN accounting for the geographical correlation of nearby stations. Using the new method, they improved the 24-h PM2.5 prediction in Seoul by about 10%. Overall, the DL-based model is a promising and efficient tool for PM2.5 forecasts. However, the proposed DL models in above studies are usually for forecasting air pollution concentrations in next day on urban scale, which may not fully meet the requirements in practical applications.

In contrast to previous DL architectures, we develop a more advanced spatial-temporal DL model for short-range (0–72 h) PM2.5 forecast on regional scale, named air Pollution-Predicting Net for PM2.5 (PPN). Our model injects the feature variables in different convolutional layers in terms of their impacts on PM2.5, to imitate the behavior of CTMs. In addition, the observed PM2.5 concentrations over multiple preceding time-steps are also included to provide accurate initial field of PM2.5 forecast, like the Four-Dimensional Data Assimilation (FDDA) in CTMs. Thus, the PPN model integrates the strengths of DL network, CTMs and assimilation and is expected to achieve better performance in forecasting regional PM2.5 concentration with wide forecast time range. We apply the PPN model for forecasting hourly PM2.5 concentrations over a highly populated and industrialized region in China. Firstly, model performance for PM2.5 forecasts in both temporal and spatial distributions is evaluated, which is followed by impacts of weighted loss function and preceding observed PM2.5. Besides, we also compare the PPN model performance with the WRF-Chem results. Finally, model structure and data information are displayed.

Results

Overall performance

Using the trained PPN model and predictors, we forecasted the PM2.5 concentrations in January 2022 at every 9 km × 9 km grid cell over the Beijing-Tianjin-Hebei (BTH) region. Figure 1a displays the scatter plot of predicted and observed PM2.5 concentrations for January 2022 with the initial forecast time of UTC 00 every day. As we see, the R2 and RMSE values are 0.70 and 17.7 μg m−3, respectively. The result for June 2022 when PM2.5 pollution was light is also shown in Supplementary Fig. 1, with R2 and RMSE values of 0.49 and 6.9 μg m−3. The lower R2 and RMSE in comparison to that in winter may be a result of low PM2.5 concentration in June with small fluctuation. There is little difference among forecast results from different initial forecast time (Supplementary Table 1). The fairly good performance achieved by the PPN model suggests that it is suitable for forecasting PM2.5 concentrations. Aerosol pollution is usually severe in winter due to stagnation and intensive emissions38, so we mainly focus on model evaluation for the results in January 2022 in the following.

Fig. 1: Three-hourly scatterplots and probability distributions of observed and predicted PM2.5 concentrations over the BTH region during January 2022.
figure 1

The initial forecasting time is 00 UTC every day. The red lines denote the linear fitting lines, and the dash lines are y = x reference lines. a Comparison by PPN results. b Comparison by PPN_no_PO results (The PPN_no_PO will be specifically explained in “Impacts of preceding observation restraint”).

Figure 2 shows the spatial distributions of observed and forecasted PM2.5 concentrations. It can be found that the heavily polluted region is in the south with average concentration exceeding 100 μg m−3. The spatial pattern of PM2.5 over the BTH region is predominantly related to emission, meteorological and orographic factors39. The southern region is more industrialized than the northern region, resulting in higher emission rates of PM2.5 and its precursors. North winds can blow away haze, leading to more clean air in the north region, while south winds facilitate pollutant accumulation along mountains in the south. These determinant factors are all considered in the model training process. Therefore, the trained model can accurately capture the spatial pattern of PM2.5 with higher values in the south and lower values in the north over the BTH region.

Fig. 2: Spatial performance of the PPN model.
figure 2

a, b Comparisons of the spatial distribution of observed and predicted PM2.5 concentrations by the PPN model over the BTH region during January 2022. c, d Corresponding MB (mean bias) and RMSE values. The initial forecasting time is 00 UTC every day.

It should be noted that RMSEs exhibit higher values in the southern region (up to 40−50 μg m−3) and lower values in the north (about 10−30 μg m−3) (Fig. 2c). Although higher base PM2.5 concentration in the south could explain part of it, the underpredicted PM2.5 with MB values of −15−−5 μg m−3 in the south may be the primary cause of high RMSEs. We further analyze the temporal variations of biases over three representative cities (Beijing, Shijiazhuang and Handan) from north to south (Supplementary Fig. 2). It is found that large biases in Handan, a southern city, mainly occurred around January 12 and 25 when cold air was moving from the north to the south (Supplementary Fig. 3). It dispelled the haze in the north whereas blew pollutants to the southern region. With regards to the southern cities, the “polluted” north winds are beneficial to increasing PM2.5 concentrations. However, the PPN model still has difficulty in capturing PM2.5 that is rapidly transported from a long distance. This deficiency may be caused by the consideration of a small range of transport process (45 × 45 km) in a timestep with limited computing resources.

We also compare observations with predicted PM2.5 concentrations for 0−72 h in advance, as shown in Fig. 3 (orange lines). The predicted mean PM2.5 concentrations are about 65 μg m−3 over the whole region, which are slightly lower than observations (Figs. 3a and d). The PPN results present decreased R2 and increased RMSE with forecast lead time, and the space-average values of 0.57−0.74 and 12−18 μg m−3, respectively. It is worth noting that the model performance gets worse with forecast lead time. This downward trend is particularly obvious during the 0–24 h and becomes stable after 24 h. As shown in Fig. 3b, the spatial R2 is about 0.74 at the initial forecast time, then rapidly reduced to 0.58 at the forecast lead time of 24 h. The RMSE value is about 12 μg m−3 in the beginning, then decreases to 17 μg m−3 at 24 h (Fig. 3c). The model performance levels off during the forecast lead time of 24−72 h with R2 of about 0.57 and RMSEs of 17−18 μg m−3, which is still an acceptable result. In a word, the model exhibits better performance at 0−24 h and achieves stable and reasonable prediction capacity when the forecast time extends to 3 days.

Fig. 3: Temporal performance of the PPN and PPN_no_PO model.
figure 3

Temporal variations of (a) observed and predicted PM2.5 concentrations by PPN and PPN_no_PO during 0−72 h forecast period for the test dataset (January 2022); (b) R2; (c) RMSE; (d) MB.

As mentioned in “Introduction”, ML algorithms have been extensively applied in air quality forecast due to strong capacity in effectively dealing with nonlinear relationship. Here, we compare the PPN model results with the other three models based on Random Forest, XGBoost and Multilayer Perceptron algorithms, called RF, XGB and MLP for short, respectively. Random Forest and XGBoost are both ensemble learning algorithms based on decision trees for classification and regression, and have been employed in pollutant dataset construction40 and air quality forecast25,26. Multilayer Perceptron is a basic algorithm for neural networks. In order to emphasize the advantage of the deep network of PPN, we employed a shallow network of the MLP model for comparison. The MLP model has 2 hidden layers of 32 and 8 neurons. We trained the RF, XGB and MLP models using the same features and time periods as those in the PPN model. In addition, each grid over the BTH region was utilized in the model training process, which do not fully consider the spatiotemporal correlations between the predictors and the target. The comparison results show that our PPN model has a prediction performance superior to the other three models in forecasting PM2.5 variations over 13 cities during January 2022 (Supplementary Fig. 4 and Fig. 5). This is highly related to the utilization of multi-convolution layers to capture the spatial relationship among grid cells and LSTM layers for temporal variations, as well as the introduction of preceding observation restraint and weighted loss function that will be elaborated in the following.

Impacts of weighted loss function

From the above, we can conclude that the proposed PPN model can well capture the spatiotemporal variability of PM2.5 concentration over the BTH region for the next 3 days. Interpolated PM2.5 data based on site observations were applied as fitting targets to acquire the gridded results. However, the interpolation may introduce biases into the target dataset, especially in regions with sparsely or unevenly distributed sites. In this study, we introduced a weighted loss function in the model training process to eliminate the adverse effects of PM2.5 interpolation. Detailed information about the weighted loss function can be found in “Methods” section at the end. The influence of the weighted loss function is evaluated in this section to illustrate the innovation of the PPN model. To achieve this, we conducted a comparative experiment that uses Mean Square Error (MSE) as loss function and do not consider the Inverse Distance Weighted (IDW)-based loss function adopted in the PPN model, called PPN_no_IDW for short. The comparison between PPN and PPN_no_IDW results under different PM2.5 levels is shown in Fig. 4. The PM2.5 classification is based on the Technical Regulation on Ambient Air Quality Index (HJ 633–2012). Here, MB is utilized as the evaluation metric instead of RMSE, to better differentiate between overprediction and underprediction. The comparison is based on results from the 114 monitoring sites over the BTH where direct observations are available.

Fig. 4: Probability density of MB values from PPN and PPN_no_IDW results.
figure 4

The results are for the testing dataset (January 2022) under different PM2.5 concentration levels. The dash lines denote the corresponding average values. The MB values are derived from results at the 114 monitoring sites over the BTH region. a PM2.5 ≤ 35 μg m−3; b 35 μg m−3 < PM2.5 ≤ 75 μg m−3; c 75 μg m−3 < PM2.5 ≤ 115 μg m−3; d PM2.5 > 115 μg m−3.

From Fig. 4, we find that the PPN model tends to overpredict PM2.5 concentrations under clean condition (PM2.5 concentration ≤35 μg m−3) while underprediction occurs when the pollution level aggravates. The “high values underestimation, low values overestimation” phenomenon was also previously reported by Mao et al.41, suggesting the difficulty of capturing extreme values by deep learning algorithms at present. By comparison, the PPN model with weighted loss function outperforms the PPN_no_IDW and decreases the level of “high values underestimation, low values overestimation”. The improvement by the PPN is significant when PM2.5 concentration is ≤35 μg m−3 (clean condition) and >115 μg m−3 (moderate or heavy polluted condition). The probability of MB values >15 μg m−3 from the PPN model under clean condition is much less than that from the PPN_no_IDW (Fig. 4a). Under moderate or heavy polluted condition, the probability of large bias is also reduced by the PPN model. The average MB value by PPN is −28 μg m−3, whereas that from the PPN_no_IDW is −35 μg m−3 (Fig. 4d). The improvement by the PPN model is less significant when PM2.5 concentration is ≤115 μg m−3 and >35 μg m−3 (Fig. 4b, c). Overall, the PPN model, which considers more information of in-situ observations in the loss function, can achieve higher prediction accuracy at the monitoring sites.

Impacts of preceding observation restraint

To imitate the FDDA behavior for CTMs, another key feature of the PPN model is to consider the preceding PM2.5 observations within the encoder phase. To verify the significance of the observation restraint, our study also established a model similar to the PPN framework but without the Preceding Observation restraint in the encoder phase (called PPN_no_PO). Figure 1b shows the scatterplot of observed and predicted PM2.5 concentrations by PPN_no_PO for the test dataset. Compared to the PPN results, the PPN_no_PO model shows poorer performance with a lower R2-value of 0.63 and larger RMSE value of 19.9 μg m−3. Figure 2 also displays the predicted PM2.5 concentration by the PPN_no_PO for multi-step ahead prediction (green lines) and the comparison with the standard PPN result. The spatial R2 and RMSE values from the PPN_no_PO result remain at about 0.57 and 18 μg m−3 while those from the PPN model are 0.57−0.74 and 12−18 μg m−3, respectively. The better performance of the PPN model is readily evident during the 0−24 h forecast lead time. It highlights the importance of considering the preceding PM2.5 concentration in the encoder phase and indicates that the proposed PPN model is more advanced in PM2.5 forecast than traditional DL methods.

We also compare the PPN and PPN_no_PO results over the monitoring sites since most of these sites are close to human settlements and susceptible to attracting public attention. The predicted temporal characteristics of PM2.5 by PPN and PPN_no_PO over 13 cities of the BTH region are shown in Supplementary Fig. 6. The corresponding evaluation metrics are shown in Fig. 5. The comparisons in Supplementary Fig. 6 and Fig. 5 are based on results in 0−24 h forecast lead time when the difference between PPN and PPN_no_PO is most prominent. In general, the PPN-predicted PM2.5 concentrations are in good agreement with the observations at the monitoring sites. It well captured two regional pollution events during Jan 6−12 and Jan 22−26. The values of R2 and RMSE are 0.42−0.84 and 15−42 μg m−3, respectively. Similar to the previous results, the PPN predicted PM2.5 in the northern cities (BJ, TJ, LF, ZJK, CD, TS, QHD) performs better than those over the southern region (BD, SJZ, HD, CZ, HS). By comparison, the PPN architecture exhibits better performance than PPN_no_PO in forecasting PM2.5 concentration over most monitoring sites with larger R2 and smaller RMSE (Fig. 5). The performance improvement is more significant in the northern cities, such as BJ, TJ, LF and TS, with RMSE values decreased by 20−28% compared with those from PPN_no_PO. However, the reduction ratios are just 3−6% over the southern cities (HD, XT, HS).

Fig. 5: Evaluation metrics for predicted PM2.5 concentrations by PPN and PPN_no_PO.
figure 5

The results are derived from 13 cities of the BTH region during January 2022. The comparisons are based on results in 0−24 h forecast lead time when the difference between PPN and PPN_no_PO is most prominent. a R2, b RMSE.

Comparison with the WRF-Chem results

To further evaluate the performance of the PPN model in forecasting PM2.5 concentrations, we also compare it to the state-of-the-art CTM WRF-Chem model. In contrast to the PPN model, the WRF-Chem model integrated with data assimilation is built on the numerical methods that fully takes into account the atmospheric physical and chemical processes. It is a more interpretable tool for air quality forecast. Supplementary Fig. 7 shows the time series of predicted PM2.5 by PPN and WRF-Chem over monitoring sites during January 2022, and the corresponding evaluation indices are displayed in Fig. 6. The temporal R2 and RMSE values from the WRF-Chem are 0.30−0.77 and 19−45 μg m−3 while those from the PPN model are 0.42−0.84 and 15−42 μg m−3. Among 13 cities over the BTH, the RMSEs from the PPN model are 1−35% lower than those from the WRF-Chem. The reduction ratios in 10 cities are >10%. The PPN forecast result in June 2022 is also superior to the WRF-Chem result with larger R2 and lower RMSE (Supplementary Figs. 8 and 9). Therefore, we conclude that the PPN model outperforms the WRF-Chem. Moreover, we also evaluate the model capacity in forecasting PM2.5 concentration at different forecast lead times over the monitoring sites, as illustrated in Fig. 7. It is noted that the average RMSE from the PPN model is 25 μg m−3 at 0−24 h which is 17% lower than that from the WRF-Chem (30 μg m−3) (Fig. 7a). The reduction ratios slightly decrease to 14−16% with increased forecast lead time (24−72 h). The average R2-values of the PPN model are 0.61−0.68, which are higher than those of the WRF-Chem model (0.53−0.57) (Fig. 7d). No matter what the forecast lead time is, the PPN model outperforms the WRF-Chem in forecasting PM2.5 variations. We also find that the PPN model exhibits better predictive capacity particularly under clean and good air quality conditions (PM2.5 < 75 μg m−3), up to 25% lower than the RMSE from WRF-Chem. However, the maximum reduction ratio on polluted days is just 13%.

Fig. 6: Evaluation metrics for predicted PM2.5 concentration by PPN and WRF-Chem.
figure 6

The results are derived from 13 cities of the BTH region during January 2022 with the forecast lead time of 0−24 h. a R2, b RMSE.

Fig. 7: Box plots of RMSEs and R2 from PPN and WRF-Chem.
figure 7

The results are at 0−24 h, 24−48 h and 48−72 h forecast lead time over 13 cities on the test dataset (January 2022). The red dots, orange lines denote mean and medium values, and boxes and whiskers represent 25%−75 and 10−90% percentiles. a RMSEs for total test dataset; b RMSEs for clean and good conditions (PM2.5 ≤ 75 μg m−3); c RMSEs for pollution conditions (PM2.5 > 75 μg m−3); d R2 for total test dataset.

Summary and discussion

In this study, we developed a new spatial-temporal PPN model for short-range PM2.5 forecasting and applied it to the BTH region. The model separated the input features into “local” or “non-local” layers according to their direct impacts on PM2.5 to imitate the behavior of CTMs. We trained and validated the PPN model with meteorological, emission and derived data during 2020 and 2021, and utilized the trained model to predict PM2.5 variations in January 2022, which is served as the testing dataset. The predictive ability of the model was evaluated on a test dataset with observations.

Overall, the proposed model exhibits good accuracy in predicting spatial and temporal variations of PM2.5 concentrations over the BTH, with R2 and RMSE values of 0.70 and 17.7 μg m−3, respectively. Regarding spatial characteristics, the model successfully captures the high PM2.5 concentrations in the south and relatively low concentrations in the north. The model shows better performance in the next 24 h with spatial R2 and RMSE of 0.58−0.74 and 12−17 μg m−3, respectively. Then the performance gets stable as the forecast lead time increases. In a word, the model has good capacity in forecasting spatial and temporal variability of PM2.5 over the BTH.

The consideration of the weighted loss function and preceding PM2.5 observation in the encoder phase are two innovative points in the model structure. To further explore the advantage of the PPN model, we conducted two sensitivity tests, in which the aforesaid two points are not taken into account. Results show that the PPN model with weighted loss function outperforms the PPN_no_IDW and decreases the level of “high values underestimation, low values overestimation”. Considering the preceding PM2.5 concentration into encoder phase greatly improves the results in the next 24 h with R2 increased from 0.57 to 0.58−0.74 and RMSE decreased from 18 μg m−3 to 12−17 μg m−3. This improvement is also evident over the monitoring stations, especially over the northern cities. We also compare the model performance with the state-of-the-art WRF-Chem results. By comparison, the PPN model shows better predictive accuracy than WRF-Chem. The temporal R2 and RMSE values from the WRF-Chem are 0.30−0.77 and 19−45 μg m−3 while those from the DL model are 0.42−0.84 and 15−42 μg m−3. This better performance exists within all the forecasting lead times.

Spatial-temporal DL algorithms exhibit powerful capacity in dealing with non-linear correlations, making them suitable for air quality forecast. Several hybrid DL networks (CNN-LSTM and Graph Convolutional-LSTM) have already been successfully applied in PM2.5 forecast in winter Beijing with RMSE values of about 22−53 μg m−3 in the first 6 h30 and 24 μg m−3 for next 24 h42, respectively. Unlike the PPN model, these two results are based on a single or several surrounding sites in Beijing. The corresponding RMSEs from PPN are 11−17 μg m−3. Although the comparison is based on different forecast period, the better performance of PPN can still demonstrate the advanced network structure that imitates the behavior of CTMs with inclusion of the preceding PM2.5 observations and weighted loss function. In summary, the PPN model demonstrates strong potential for the application of spatiotemporal PM2.5 forecast over the BTH region, and it is an efficient and accurate tool for regional PM2.5 forecast.

The application of the PPN model could be extended to other components of air pollution, e.g., ozone and nitrogen oxides, with their related meteorological data and emission data as the input parameters. Moreover, it should be noted that we designed the PPN model for short-range air quality forecasts. For medium-range (over 5 days) forecasts on a day-to-day basis, spatial information would be more important than short-range one due to long-distance transport. Therefore, there would be a different DL model for the medium-range air quality forecasts.

Methods

Model structure

PPN has an encoder-decoder architecture and uses the PredRNN43 as the backbone network for capturing the spatiotemporal variation for PM2.5 concentrations. The PredRNN uses multi convolution layers to capture the spatial relationship among grid cells and LSTM layers for temporal variations. It is an updated version of ConvLSTM44, which adds a shortcut connecting the last convolutional layer and the first convolutional layer between adjacent timesteps. The model structure is shown in Fig. 8.

Fig. 8: Structure of PPN model.
figure 8

The PPN model structure with an encoder-decoder architecture. The spin-up timesteps are encoding phase (Tm, ...T0) and the forecasting timesteps are decoding phase (T1, ...Tn).

Spatially, the model is designed following the order of aerosol-related processes in CTMs. There is one convolutional layer with the convolutional kernel sizes of 1 × 1 (9 × 9 km), representing the local processes (i.e., chemical and turbulence diffusion processes). Two non-local layers follow, representing the short-distance transport and long-distance transport, with 3 × 3 (27 × 27 km) and 5 × 5 (45 × 45 km) respectively. Rather than considering all input features in the same layer as the PredRNN, the PPN framework firstly separate the feature variables into two parts, i.e., local variables and non-local variables (Table 1), according to their main physical-chemical effects on aerosol. The local variables including emissions, temperature, humidity, precipitation, sea level pressure (SLP) and planetary boundary layer height (PBLH), etc., which are more directly to affect local PM2.5 formation and depletion, thus are gone into the first “local layer”. These variables concerning synoptic patterns and transport process, such as wind, geopotential height at 700 hPa and terrain height, introduce to the PPN model at the beginning of “short transport layer”. It should be noted that now the PPN model only considers a slightly small range of transport process (within 5 × 5 grid cells) in a timestep due to the limitation of our computing resources. Therefore, large-scale inter-regional transport within one timestep may not be well represented by the model. More computing resources that can afford a large size of convolutional kernel could help to solve the problem in the future.

Table 1 Selected features used in this study.

Temporally, the model uses an asymmetric encoder-decoder structure, all the timesteps before the initial time are encoding phase, like the spin-up phase in CTMs and Numerical Weather Prediction (NWP) models. The forecasting timesteps are decoding phase. The outputs of decoder in every timestep are the PM2.5 forecasts. During the decoding phase, the model inputs the meteorological variables, emissions and PM2.5 forecast from the last timestep as the feature variables. And in every timestep of the encoding phase, in addition to the above variables, the PPN model added the PM2.5 observation in the first convolutional layer. Since the LSTM is able to learn and remember the input information for the past period, we expect the PPN to provide a better initial field by adding multi observations across a preceding time period. This idea is similar to a sequential FDDA such as grid nudging45,46 or Ensemble Kalman Filter47.

Experimental region and feature selection

We took the Beijing-Tianjin-Hebei (BTH) region (36.3−42.0°N, 111.8−121.6°E) as our experimental region, which is surrounded by mountains in the north and west and consists of megacities like Beijing (BJ) and Tianjin (TJ) (Fig. 9b). The BTH region is a highly industrialized and densely populated region confronting with severe aerosol pollution48, thus has strong demands for air quality prediction to formulate emission reduction strategy and alert the public health.

Fig. 9: WRF domain configuration and the experimental region.
figure 9

The colors, red dots and yellow characters in the right panel denote the terrain height (m), PM2.5 observation sites and city locations over the BTH, respectively. a WRF domain; b PPN experimental region.

PM2.5 concentration is influenced by both meteorological conditions and precursor emissions. Therefore, we classified the input parameters into three categories: i.e., meteorological variables, emission data and derived parameters, as listed in Table 1. Meteorological data were from the Weather Research and Forecasting (WRF) model simulation that provides a downscaled meteorological field based on the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis data (https://cds.climate.copernicus.eu/, last access: 2023/02/27). The WRF model was configured to cover the whole China region (Fig. 9a) following Feng et al.49, with a horizontal resolution of 9 km and 38 vertical layers up to 50 hPa. The initial and boundary conditions were obtained from the ERA5 reanalysis data with horizontal resolution of 0.25° × 0.25°. Supplementary Table 2 shows the detailed parameterization schemes used in the WRF simulation. The meteorological variables included factors that relate to synoptic pattern (SLP, geopotential height at 700 hPa), aerosol chemistry (temperature, relative humidity), transport (wind at 10 m and 850hPa, terrain height), vertical diffusion (PBLH), and wet scavenging (precipitation).

We also considered the emission data as input features to distinguish spatial and monthly variations of PM2.5 precursors. The emission data were based on the Multi-resolution Emission Inventory for China (MEIC) developed by Tsinghua University (http://meicmodel.org/, last access: 2022/6/30). It is known that SO2, NOx, VOCs and primary PM2.5 directly contribute to the formation of sulfate, nitrate, black carbon, organic carbon and other inorganics that constitute PM2.5 in the atmosphere. Thus, emission rates of these species were utilized as input features in this study. As the original emission data are in the resolution of 0.25°, we firstly interpolated them to the WRF grid with a horizontal resolution of 9 km.

In addition to the meteorological and emission variables, several derived parameters were also used as input features, as suggested by aerosol-meteorology interactions. Tai et al.50 concluded that high PM2.5 concentrations in the continents were correlated with reduced SLP due to simultaneous presence of cyclones and convergence. Therefore, the change of SLP compared to 24 h ahead was selected as an input feature in our study. Additionally, cold fronts bring a decrease in temperature while stagnation is usually associated with increasing or steady temperature. Thus, changes of temperature indicate synoptic variations thus affecting aerosol concentrations. Rather than utilizing temperature changes at surface, we employed those at 850 hPa to minimize topographic and urban effects. Besides, annual mean PM2.5 concentrations (averages during 2020–2021) in each grid were also adopted as a static feature for learning spatial correlations. Another important feature is the preceding PM2.5 observation in the encoder phase and forecast start time to reduce the forecast error that has already been previously introduced.

All of the input variables were firstly standardized to eliminate impacts of unit inconsistency on forecasting results. We assumed that all data obey the standard normal distribution with a mean value of 0 and a standard deviation of 1, thereby they could be standardized one by one following:

$$z=\frac{x-\bar{x}}{\sigma }$$
(1)

where \(z\) and \(x\) represent the standardized and original data, \(\bar{x}\) and \(\sigma\) are mean and standard deviation values.

Observation data

Surface PM2.5 concentrations were obtained from monitoring stations over the BTH region and its surrounding area, accessed from the China National Environmental Monitoring Centre (CNEMC, http://www.cnemc.cn/). The distribution of these stations is shown in Fig. 9b. To keep consistent with the input feature map, we interpolated the station data of PM2.5 concentration to grid data with a horizontal resolution of 9 km using the Inverse Distance Weighted (IDW) method. The stations with high density over the BTH and its surrounding areas could reduce the bias from interpolation to some extent. The gridded PM2.5 data were applied both in the fitting target and input features. These input features are the annual mean PM2.5 and the preceding PM2.5 which have been introduced in ”Experimental region and feature selection“.

WRF-Chem results

The purpose of our study is to develop a more efficient and accurate PM2.5 forecasting system based on the PPN framework. For comparison purposes, we also adopted the WRF-Chem model to predict hourly PM2.5 concentrations with a resolution of 9 km, covering the North China Plain and its surrounding areas. The WRF-Chem model has the same meteorological configuration as the WRF described in “Experimental region and feature selection”. For chemistry, the Carbon Bond Mechanism version Z (CBMZ)51 and the Model for Simulating Aerosol Interactions and Chemistry (MOSAIC)52 were used as the gas-phase chemistry and aerosol mechanisms to conduct PM2.5 forecasting. Emission rates of anthropogenic pollutants were also from the MEIC inventory, the same as those in the PPN framework. To improve the forecast skill, the three-dimensional variational (3DVAR) algorithm was applied in the WRF-Chem model. Surface pollutant observations (PM2.5, PM10, SO2, NO2, O3, and CO) were assimilated into the model following the method in Sun et al.20. The predicted PM2.5 concentrations generated from the WRF-Chem model were compared with the PPN results to further demonstrate the performance of the proposed DL model.

Experiment

In this study, the input features together with the PM2.5 observations in 2 years (2020–2021) were adopted as the training and validating datasets, 90% of which were randomly chosen as the training set (for model training) and 10% of which were the validating set (for model hyper-parameter optimization or tuning). Data from January and June 2022 were employed as the test dataset to evaluate the model performance. Aerosol pollution is usually severe in winter due to stagnation and intensive emissions38, thus we mainly focus on model evaluation for the January 2022 results. There are 5800 sequences with 80 × 80 grids in the training and validating dataset. Each sequence in the encoder-decoder architecture contained 2 days for inputs and 3 days for prediction with a time resolution of 3 h. In the PPN model training, Adam Optimizer and one-cycle learning rate schedule with a weight decay parameter of 10−4 were utilized. This learning rate schedule can substantially speed up training process with high test accuracy53. The gradient clipping strategy was also adopted in our model to avoid “gradient explosion” phenomenon.

As described in “Observation data”, the interpolated PM2.5 data were applied as fitting targets to acquire a gridded result. However, the interpolation may introduce biases into the target dataset, especially in regions with sparsely or unevenly distributed sites. To eliminate the influence of interpolation on model training, we proposed a weighted loss function based on the IDW method and mean square error (MSE) metric (WMSE). The WMSE is defined as:

$${WMSE}=\frac{{\sum \nolimits_{i=1}^{n}{w}_{i}\times (\hat{{y}_{i}}-{y}_{i})}^{2}}{\sum\nolimits_{i=1}^{n}n\times {w}_{i}}$$
(2)
$${w}_{i}=\left\{\begin{array}{l}\frac{\sum\nolimits_{j=1}^{1\le {d}_{j}\le 3}\frac{1}{{d}_{j}^{2}}}{{m}_{i}}\,1\le {d}_{j}\le 3\\ 1\,{d}_{j}=0\\ 0.1\,{d}_{j} \,>\, 3\end{array}\right.$$
(3)

where \({w}_{i}\) denotes the weight factor for MSE in grid i, \({y}_{i}\) and \(\hat{{y}_{i}}\) represent the observed and predicted PM2.5 concentrations in grid \(i\). \({w}_{i}\) is considered as 1 if there is one or more PM2.5 monitoring stations in gird \(i\). If there is no station in gird \(i\), \({w}_{i}\) is the average of square sum of the inverse distance between gird \(i\) and the stations in 3 × 3 grids. If there is no station in the surrounding 3 × 3 grids, \({w}_{i}\) is set to a minimum value 0.1. The application of WMSE increases the weights of MSE in areas near the PM2.5 monitoring sites, thus reducing biases introduced by interpolation over regions with sparse sites during the model learning process.