Data Source and Statistics
This data set includes hourly air pollutant data from 12 nationally controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality site are matched with the nearest weather station of the China Meteorological Administration. For the coordinates and district of all 12 stations, we use an external source that records their locations as longitude and latitude, which is useful for visualization; district information is inferred from the station names. The time period runs from March 1st, 2013 to February 28th, 2017, four consecutive years. Missing data are denoted as NA.
We have inspected the dataset and, as mentioned above, found that several columns are marked as NA for several stations over short spans, usually of less than 2 days, possibly due to temporary equipment failure or maintenance. Here we show a histogram of all the values of these pollutants.
The data are collected hourly on every day between March 1st, 2013 and February 28th, 2017, so no day in this four-year period is missing, and the dataset should offer enough information about air quality and weather for the period. We made the data complete by filling the NAs through interpolation (all variables change continuously over time, so estimating values through interpolation is viable) and by converting the wind direction from lettered labels to degrees clockwise from north (and interpolating it as well).
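The cleaning step described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the exact procedure we ran: the column name `wd`, the 16-point compass labels, the helper `fill_missing`, and the sample values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed 16-point compass labels, mapped to degrees clockwise from north.
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
WD_DEGREES = {label: i * 22.5 for i, label in enumerate(COMPASS)}


def fill_missing(df: pd.DataFrame, wd_col: str = "wd") -> pd.DataFrame:
    """Return a copy with wind direction in degrees and NA values interpolated.

    Hypothetical helper; `wd_col` is an assumed column name.
    """
    out = df.copy()
    # Missing or unrecognized labels become NaN and are filled with the rest.
    out[wd_col] = out[wd_col].map(WD_DEGREES)
    numeric = out.select_dtypes(include="number").columns
    # Linear interpolation between neighbouring hours; limit_direction="both"
    # also fills gaps at the very start or end of the series.
    out[numeric] = out[numeric].interpolate(method="linear", limit_direction="both")
    return out


# Tiny made-up sample (not real measurements) to show the behaviour.
sample = pd.DataFrame(
    {"PM2.5": [10.0, np.nan, 30.0, 40.0],
     "wd": ["N", "E", None, "S"]},
    index=pd.date_range("2013-03-01", periods=4, freq=pd.Timedelta(hours=1)),
)
filled = fill_missing(sample)
print(filled)
```

Note that plain linear interpolation treats degrees as an ordinary number, so a gap between, say, NNW and N would be filled by passing through south. Interpolating the sine and cosine components of the direction separately is one possible alternative when that matters.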
We consider the data coherent after analyzing it in the following aspects:
Coherence within the same station: All data fall within the expected range, with no inconsistencies between parts of the data, especially for easily verifiable variables such as rain, wind, and temperature; no abrupt or instantaneous changes in any variable are detected.
Coherence between different stations: Data from different stations, whether pollutant or weather data, are mostly consistent at the same moment, indicating similar weather throughout the city.
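The two kinds of coherence check can be illustrated with a short pandas sketch. The helper names, the `TEMP` column, the station names, and the thresholds are hypothetical and chosen only for the example; the sample values are made up rather than taken from the dataset.

```python
import pandas as pd


def out_of_range(series: pd.Series, low: float, high: float) -> pd.Series:
    """Boolean mask of values outside the expected [low, high] range."""
    return (series < low) | (series > high)


def abrupt_changes(series: pd.Series, max_step: float) -> pd.Series:
    """Boolean mask of hours whose change from the previous hour exceeds max_step."""
    return series.diff().abs() > max_step


def cross_station_corr(frames: dict, column: str) -> pd.DataFrame:
    """Pairwise correlation of one variable across stations, aligned on time."""
    wide = pd.DataFrame({name: df[column] for name, df in frames.items()})
    return wide.corr()


# Made-up hourly temperatures for two hypothetical stations (not real data).
idx = pd.date_range("2013-03-01", periods=5, freq=pd.Timedelta(hours=1))
frames = {
    "station_a": pd.DataFrame({"TEMP": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx),
    "station_b": pd.DataFrame({"TEMP": [1.5, 2.4, 3.6, 4.5, 5.5]}, index=idx),
}
temp_a = frames["station_a"]["TEMP"]

# Same-station checks: assumed plausible range and assumed maximum hourly step.
n_out_of_range = int(out_of_range(temp_a, -30.0, 45.0).sum())
n_abrupt = int(abrupt_changes(temp_a, 10.0).sum())

# Between-station check: values near 1 suggest the stations move together.
corr = cross_station_corr(frames, "TEMP")
print(n_out_of_range, n_abrupt)
print(corr)
```

Counts of zero for the first two checks and correlations close to 1 for the third would be in line with the coherence described above; suitable thresholds depend on the variable being examined.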
In general, the dataset can be divided into 2 kinds of information: air quality (PM10, PM2.5, NO2, O3, etc.) and weather (rain status, temperature, pressure, wind direction, etc.). We verified the former by looking up historical air quality data for randomly chosen times on Beijing’s air quality information website, and found no errors in the samples we checked. For the latter, we searched the meteorological historical information website and confirmed that the weather data for each station fall within an acceptable range of the weather records for the same time period, given that a city’s weather record represents only an average over the whole city region and each station should show its own variations.
This dataset was obtained from Here. According to the UCI ML datasets website, a relevant paper has been published. The dataset contains all the data we need for analysis in CSV form, so the source shows a considerable level of accountability. We acquired longitude and latitude data for the stations from Here.