Working with temperature data sources for Hanoi, 2019

Binh Nguyen, Independent Researcher, Hanoi, Vietnam

Making sense of available data is a real challenge. With ever-increasing numbers of sensors and connected devices such as IoT devices, knowing how to compare collected data with other sources becomes essential to ensure that the data at hand is useful and a true representation of the physical world. Temperature is arguably one of the simplest environmental parameters: it is easy to measure, sensor options are abundant, and it is widely used to great effect. In this post, I present my analysis of temperature data from low-cost sensors, a forecasting model, an official station, and a reanalysis product.
1. Introduction

Temperature is a universal topic of daily conversation. It can be described simply as hot, cold, or in between. It can be given as a range, as in weather forecasts. It can be precise, as in a body temperature reading. Because temperature is so widely used and the devices are so affordable, readings tend to be trusted without much scrutiny of calibration or comparison. A datasheet or an explicit statement by the manufacturer is often taken as sufficient evidence of accuracy.

Previously, I built a low-cost weather unit measuring basic inputs including temperature, relative humidity, and light intensity. The unit costs under $20, with some sensors installed in duplicate. The data from this unit is acquired and stored locally on a home server. The unit is located in Hanoi, Vietnam, at 21.02°N, 105.83°E.

Public sources are also available. Nearby weather stations in Hanoi, such as Lang station, are assumed to have the best setup for measuring air temperature. Open APIs offer great resources for weather forecasting; one of them, for example, lists a few data sources from big, well-known models. Air quality stations, such as the one at UNIS school, display temperature and relative humidity that appear to be forecast data and need a closer look.

A third source is a "reanalysis" product. I don't have behind-the-scenes details of how reanalysis works, and I assumed that its products are the best available data for representing large-scale, comprehensive datasets. For this post, I used MERRA-2, published by NASA, as the reference data.

One goal of this post is to take a close look at various data sources using analysis packages such as pandas with Python, as a showcase for combining open, free tools with available data. In addition, knowing how close the measurements of a low-cost unit are to those of an official station and to different numerical products helps assess the accuracy of each source.

2. Methods and Materials

The low-cost unit is mounted on a balcony, away from potential local heat sources. The balcony is about 35 m above ground and faces south. The unit is in the shade. The temperature sensors in this unit are: 2x DS18B20 (Maxim Integrated Products, Inc.), 1x SHT3x (Sensirion), DHT31, Si7021, and BME280 (Bosch). The tutorial and a plan view of the sensors are presented here. The data is read by an ESP8266 microcontroller and transmitted to a home server. In addition, a sample box with a Raspberry Pi has a shield with various sensors installed, including DS18B20, HDC1080, and MPL3115A2. An AirVisual air quality monitor also has a built-in temperature sensor. A comprehensive performance test of some of these sensors is available.

Forecast data were queried through open APIs and from the UNIS School air quality website. Observational data come from Lang station, operated by the Vietnam Meteorological and Hydrological Administration. Finally, a MERRA-2 product is used as the reanalysis data to compare against the measured and forecast data. MERRA-2 is produced by NASA for Earth system analysis, with data dating back to 1980.

3. Results and Discussion
3.1 Low-cost sensor performance

The temperature data from the low-cost unit was first examined using the duplicate DS18B20 sensors, as shown in Figure 1. The DS18B20 is a truly affordable, reliable, and robust sensor.

Fig. 1: Temperature data for 2019 measured by a low-cost DIY unit. The unit is mounted on a balcony of a high-rise building.

Figure 1 gives a broad view of the monthly trend, which fits the temperature pattern in Hanoi. Figs. 2-4 are cutouts of three periods in 2019 that give a closer look at individual readings and at aggregated data such as hourly and daily averages.

Fig. 2: Temperature data in springtime, Hanoi 2019.
Fig. 3: Temperature data in summertime, Hanoi 2019
Fig. 4: Temperature data in wintertime, Hanoi 2019.

With low-cost sensors, one key disadvantage is unstable readings: the output, in this case the temperature, can fall back to a default error value such as -127. This happens when a weak electrical connection leaves no temperature reading available in the sensor's buffer (memory), so a default code is sent instead. The second problem is sensor longevity, which is a headache for capacitive sensors such as those measuring humidity. After more than one year of operation, the DS18B20 has not shown such issues.
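
Before any averaging, such error codes need to be removed, otherwise a single -127 drags an hourly mean far off. The snippet below is a minimal sketch of that cleaning step with pandas. The function name, the 5-minute spacing, and the sample readings are made up for illustration; only the -127 value comes from the text above.

```python
import pandas as pd

# -127 is the default error value mentioned in the text; treat it as a sentinel.
ERROR_VALUE = -127.0


def drop_error_readings(temps: pd.Series, error_value: float = ERROR_VALUE) -> pd.Series:
    """Replace sentinel error readings with NaN so they do not skew averages."""
    return temps.mask(temps == error_value)


# Invented readings for illustration only (not measured data).
raw = pd.Series(
    [24.5, -127.0, 24.6, 24.8],
    index=pd.date_range("2019-03-01 00:00", periods=4, freq="5min"),
)

clean = drop_error_readings(raw)
n_errors = int(clean.isna().sum())  # number of readings flagged as errors
clean_mean = clean.mean()           # mean() skips NaN by default
```

Masking to NaN rather than deleting rows keeps the time index regular, which makes later resampling and alignment simpler.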

Reproducibility is another dimension of low-cost sensors: sensors of the same type, or measuring the same parameter, should yield similar readings. Fig. 5 shows aggregated data from two DS18B20 sensors for one week.

Fig. 5: Data from DS18B20 sensors installed in duplicate

Next, I analyze the difference between the readings of these two sensors. First, Fig. 6 shows five bins of differences in temperature readings, with each x-axis label marking the average of the bin. The data in this distribution are hourly averages, and each hour includes 12 readings. Two-thirds of the readings differ by less than 0.4°C. This is in line with the manufacturer's specification, in which the DS18B20 has an accuracy of ±0.5°C from -10 to +85°C.

Fig. 6: Distribution of the difference in reading between two DS18B20 sensors.
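
The binning behind a chart like Fig. 6 can be sketched in a few lines of pandas. This is not the exact script used for the figure: the column names `sensor_a` and `sensor_b` are hypothetical, and the readings are synthetic (a daily cycle plus random noise) so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings (12 per hour) for two sensors; values are invented.
idx = pd.date_range("2019-01-01", periods=12 * 48, freq="5min")
rng = np.random.default_rng(0)
base = 20 + 5 * np.sin(np.arange(len(idx)) * 2 * np.pi / (12 * 24))
df = pd.DataFrame(
    {
        "sensor_a": base + rng.normal(0, 0.1, len(idx)),
        "sensor_b": base + 0.2 + rng.normal(0, 0.1, len(idx)),
    },
    index=idx,
)

# Hourly averages, then the absolute difference between the two sensors.
hourly = df.resample("60min").mean()
diff = (hourly["sensor_a"] - hourly["sensor_b"]).abs()

# Five equal-width bins; the bin midpoints can serve as x-axis labels.
binned = pd.cut(diff, bins=5)
counts = binned.value_counts(sort=False)
bin_centers = binned.cat.categories.mid

# Share of hourly differences below 0.4 degrees.
share_below = (diff < 0.4).mean()
```

With real data, `df` would be loaded from the home server instead of being generated, and the rest of the steps stay the same.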

Digging into the details with open tools such as pandas, we can also see how the difference between two sensors of the same type varies by hour of the day and by month, as shown in Figs. 7-8.

Fig. 7: Difference in reading between two DS18B20 sensors by hours
Fig. 8: Difference in reading between two DS18B20 sensors by months
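
Grouping by hour of day or by month takes one line each once the difference is stored as a time-indexed series. The sketch below uses a synthetic hourly series named `diff` (random values, not the measured data) purely to show the grouping.

```python
import numpy as np
import pandas as pd

# Synthetic hourly absolute differences for one year; values are invented.
idx = pd.date_range("2019-01-01", "2019-12-31 23:00", freq="60min")
rng = np.random.default_rng(1)
diff = pd.Series(np.abs(rng.normal(0.3, 0.1, len(idx))), index=idx, name="abs_diff")

# Mean difference for each hour of the day (0-23) and each month (1-12).
by_hour = diff.groupby(diff.index.hour).mean()
by_month = diff.groupby(diff.index.month).mean()
```

Calling `by_hour.plot(kind="bar")` would then give a chart in the style of Fig. 7, assuming matplotlib is installed.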

The outcome is a little counterintuitive. Higher temperatures in summer led to a smaller difference, but glaring sunlight in mid-afternoon appears to contribute to a larger difference between readings from the same type of sensor.

In a sample box, I have another DS18B20 sensor installed about 2 cm above the Raspberry Pi. The sample box has a fan on the rear end that draws air out through the front grid window, while the low-cost unit only has a window cut out of the plastic box. The Pi is a low-power System-on-Chip (SoC) device, similar to the ESP8266, the microcontroller of the low-cost weather unit, but with much more computing power. One would hypothesize that the Pi generates more heat and thus leads to a higher temperature reading. Fig. 9 presents data from the sensor on top of the Pi and two from the outside. The data do not support the hypothesis, possibly because the heat was drawn out by the fan and the Pi was running at around 10% CPU usage.

Fig. 9: Temperature readings by DS18B20 sensors

Above, I analyzed the DS18B20 data in detail. In today's market, the DS18B20 is only one of many low-cost sensors available. Some popular temperature sensors are DHT11, DHT22, BME280, SHTxx, HTU21, and SHT7i. This class of sensors also includes a capacitive element for measuring humidity, which is converted to relative humidity.

Fig. 10: Temperature readings by low-cost sensors
Fig. 11: Distribution of the difference

The distribution is right-skewed, which is desirable because most differences then fall at the low end. With a mean of 1.3°C, the distribution shows a larger variation than the 0.5°C accuracy listed for each sensor. At the same time, this value reflects the nature of low-cost sensors. For scientific observation, these drawbacks should therefore be noted and mitigated, for example with a denser grid of sensors or multiple sensors installed in one place.

3.2 Local measure vs. forecasting data

With access to open APIs, we can query global forecasting systems such as the NOAA GFS model. Extracting data directly from the GFS model output can be a daunting task with limited storage and computing power. Alternatively, some websites offer simple, programming-friendly tools to get a time series for specific coordinates. Fig. 12 shows the actual measurements (highrise) together with forecast data from DarkSky and the data from the UNIS School website.

Fig. 12: Temperature in building vs. forecasting
Fig. 13: Temperature in building vs. forecasting with the hourly average in the background

The pattern is clear here. The trend of the daily averages in the high-rise building matches that of the two forecast datasets. The absolute values, however, are distinctly different. The temperature from the forecast data is lower than the actual records. The difference can be attributed to heat retention by the building, which makes it slower to follow changes in the ambient temperature.

Fig. 14: Distribution of the difference in temperature readings between the building and the forecast, using hourly averages

On average, the temperature in the building is 3.4°C higher than the forecast, which is an important outcome.
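
An average difference like this comes from pairing the two series hour by hour and subtracting. Here is a minimal sketch of that pairing, assuming both sources are already available as hourly pandas series. The names `building` and `forecast` and all the numbers are invented for illustration and do not reproduce the result above.

```python
import pandas as pd

# Invented values for illustration; not the measured or forecast data.
building = pd.Series(
    [30.0, 31.0, 32.0, 33.0],
    index=pd.date_range("2019-07-01 00:00", periods=4, freq="60min"),
    name="building",
)
forecast = pd.Series(
    [28.0, 28.5, 29.5],
    index=pd.date_range("2019-07-01 01:00", periods=3, freq="60min"),
    name="forecast",
)

# Keep only the hours present in both series, then subtract.
paired = pd.concat([building, forecast], axis=1, join="inner")
paired["diff"] = paired["building"] - paired["forecast"]
mean_diff = paired["diff"].mean()
```

The inner join matters: hours missing from either source are dropped instead of silently producing NaN differences.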

Next, we compare the data from Láng station, operated by the Vietnam Meteorological and Hydrological Administration and taken from the upper-air dataset. In Fig. 15, the hourly data from MERRA-2 are compared with the observational values. The observational dataset contains only two points a day, but the comparison is already messy. Figs. 16-18 show snapshots over shorter periods to compare the two datasets.

Fig. 15: Observational temperature from Láng station overlaid with reanalysis temperature in the Hanoi area
Fig. 16: Observational temperature vs. reanalysis temperature in the Hanoi area, February 2019
Fig. 17: Observational temperature vs. reanalysis temperature in the Hanoi area, July 2019
Fig. 18: Observational temperature vs. reanalysis temperature in the Hanoi area, December 2019
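
Because the station data has only two points a day while the reanalysis is hourly, the hourly series has to be sampled at the observation times before the two can be compared. The sketch below shows one way to do that with `reindex`. It assumes both series carry timestamps on the same clock and at the top of the hour, which may not hold for the real files, and all values are invented.

```python
import pandas as pd

# Invented hourly reanalysis values for one day (not MERRA-2 data).
reanalysis = pd.Series(
    [15.0 + 0.25 * h for h in range(24)],
    index=pd.date_range("2019-02-01 00:00", periods=24, freq="60min"),
    name="reanalysis",
)

# Invented twice-daily station observations (not Lang station data).
station = pd.Series(
    [17.0, 21.0],
    index=pd.to_datetime(["2019-02-01 00:00", "2019-02-01 12:00"]),
    name="station",
)

# Pick the reanalysis values at the observation timestamps only.
paired = pd.DataFrame(
    {"station": station, "reanalysis": reanalysis.reindex(station.index)}
)
paired["diff"] = paired["station"] - paired["reanalysis"]
```

If the timestamps do not line up exactly, `reindex` returns NaN for those rows, which is a useful signal that time zones or time stamps need checking first.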

These three close-up snapshots indicate that the two data sources read closely at times but can be distinctly different. The general trend is that the Lang station reading is higher than the MERRA-2 value. Fig. 19 confirms this outcome by plotting the distribution of the difference in readings for 2019.

Fig. 19: The difference in temperature data between MERRA-2 and Lang Station

A difference of two degrees Celsius is significant. The value is in line with the urban heat island (UHI) effect, in which the temperature in the city is warmer than its surroundings due to surface properties and energy use in the city. However, it would be naive to assume that MERRA-2 has not accounted for UHI.

Finally, the three sets of data (local observation in a high-rise building, forecast data, and a reanalysis set) are charted on one graph, as shown in Figures 20-21. One drawback of this analysis is that I cannot specify the exact digital products behind the forecast data. The full list of the data sources of this open API is here.

Fig. 20: A gallery of local, forecast, and reanalysis data of temperature in Hanoi, 2019.
Fig. 21: Distribution of temperature value between a forecast set and reanalysis data in Hanoi, Vietnam (2019-2020).
4. Conclusion

In summary, I analyzed the dataset from one low-cost weather unit and compared it with forecast, reanalysis, and observational data. The readings from the 5 types of sensors in the low-cost unit differed by 1.3°C on average. The comparison of the data from this unit with the forecast data indicated that the temperature in the building is about 3.4°C higher than the forecast. The forecast data is on average 1.1°C higher than the reanalysis in 2019. Finally, the observational data from Lang station is about 2°C higher than MERRA-2. One implication of this analysis is that no single data source can represent the temperature for Hanoi. Any analysis should specify the dataset used and state its limitations.

▣ ▣ ▣