Pandas in practice | Cleaning out-of-order station precipitation data


This post starts with a practical use of pandas for processing observation data; an introductory pandas tutorial series will follow in later posts. Interested readers can follow and star the account to get new posts as soon as they are published. The main text follows.

  • Problem description:

We have a set of station precipitation observations. Each file holds the data for one station and covers the period from April 17 to April 24, 2020. Taking observation station 51004 as an example, the contents of the text file are shown in the following figure:

The columns of the observation file are station name, station number, year, month, day, hour, latitude, longitude, and observed precipitation. Two serious problems with this data are easy to spot:

  1. The observation records are not in chronological order
  2. The observation times are discontinuous, so some records are missing

The goal is to use pandas to produce observation data that is sorted by time and continuous in time, with a missing value at every time that has no observation.

  • Solution:
  1. Import the pandas library, read the station data, and print a preview
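
As a minimal sketch of this step: the real file layout is only shown as a figure, so the whitespace delimiter, the absence of a header row, the column names, and the sample values below are all assumptions made for illustration. An in-memory buffer stands in for the station file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for one station file. The station name,
# coordinates and precipitation values are made up for illustration.
sample = io.StringIO(
    "StationA 51004 2020 4 17 3 40.0 90.0 0.2\n"
    "StationA 51004 2020 4 17 1 40.0 90.0 0.0\n"
    "StationA 51004 2020 4 17 4 40.0 90.0 1.5\n"
)

# Illustrative column names, following the column order described in the text.
cols = ["name", "sta", "year", "month", "day", "hour", "lat", "lon", "rain"]

# Assumption: fields are separated by whitespace and there is no header row.
df = pd.read_csv(sample, sep=r"\s+", header=None, names=cols)

print(df.head())
```

For a real file, the buffer would be replaced by the file path, and the separator adjusted to match the actual layout.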

  1. Combine the year, month, day, and hour columns, convert them to datetime objects with pd.to_datetime, and set the result as the index
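
One way to do this, sketched on a small made-up frame: pd.to_datetime can assemble timestamps from a DataFrame whose columns are named year, month, day (and optionally hour). The column names and values here are illustrative assumptions, not the original file's.

```python
import pandas as pd

# Made-up records in scrambled order; column names are assumptions.
df = pd.DataFrame(
    {
        "sta": [51004, 51004, 51004],
        "year": [2020, 2020, 2020],
        "month": [4, 4, 4],
        "day": [17, 17, 17],
        "hour": [3, 1, 4],
        "rain": [0.2, 0.0, 1.5],
    }
)

# to_datetime assembles one timestamp per row from the named columns.
df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])
df.index.name = "time"

print(df)
```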

  1. Keep only the required columns

  1. Sort the data by index (time)
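
Column selection and sorting can be sketched together as follows. Which columns count as "required" is not stated in the text, so the selection below (station number, coordinates, precipitation) is an assumption, as are the column names and values.

```python
import pandas as pd

# Made-up frame that already has a datetime index, still in scrambled order.
idx = pd.to_datetime(
    ["2020-04-17 03:00", "2020-04-17 01:00", "2020-04-17 04:00"]
)
df = pd.DataFrame(
    {
        "name": ["StationA"] * 3,
        "sta": [51004] * 3,
        "year": [2020] * 3,
        "lat": [40.0] * 3,
        "lon": [90.0] * 3,
        "rain": [0.2, 0.0, 1.5],
    },
    index=idx,
)

# Keep only the columns needed downstream (an assumed selection),
# then sort the rows by the datetime index.
df = df[["sta", "lat", "lon", "rain"]].sort_index()

print(df)
```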

  1. Generate the target time range

  1. Reindex the DataFrame to the target time range; NaN is filled in automatically at the times with no observation
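
A minimal sketch of these two steps, assuming hourly observations (the file has an hour column, but the interval is not stated explicitly). A short six-hour window is used so the result is easy to inspect; for the real data the start and end would be the first and last hours of the April 17 to April 24, 2020 period.

```python
import pandas as pd

# Made-up, already sorted records with gaps at 00:00, 02:00 and 05:00.
idx = pd.to_datetime(
    ["2020-04-17 01:00", "2020-04-17 03:00", "2020-04-17 04:00"]
)
df = pd.DataFrame({"sta": [51004] * 3, "rain": [0.0, 0.2, 1.5]}, index=idx)

# Complete, evenly spaced target range (hourly spacing is an assumption).
target = pd.date_range(
    "2020-04-17 00:00", "2020-04-17 05:00", freq=pd.Timedelta(hours=1)
)

# reindex keeps the rows whose timestamps exist and inserts NaN rows elsewhere.
df = df.reindex(target)

print(df)
```

Note that reindexing introduces NaN into every column, including integer ones such as the station number, which become floating point as a result.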

  1. The strings in the time column are now incomplete, because the newly added rows have no value. Regenerate them from the index

  1. Each file holds the records of a single station, so fill in the missing values of STA, lat and lon

  1. Write the output file, replacing missing values with 99999
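
The two repair steps above might look like the sketch below. The time-string format ("%Y%m%d%H"), the column names, and the values are assumptions. Station number and coordinates are filled with their first valid value, which relies on the file describing a single station.

```python
import numpy as np
import pandas as pd

# Made-up frame as it might look after reindexing: rows 01:00 and 03:00
# were inserted and are entirely NaN.
target = pd.date_range(
    "2020-04-17 00:00", periods=4, freq=pd.Timedelta(hours=1)
)
df = pd.DataFrame(
    {
        "time": ["2020041700", np.nan, "2020041702", np.nan],
        "sta": [51004, np.nan, 51004, np.nan],
        "lat": [40.0, np.nan, 40.0, np.nan],
        "lon": [90.0, np.nan, 90.0, np.nan],
        "rain": [0.0, np.nan, 0.2, np.nan],
    },
    index=target,
)

# Rebuild the time strings from the (now complete) datetime index.
df["time"] = df.index.strftime("%Y%m%d%H")

# Station metadata is constant within one file, so reuse the first valid value.
for col in ["sta", "lat", "lon"]:
    df[col] = df[col].fillna(df[col].dropna().iloc[0])

# The station number became float when NaN rows appeared; restore integers.
df["sta"] = df["sta"].astype(int)

print(df)
```

The precipitation column is deliberately left untouched, so the missing observations stay NaN.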
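
A sketch of the output step, using the na_rep argument of DataFrame.to_csv to write 99999 wherever a value is missing. Calling fillna(99999) before writing would be an alternative. The space separator, the absence of a header, and the column layout are assumptions; an in-memory buffer stands in for the output file.

```python
import io

import numpy as np
import pandas as pd

# Made-up cleaned frame with one missing precipitation value.
df = pd.DataFrame(
    {
        "time": ["2020041700", "2020041701"],
        "sta": [51004, 51004],
        "rain": [0.0, np.nan],
    }
)

# Write to a buffer here; pass a file path instead to write to disk.
buf = io.StringIO()
df.to_csv(buf, sep=" ", index=False, header=False, na_rep="99999")

text = buf.getvalue()
print(text)
```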

Final file format