March 19, 2022

Cleaning Messy National Weather Service Data for Portland

Loading NWS Data

I first will try to load it without any intervention to see what it looks like. As we will see, it is quite messy in a few easy ways and a few that are a bit more tricky.

NWS <- read.csv(url("https://www.weather.gov/source/pqr/climate/webdata/Portland_dailyclimatedata.csv"))
head(NWS, 10) %>% kable() %>%
    kable_styling() %>%
    scroll_box(width = "100%", height = "500px")

Daily.Temperature.and.Precipitation.Data	X	X.1	X.2	X.3	X.4	X.5	X.6	X.7	X.8	X.9	X.10	X.11	X.12	X.13	X.14	X.15	X.16	X.17	X.18	X.19	X.20	X.21	X.22	X.23	X.24	X.25	X.26	X.27	X.28	X.29	X.30	X.31	X.32	X.33
Portland, Oregon Airport
Data Starts:			13 Oct 1940				END Year:		2019
	TX is Maximum Temperature (deg F), TX is Minimum Temperature (deg F), PR is Precipitation (inches), SN is Snowfall (inches)
	Example: High Temperature 23 October 1940 is 58 while low was 53 deg.

YR	MO		1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	AVG or Total
1940	10	TX	M	M	M	M	M	M	M	M	M	M	M	M	75	70	64	72	72	78	78	64	63	61	58	57	57	57	56	53	59	59	52	M
1940	10	TN	M	M	M	M	M	M	M	M	M	M	M	M	57	53	52	50	58	58	59	54	48	41	53	48	41	38	37	45	48	50	46	M
1940	10	PR	M	M	M	M	M	M	M	M	M	M	M	M	0.01	T	T	0	0.13	0	T	0.14	0.05	0	0.63	1.03	0	0	T	0.18	0.58	0.5	0.25	M
1940	10	SN	M	M	M	M	M	M	M	M	M	M	M	M	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

The column names are stored in the seventh row; to properly import this. In addition, there are two missing value codes: M and - that will have to be accounted for. I will use skip to skip the first 6 rows and declare two distinct values to be encoded as missing. Let’s see what we get.

NWS <- read.csv(url("https://www.weather.gov/source/pqr/climate/webdata/Portland_dailyclimatedata.csv"), skip=6, na.strings = c("M","-"))
head(NWS, 10)

##      YR MO  X   X1   X2   X3   X4   X5   X6   X7   X8   X9  X10  X11  X12  X13
## 1  1940 10 TX <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>   75
## 2  1940 10 TN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>   57
## 3  1940 10 PR <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0.01
## 4  1940 10 SN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>    0
## 5  1940 11 TX   52   53   47   55   51   58   56   50   48   47   46   45   45
## 6  1940 11 TN   40   38   36   32   42   46   46   42   35   34   35   33   34
## 7  1940 11 PR 0.17 0.02    T    0 0.07 0.28 0.85 0.29 0.02 0.01 0.01    0    0
## 8  1940 11 SN    0    0    0    0    0    0    0    0    0    0    0    0    0
## 9  1940 12 TX   51   53   52   51   56   54   50   51   48   50   46   45   43
## 10 1940 12 TN   42   40   42   42   44   37   34   35   32   26   34   28   27
##    X14 X15 X16  X17  X18 X19  X20  X21 X22  X23  X24 X25 X26  X27  X28  X29 X30
## 1   70  64  72   72   78  78   64   63  61   58   57  57  57   56   53   59  59
## 2   53  52  50   58   58  59   54   48  41   53   48  41  38   37   45   48  50
## 3    T   T   0 0.13    0   T 0.14 0.05   0 0.63 1.03   0   0    T 0.18 0.58 0.5
## 4    0   0   0    0    0   0    0    0   0    0    0   0   0    0    0    0   0
## 5   47  53  49   46   49  46   49   50  44   42   44  51  44   45   59   57  45
## 6   33  28  27   36   30  29   36   33  28   37   35  37  36   38   43   40  39
## 7    0   0   0 0.29 0.01   0 0.37    T   0 0.12 0.62   0   0 0.51 0.89    T   T
## 8    0   0   0    0    0   0    0    0   0    0    0   0   0    0    0    0   0
## 9   40  39  39   41   41  45   46   62  60   56   53  54  45   50   51   43  44
## 10  25  29  33   35   34  35   41   39  39   42   42  42  40   38   36   35  37
##     X31 AVG.or.Total
## 1    52         <NA>
## 2    46         <NA>
## 3  0.25         <NA>
## 4     0            0
## 5  <NA>         49.1
## 6  <NA>         35.9
## 7  <NA>         4.53
## 8  <NA>            0
## 9    45         48.5
## 10   32           36

Two other things are of note. The first one is that R really doesn’t like columns to be named as numbers so we have an X in front of the numeric days. The second is that the column denoting which variable the rows represent is now X. Let me rename X to be Variable.

NWS <- read.csv(url("https://www.weather.gov/source/pqr/climate/webdata/Portland_dailyclimatedata.csv"), skip=6, na.strings = c("M","-")) %>% 
  rename(Variable = X)
str(NWS)

## 'data.frame':    3808 obs. of  35 variables:
##  $ YR          : int  1940 1940 1940 1940 1940 1940 1940 1940 1940 1940 ...
##  $ MO          : int  10 10 10 10 11 11 11 11 12 12 ...
##  $ Variable    : chr  "TX" "TN" "PR" "SN" ...
##  $ X1          : chr  NA NA NA NA ...
##  $ X2          : chr  NA NA NA NA ...
##  $ X3          : chr  NA NA NA NA ...
##  $ X4          : chr  NA NA NA NA ...
##  $ X5          : chr  NA NA NA NA ...
##  $ X6          : chr  NA NA NA NA ...
##  $ X7          : chr  NA NA NA NA ...
##  $ X8          : chr  NA NA NA NA ...
##  $ X9          : chr  NA NA NA NA ...
##  $ X10         : chr  NA NA NA NA ...
##  $ X11         : chr  NA NA NA NA ...
##  $ X12         : chr  NA NA NA NA ...
##  $ X13         : chr  "75" "57" "0.01" "0" ...
##  $ X14         : chr  "70" "53" "T" "0" ...
##  $ X15         : chr  "64" "52" "T" "0" ...
##  $ X16         : chr  "72" "50" "0" "0" ...
##  $ X17         : chr  "72" "58" "0.13" "0" ...
##  $ X18         : chr  "78" "58" "0" "0" ...
##  $ X19         : chr  "78" "59" "T" "0" ...
##  $ X20         : chr  "64" "54" "0.14" "0" ...
##  $ X21         : chr  "63" "48" "0.05" "0" ...
##  $ X22         : chr  "61" "41" "0" "0" ...
##  $ X23         : chr  "58" "53" "0.63" "0" ...
##  $ X24         : chr  "57" "48" "1.03" "0" ...
##  $ X25         : chr  "57" "41" "0" "0" ...
##  $ X26         : chr  "57" "38" "0" "0" ...
##  $ X27         : chr  "56" "37" "T" "0" ...
##  $ X28         : chr  "53" "45" "0.18" "0" ...
##  $ X29         : chr  "59" "48" "0.58" "0" ...
##  $ X30         : chr  "59" "50" "0.5" "0" ...
##  $ X31         : chr  "52" "46" "0.25" "0" ...
##  $ AVG.or.Total: chr  NA NA NA "0" ...

It is disappointing that everything is stored as character type. That will prove advantageous in one respect because there is some /A garbage embedded in two of the variables (SN and PR). Here, I will ask R to find all columns that are stored as character and ask it to remove the string.

NWS <- NWS %>% mutate(across(where(is.character), ~str_remove(.x, "/A")))

Now, we will have to fix the values T in the precipitation and snow variables [which are currently stored in repeated rows]. Nevertheless, this should give me what I need to create the monthly data.

Monthly Data

Now I will have to facilitate type conversions; the data are currently character instead of numeric. Looking at the data, I think it is reasonable, though not with total certainty, that they should take values bigger than 0 but smaller than the smallest measured values for that Variable; in this case, the smallest values that I note are 0.01. I will (somewhat) arbitrarily set them to 0.005. With that in mind, let’s see if we can create the monthly data by adjusting those values and selecting only the year, the month. the variable, and AVG.or.Total – the last column that contains the average high/low and the total snowfall and precipitation for that month of that year. Finally, let me change the type of that last column to numeric.

# Now to create a Monthly time series.  The first step is to select the four columns 
NWS.Monthly.Base <- NWS %>% 
  select(YR, MO, Variable, AVG.or.Total) %>% 
  mutate(AVG.or.Total = recode(AVG.or.Total, T = "O.005", `T/A` = "O.005")) %>% # Replace the trace amounts with numbers as they are not missing nor are they zero.
  mutate(AVG.or.Total = stringr::str_replace_all(AVG.or.Total, "/A", "")) %>%
  mutate(AVG.or.Total = as.numeric(AVG.or.Total)) # Turn it into numeric
summary(NWS.Monthly.Base)  # Show a summary

##        YR             MO           Variable          AVG.or.Total  
##  Min.   :1940   Min.   : 1.000   Length:3808        Min.   : 0.00  
##  1st Qu.:1960   1st Qu.: 3.750   Class :character   1st Qu.: 0.78  
##  Median :1980   Median : 7.000   Mode  :character   Median :33.25  
##  Mean   :1980   Mean   : 6.508                      Mean   :28.55  
##  3rd Qu.:2000   3rd Qu.:10.000                      3rd Qu.:52.50  
##  Max.   :2020   Max.   :12.000                      Max.   :88.10  
##                                                     NA's   :112

That seems to work. Two more steps to get a clean monthly dataset. The last step is to create four variables named TX, TN, PR, and SN as unique data.frames. To make merging them by a common index easier, I will use paste to create a character variable of the data and then turn it into a Date: Month.YR.

TX <- NWS.Monthly.Base %>% filter(Variable == "TX") %>% 
    mutate(Month.Yr = as.Date(paste0(MO,"-14-",YR), format = "%m-%d-%Y"),
           TX = AVG.or.Total) %>%
    select(-c(Variable,AVG.or.Total))
TN <- NWS.Monthly.Base %>% filter(Variable == "TN") %>% 
    mutate(Month.Yr = as.Date(paste0(MO,"-14-",YR), format = "%m-%d-%Y"),
           TN = AVG.or.Total) %>% 
    select(-c(Variable,AVG.or.Total))
PR <- NWS.Monthly.Base %>% filter(Variable == "PR") %>% 
    mutate(Month.Yr = as.Date(paste0(MO,"-14-",YR), format = "%m-%d-%Y"),
           PR = AVG.or.Total) %>%
    select(-c(Variable,AVG.or.Total))
SN <- NWS.Monthly.Base %>% filter(Variable == "SN") %>% 
    mutate(Month.Yr = as.Date(paste0(MO,"-14-",YR), format = "%m-%d-%Y"),
           SN = AVG.or.Total) %>%
    select(-c(Variable,AVG.or.Total))

The last step is to merge the clean individual series together; I will use left_join.

NWS.Monthly.Clean <- left_join(TX,TN) %>% left_join(.,PR) %>% left_join(.,SN)

Woo hoo! One down and one to go. But, as we should’ve noticed before, their totals aren’t actually great.

Maximum Temperature

NWS.Monthly.Clean %>% ggplot() + aes(x=Month.Yr, y=TX) + geom_line()

Minimum Temperature

NWS.Monthly.Clean %>% ggplot() + aes(x=Month.Yr, y=TN) + geom_line()

Snow

NWS.Monthly.Clean %>% ggplot() + aes(x=Month.Yr, y=SN) + geom_line()

Precipitation

NWS.Monthly.Clean %>% ggplot() + aes(x=Month.Yr, y=PR) + geom_line()