RegEx

August 18, 2010

I’m in the process of parsing some data out of Environment Canada forecasts in .txt format.  Here’s an example of how they look:

FPCN11 CWVR 011746
FORECASTS FOR THE SOUTH COAST OF BRITISH COLUMBIA ISSUED BY
ENVIRONMENT CANADA AT 11.00 AM PDT THURSDAY 1 JUNE 2006 FOR TODAY AND
FRIDAY.
THE NEXT SCHEDULED FORECAST WILL BE ISSUED AT 4.00 PM.
GREATER VANCOUVER.
TODAY..CLOUDY WITH SUNNY PERIODS. SHOWERS BEGINNING NEAR NOON.
HIGH 17. UV INDEX 6 OR HIGH.
TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE
OF SHOWERS. LOW 13.
FRIDAY..CLOUDY WITH SUNNY PERIODS. 30 PERCENT CHANCE OF SHOWERS IN
THE MORNING. HIGH 19.
GREATER VICTORIA.
TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.
TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE
OF SHOWERS. LOW 11.
FRIDAY..CLOUDY. 30 PERCENT CHANCE OF SHOWERS IN THE MORNING. HIGH 19.
FRASER VALLEY.
TODAY..CLOUDY. SHOWERS BEGINNING LATE THIS MORNING. HIGH 21. UV
INDEX 3 OR MODERATE.
TONIGHT..SHOWERS. LOW 13.
FRIDAY..CLOUDY WITH 60 PERCENT CHANCE OF SHOWERS. HIGH 20.
HOWE SOUND.
TODAY..CLOUDY WITH SUNNY PERIODS. RAIN BEGINNING THIS AFTERNOON.
HIGH 19. UV INDEX 7 OR HIGH.
TONIGHT..PERIODS OF RAIN. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS AND 60 PERCENT CHANCE OF SHOWERS.
HIGH 19.
WHISTLER.
TODAY..CLOUDY. RAIN BEGINNING THIS AFTERNOON. HIGH 19. UV INDEX 3 OR
MODERATE.
TONIGHT..PERIODS OF RAIN. LOW 7.
FRIDAY..PERIODS OF RAIN ENDING LATE IN THE DAY THEN CLOUDY. HIGH 16.
SUNSHINE COAST.
TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.
TONIGHT..SHOWERS. LOW 13.
FRIDAY..CLOUDY. HIGH 17.
SOUTHERN GULF ISLANDS.
TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.
TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE
OF SHOWERS. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS. 30 PERCENT CHANCE OF SHOWERS IN
THE MORNING. HIGH 20.
EAST VANCOUVER ISLAND.
TODAY..SHOWERS. WINDY. HIGH 18. UV INDEX 3 OR MODERATE.
TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY. LOW 11.
FRIDAY..CLOUDY WITH SUNNY PERIODS AND 60 PERCENT CHANCE OF SHOWERS.
HIGH 18.
WEST VANCOUVER ISLAND.
TODAY..PERIODS OF RAIN. AMOUNT 5 MM. FOG PATCHES DISSIPATING LATE
THIS MORNING. WINDY. HIGH 15.
TONIGHT..PERIODS OF RAIN. AMOUNT 10 MM. WINDY. LOW 10.
FRIDAY..PERIODS OF RAIN. AMOUNT 5 TO 10 MM. BECOMING WINDY. HIGH 15.
INLAND VANCOUVER ISLAND.
TODAY..PERIODS OF RAIN. HIGH 19.
TONIGHT..PERIODS OF RAIN. AMOUNT 10 MM. LOW 11.
FRIDAY..PERIODS OF RAIN. AMOUNT 10 TO 15 MM. HIGH 18.
END/LAK

FPCN11 CWVR 011746FORECASTS FOR THE SOUTH COAST OF BRITISH COLUMBIA ISSUED BYENVIRONMENT CANADA AT 11.00 AM PDT THURSDAY 1 JUNE 2006 FOR TODAY ANDFRIDAY.THE NEXT SCHEDULED FORECAST WILL BE ISSUED AT 4.00 PM.
GREATER VANCOUVER.TODAY..CLOUDY WITH SUNNY PERIODS. SHOWERS BEGINNING NEAR NOON. HIGH 17. UV INDEX 6 OR HIGH.TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE OF SHOWERS. LOW 13.FRIDAY..CLOUDY WITH SUNNY PERIODS. 30 PERCENT CHANCE OF SHOWERS IN THE MORNING. HIGH 19.
GREATER VICTORIA.TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE OF SHOWERS. LOW 11.FRIDAY..CLOUDY. 30 PERCENT CHANCE OF SHOWERS IN THE MORNING. HIGH 19.
FRASER VALLEY.TODAY..CLOUDY. SHOWERS BEGINNING LATE THIS MORNING. HIGH 21. UV INDEX 3 OR MODERATE.TONIGHT..SHOWERS. LOW 13.FRIDAY..CLOUDY WITH 60 PERCENT CHANCE OF SHOWERS. HIGH 20.
HOWE SOUND.TODAY..CLOUDY WITH SUNNY PERIODS. RAIN BEGINNING THIS AFTERNOON. HIGH 19. UV INDEX 7 OR HIGH.TONIGHT..PERIODS OF RAIN. LOW 12.FRIDAY..CLOUDY WITH SUNNY PERIODS AND 60 PERCENT CHANCE OF SHOWERS. HIGH 19.
WHISTLER.TODAY..CLOUDY. RAIN BEGINNING THIS AFTERNOON. HIGH 19. UV INDEX 3 OR MODERATE.TONIGHT..PERIODS OF RAIN. LOW 7.FRIDAY..PERIODS OF RAIN ENDING LATE IN THE DAY THEN CLOUDY. HIGH 16.
SUNSHINE COAST.TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.TONIGHT..SHOWERS. LOW 13.FRIDAY..CLOUDY. HIGH 17.
SOUTHERN GULF ISLANDS.TODAY..SHOWERS. HIGH 18. UV INDEX 3 OR MODERATE.TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY WITH 30 PERCENT CHANCE OF SHOWERS. LOW 12.FRIDAY..CLOUDY WITH SUNNY PERIODS. 30 PERCENT CHANCE OF SHOWERS IN THE MORNING. HIGH 20.
EAST VANCOUVER ISLAND.TODAY..SHOWERS. WINDY. HIGH 18. UV INDEX 3 OR MODERATE.TONIGHT..SHOWERS ENDING OVERNIGHT THEN CLOUDY. LOW 11.FRIDAY..CLOUDY WITH SUNNY PERIODS AND 60 PERCENT CHANCE OF SHOWERS. HIGH 18.
WEST VANCOUVER ISLAND.TODAY..PERIODS OF RAIN. AMOUNT 5 MM. FOG PATCHES DISSIPATING LATE THIS MORNING. WINDY. HIGH 15.TONIGHT..PERIODS OF RAIN. AMOUNT 10 MM. WINDY. LOW 10.FRIDAY..PERIODS OF RAIN. AMOUNT 5 TO 10 MM. BECOMING WINDY. HIGH 15.
INLAND VANCOUVER ISLAND.TODAY..PERIODS OF RAIN. HIGH 19.TONIGHT..PERIODS OF RAIN. AMOUNT 10 MM. LOW 11.FRIDAY..PERIODS OF RAIN. AMOUNT 10 TO 15 MM. HIGH 18.
END/LAK

The code has come along more easily than I thought it would, with many thanks to Regex Buddy (highly recommended if you’re ever trying to wrap your head around regular expressions).  The most frustrating challenge has actually been accounting for changes in the structure of the forecast data from year to year.

  • In 2004 (I originally wrote the code for this file) the forecasts were at 5AM, 10AM and 4PM daily.
  • The beginning of a scheduled forecast started with simply “AT 5.00 AM”.
  • The beginning of the Greater Vancouver forecast started with “GREATER VANCOUVER”
  • The beginning of the Fraser Valley forecast started with “LOWER FRASER VALLEY”.
  • In 2005 the  Fraser Valley forecast started with either “FRASER VALLEY” or “FRASER VALLEY WEST”.
  • In 2006 the forecasts were at 5AM, 11AM and 4PM daily
  • The beginning of a scheduled forecast started with ” ENVIRONMENT CANADA AT 5.00 AM”.
  • 2007 remained blessedly the same as 2006
  • In 2008 the Greater Vancouver forecast started with “METRO VANCOUVER”
  • In 2009 Environment Canada changed the 4PM forecast to predict values for both “TOMORROW..: and “TOMORROW NIGHT..”

And this is just how things go.  I simply had to further generalize the code for each new year that I tried to run (except the blessed 2007) and I’m sure that I’ll have to do more work on it down the line.  Definitely a pain in the ass and definitely a great learning experience.

The Legend of R Legends

July 27, 2010

This is a reminder to myself, to be immortalized so that I NEVER have to struggle with it again.  When making an R legend with points and lines, the command to give POINT, LINE, LINE, POINT (for example) is pch = c(1, NA, NA, 1) and lty = (NA, 1, 1, NA).  For some reason I always think it is NULL rather than NA and I always end up being confused.  No more!

Different Definitions

July 27, 2010

We had a group meeting last week about the ongoing counterfactual debate.  It was good…conversation was had and decisions were made.  We will start off all analyses with a global counterfactual value of zero, and then we’ll examine some different scenarios like a minimum annual average, a 50% overall reduction in emissions, etc.  As part of this discussion Ruth Defries suggested a new approach for the calculation of a minimum annual average.  Instead of taking the minimum 12-month running average (as I had already done), take the minimum January, February, etc. and average them together.  I did that on the train this morning, and here’s how they compare:

New = TRUE

July 8, 2010

I like to plot complicated things in R and I’m pretty good at it.  One of my favorite tricks is to pile a whole bunch of plots on top of each other by manipulating the fig() argument of par().  I first learned this in SPLUS, and it hasn’t transferred as smoothly to R as I had hoped, mainly because any invocation of fig() wipes the device without a “new = T” argument also passed to par().  Even knowing this, I was still having difficulty getting things to look the way I wanted them, and today I finally figured out why — after a new par(fig()) command R wants a new high-level plot first and foremost, even if it’s empty.  Sometimes *really* reading the help files actually helps.

Nisga’a

July 7, 2010

This is the name of one of many of the First Nations of British Columbia.  And also the reason why it took me nearly two hours to import a fairly straight-forward data set into R this morning.  About 20 of the 56000 entries pertained to the Nisga’a region, and that apostrophe really, really screwed up the reading of the otherwise comma-delimited data.  I’m not sure of the mechanics, bur R essentially skipped all entries between each appearance of “Nisga’a”, reducing the total number of records read to ~27000.  The first skip was at record ~1800 and the last skip was at ~49000, so the head() and tail() of the file look just fine, as did all of the ~5 random unique IDs that I checked.  I eventually read in the unique IDs alone and compared the ID field of the imported set using setdiff() to find the first incidence of trouble.  After that the situation became clear.  Never a dull moment!

Days Over Days

June 25, 2010

The Scenario 1 estimates require me to estimate the number of annual days over a set of given concentrations (the set I’m starting out with is 3, 5, 10, 15, 20, 30, 40, 50, 75, 100, 150, 200, 250 and 300 — this is open to revisitation down the line).  Now, if there are 23 days with a concentration greater than 20 and 12 days with a concentration greater than 30, 12 of the 23 days need to be removed from the 20 ug/m3 estimate.  For example, take row 2085 of my days_over data for the modeled estimates:

[1] 129  94  28  18  17  15  14  13  13  11  10   9   8

Apply the following:

for (i in 2:length(concs)){
days_over[,i-1] = days_over[,i-1] – days_over[,i]
}

And get this result:

[1] 35 66 10  1  2  1  1  0  2  1  1  1  8

What I see from this is that the lower concentrations need to be better fleshed out and we don’t need so much definition through the higher concentrations.  I also see that my code is working as expected, which is better than I can say about it at this time last week.

These figures aren’t as carefully plotted as they should be (pic a sig fig!), but you get the picture.  What I’ve done here is:

  1. Calculate the minimum 12-month running average from the 2001-2006 monthly time series in each cell.
  2. Taken the 99th percentile and higher as the HIGH category, and assigned the mean value of all cells in the category.
  3. Taken the 97th to 99th percentile as the MID-HIGH category, ditto.
  4. Taken the 90th to 97th percentile as the MID-LOW category.
  5. Everything else is low.

Timelines

June 17, 2010

One persistent challenge with the GBD analyses is the mismatch between modeled and remote sensing PM data.  The former spans 1997 through 2006 while the latter is limited to 2001 through 2006.  This is mostly problematic because 1997 through 2000 were funny years in terms of the southern oscillation, with a very strong ENSO year and a very strong LNSO year.  On this morning’s train ride I think I have decided to limit the main analyses to 2001-2006 so that the results are comparable across metrics, then we can use the modeled 1997-2000 data to investigate a few special cases.  This requires some re-jigging of the data on my behalf, but it will be worth it in the end.

The next steps are to:

  • Calculate 12-month minimums for Model, MODIS and MISR for the 2001-2006 period
  • Spatial smoothing of the results to get PM-specific counterfactual surfaces for S2 and S3 calculations
  • Recalculate base-case estimates for S2 and S3 (new counterfactual, max PM = 50 ug/m3) for the 2001-2006 average
  • Make ENSO estimates for each
  • Recalculate the S1 running quantiles for Model, MODIS and MISR
  • Re-estimate the number of days over the target set of concentrations
  • Re-estimate and plot S1 values

After a long talk with Mike last week I am back to humming and hawing over the frakking counterfactual question.  In general Mike liked the idea of using a strong La Nina year, but felt that it wasn’t strict enough.  If the value is 20 ug/m3 in one cell and only 15 ug/m3 in a neighbouring cell, what rationale is there for using 20 when 15 is obviously achievable in the region?  He has a point.  And I have a point by arguing that some values simply aren’t achievable in some areas, due to their fire regimes – natural or otherwise.  Evening though a value of 5 might be *theoretically* possible in a cell with a La Nina value of 20, it isn’t practically possible.  So I went back to the drawing board and calculated the absolute 12-month running average for each cell in the study area, which gives this:

CropperCapture[2]

Compare it to these La Nina values:

CropperCapture[3] 

The absolutely minimum running average is better, I think…more reasonable.  The next step would be to do some kind of spatial smoothing on the values to get more regional counterfactuals.

More on Ratios

May 14, 2010

This is actually my last day at the Menzies in Hobart, and I should be packing up my desk etc. etc. but this problem is WAY more interesting than those banalities.  Here is the ratio of LNSO to 2001-2006 averages for the MODEL data…note how everything is very flat across central Africa.

Multiplying the 2001-2006 MODIS average by this is going to give LNSO values in central Africa that are similar to the 2001-2006 values.  This is partially realistic, because concentrations are just always high in Central Africa due mostly to regular burning of the savanna.  On the other hand, it doesn’t account for the differences we see between MODEL and MODIS/MISR estimates in this area:

So what happens if, instead of applying the LNSO:2001-2006 MODEL ratio to MODIS to estimate the MODIS LNSO, I instead apply the MODEL:MODIS ratio to the MODEL LNSO values to estimate the MODIS LNSO.  I’ve still got to think about this, but sometime after I’ve actually cleaned up my desk, etc. etc.

Follow

Get every new post delivered to your Inbox.