Abstract:

Accurate prediction of cherry blossom peak bloom is essential for ecological research, cultural planning, and tourism, particularly under changing climate conditions. This study introduces a machine learning framework that integrates daily climate data with feature engineering to forecast cherry blossom timing at five locations: Washington D.C., Kyoto, Liestal, New York City, and Vancouver. Data from NOAA and archived AccuWeather pages served as the foundation for deriving key phenological predictors from daily temperature observations, including daily average (TAVG), minimum (TMIN), and maximum (TMAX) temperatures. A 7-day rolling average of TAVG was computed to smooth day-to-day fluctuations and identify the dormancy release date, defined as the first day on which the rolling average exceeds a calibrated threshold. From these daily series, the dormancy day-of-year, the cumulative growing degree days (cumGDD) from January 1 until dormancy release, and the accumulated winter chill (the number of days with mean temperatures below a set threshold during a defined winter period) were derived. In addition, spring warmth was quantified as the cumulative GDD from the dormancy release date through the end of March to capture the forcing effect on bud development.

Historical bloom observations, expressed as bloom day-of-year, were merged with these predictors to construct a comprehensive dataset. An XGBoost regression model was then employed, with hyperparameters optimized via repeated cross-validation to capture complex non-linear relationships between climate drivers and bloom timing. The final model forecasts the 2025 bloom day-of-year for each location, with the predicted values subsequently converted to calendar dates.

Remarkably, all aspects of this study—including the conceptual framework, methodological design, and code implementation—were generated by ChatGPT, ensuring a reproducible and innovative approach to phenological forecasting. This multi-location, daily climate-based framework provides valuable insights into the influence of climate variability on future cherry blossom timing.

1. Setup and Library Loading

This section sets up the R Markdown environment by configuring chunk options (to control echoing, warnings, and figure alignment) and then loads all required libraries. These packages include tools for data manipulation, date handling, web scraping, and model building.

library(tidyverse)
library(zoo)
library(dplyr)
library(lubridate)
library(tidymodels)  # for the advanced modeling workflow
library(rvest)
library(xgboost)

2. Loading the Data

The following code loads the cherry blossom data from five CSV files corresponding to different locations. The datasets are combined into a single data frame for further analysis.

setwd("peak-bloom-prediction-main/data/")
cherry <- read.csv("washingtondc.csv") |> 
  bind_rows(read.csv("liestal.csv")) |> 
  bind_rows(read.csv("kyoto.csv")) |> 
  bind_rows(read.csv("vancouver.csv")) |> 
  bind_rows(read.csv("nyc.csv"))

3. Adding Covariates from NOAA

This section contains code (with hidden output) to set up the NOAA API token and load additional packages required to retrieve weather covariates. The NOAA station IDs for the relevant locations are defined here.

library(httr2)
library(jsonlite)
NOAA_API_BASE_URL <- "https://www.ncei.noaa.gov/cdo-web/api/v2/data"

# Define the station IDs for the specified locations
stations <- c(
  "washingtondc" = "GHCND:USW00013743",
  "vancouver"    = "GHCND:CA001108395",
  "newyorkcity"  = "GHCND:USW00014732",
  "liestal"      = "GHCND:SZ000001940",
  "kyoto"        = "GHCND:JA000047759")

4. Retrieving and Reshaping Historical Temperature Data

The next code block defines helper functions (nested_to_tibble and get_daily_avg_temp) to retrieve daily temperature data from NOAA and reshape the nested list returned by the API into a tidy tibble format.

nested_to_tibble <- function (x) {
  # Determine the variable names in the response
  variable_names <- map(x, names) |> 
    unlist(use.names = FALSE) |> 
    unique()
  
  names(variable_names) <- variable_names

  # Reshape the response from a nested list into a table
  map(variable_names, \(i) {
    map(x, \(y) {
      if (is.null(y[[i]])) {
        NA_character_
      } else {
        y[[i]]
      }
    }) |> 
      unlist(use.names = FALSE)
  }) |> 
    as_tibble()
}

get_daily_avg_temp <- function(station_id, start_date, end_date,
                               api_key, base_url, window_size = 300) {
  windows <- seq(as_date(start_date),
                 as_date(end_date) + days(window_size + 1),
                 by = sprintf("%d days", window_size))
  
  batches <- map2(windows[-length(windows)], windows[-1] - days(1), \(from, to) {
    if (from > Sys.Date()) {
      return(NULL)
    }
    response <- tryCatch(
      request(base_url) |> 
        req_headers(token = api_key) |> 
        req_url_query(
          datasetid = "GHCND",
          stationid = station_id,
          datatypeid = "TAVG,TMAX,TMIN",
          startdate = from,
          enddate = min(as_date(to), Sys.Date()),
          units = "metric",
          limit = 1000
        ) |> 
        req_retry(max_tries = 10) |> 
        req_perform() |> 
        resp_body_json(),
      
      httr2_http = \(cnd) {
        rlang::warn(sprintf("Failed to retrieve data for station %s in time window %s--%s",
                            station_id, from, to),
                    parent = cnd)
        NULL
      })
  })
  
  map(batches, \(x) nested_to_tibble(x$results)) |> 
    list_rbind() |> 
    mutate(date = as_date(date))
}
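
As a quick illustration, the helper can be called directly for a single station (a sketch; the station ID comes from the table above, and NOAA_WEB_API_TOKEN must already be set):

# Example: retrieve daily temperatures for Kyoto from 2020 onward.
kyoto_temps <- get_daily_avg_temp(
  station_id = stations[["kyoto"]],
  start_date = "2020-01-01",
  end_date   = Sys.Date(),
  api_key    = NOAA_WEB_API_TOKEN,
  base_url   = NOAA_API_BASE_URL
)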

5. Building the Historical Temperature Dataset

This code retrieves historical temperature data for each location using the get_daily_avg_temp() function and then reshapes the data into a wide format.

historic_temperatures <- cherry |> 
  group_by(location) |> 
  summarize(start_date = sprintf('%d-01-01', pmax(1970, min(year)) - 1)) |> 
  left_join(tibble(location = names(stations),
                   station_id = stations),
            by = 'location') |> 
  group_by(location) |> 
  group_modify(\(x, gr) {
    get_daily_avg_temp(station_id = x$station_id,
                       start_date = x$start_date,
                       end_date = Sys.Date(),
                       api_key = NOAA_WEB_API_TOKEN,
                       base_url = NOAA_API_BASE_URL)
  })

climate_wide <- historic_temperatures %>%
  pivot_wider(
    names_from = datatype,
    values_from = value
  )

6. Preparing Daily Climate Data

This section processes the reshaped historical climate data: missing TAVG values are imputed as the mean of TMIN and TMAX, daily growing degree days (GDD) are computed with a 0°C base (for example, a day with TAVG = 6°C contributes 6 GDD), and the observations are de-duplicated and ordered by location and date.

daily_data <- climate_wide %>%
  mutate(
    date = as.Date(date),
    year = year(date),
    doy  = yday(date),
    # If TAVG is missing, compute it as the average of TMIN and TMAX.
    TAVG = if_else(is.na(TAVG), (TMIN + TMAX) / 2, TAVG),
    # Calculate daily GDD using a base temperature of 0°C.
    GDD  = pmax(TAVG - 0, 0)
  ) %>%
  distinct(location, date, .keep_all = TRUE) %>%
  arrange(location, date)

7. Computing the 7-Day Rolling Average of TAVG

To smooth daily noise and capture sustained warming trends, a 7-day rolling average of TAVG is computed for each location.

daily_data <- daily_data %>%
  group_by(location) %>%
  arrange(date) %>%
  mutate(roll7_TAVG = rollmean(TAVG, k = 7, fill = NA, align = "right")) %>%
  ungroup()
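
Because the window is right-aligned, the first six days of each location's series have no rolling value. A small illustration of this behavior (using zoo::rollmean as loaded above):

# With a right-aligned 7-day window, positions 1-6 are NA.
rollmean(1:10, k = 7, fill = NA, align = "right")
##  [1] NA NA NA NA NA NA  4  5  6  7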

8. Calculating Dormancy Release

This section defines a threshold for dormancy release (set to 10°C here, to be calibrated) and then computes, for each location and year, the first date when the 7-day rolling average of TAVG exceeds this threshold.

dormancy_threshold <- 10
dormancy_release <- daily_data %>%
  group_by(location, year) %>%
  arrange(date) %>%
  summarise(
    dormancy_release_date = min(date[!is.na(roll7_TAVG) & roll7_TAVG >= dormancy_threshold], na.rm = TRUE),
    dormancy_doy = yday(dormancy_release_date),
    .groups = "drop"
  )
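
If the rolling average never reaches the threshold within a location-year, min() over an empty set returns an infinite value (with a warning). A minimal sketch of a guard for that case, assuming such years should simply be dropped:

# Drop location-years in which the threshold was never reached.
dormancy_release <- dormancy_release %>%
  filter(is.finite(as.numeric(dormancy_release_date)))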

9. Calculating Cumulative GDD up to Dormancy Release

For each location and year, the cumulative GDD from the start of the year until the dormancy release date is computed.

cumGDD_dormancy <- daily_data %>%
  group_by(location, year) %>%
  inner_join(dormancy_release, by = c("location", "year")) %>%
  filter(date <= dormancy_release_date) %>%
  summarise(
    cumGDD_dormancy = sum(GDD, na.rm = TRUE),
    .groups = "drop"
  )

10. Calculating Winter Chill

This section calculates winter chill by counting the number of days during the winter period (October 1 of the previous year to February 28 of the current year) when TAVG is below a threshold (set here to 7°C).

chill_threshold <- 7
daily_data <- daily_data %>%
  mutate(chill = if_else(TAVG < chill_threshold, 1, 0))
winter_chill <- daily_data %>%
  mutate(year = year(date)) %>%
  group_by(location, year) %>%
  filter(date >= as.Date(paste0(year - 1, "-10-01")) & date <= as.Date(paste0(year, "-02-28"))) %>%
  summarise(
    winter_chill = sum(chill, na.rm = TRUE),
    .groups = "drop"
  )
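
Because year is taken from the calendar date, October-December observations fall into the previous year's group, so the count above effectively covers January 1 through February 28 of each year. A minimal sketch of an alternative "winter year" assignment that credits autumn days to the following spring (the winter_year column is introduced here purely for illustration):

# Assign each autumn day to the following spring's winter season.
winter_chill_alt <- daily_data %>%
  mutate(winter_year = if_else(month(date) >= 10, year + 1, year)) %>%
  filter(month(date) %in% c(10, 11, 12, 1, 2),
         !(month(date) == 2 & mday(date) == 29)) %>%
  group_by(location, winter_year) %>%
  summarise(winter_chill = sum(chill, na.rm = TRUE), .groups = "drop")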

11. Calculating Spring Warming

Spring warmth is computed as the cumulative GDD from the dormancy release date through March 31. This metric captures the forcing effect on bud development after dormancy is broken.

spring_warming <- daily_data %>%
  mutate(year = year(date)) %>%
  inner_join(dormancy_release, by = c("location", "year")) %>%
  # Filter to days from dormancy release up to March 31 of that year.
  filter(date >= dormancy_release_date,
         date <= as.Date(paste0(year, "-03-31"))) %>%
  group_by(location, year, dormancy_release_date, dormancy_doy) %>%
  summarise(spring_warmth = sum(GDD, na.rm = TRUE),
            .groups = "drop")

12. Combining All Predictors

All derived predictors—cumulative GDD up to dormancy, winter chill, and spring warmth—are merged with the cherry blossom dataset to form the final model data.

model_data <- cherry %>%
  left_join(cumGDD_dormancy, by = c("location", "year")) %>%
  left_join(winter_chill, by = c("location", "year")) %>%
  left_join(spring_warming, by = c("location", "year"))
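
Not every historical bloom record has matching climate data (NOAA retrieval starts around 1969-1970, well after the earliest bloom observations), so it is worth checking how many rows carry a complete set of derived predictors; a quick sketch:

# Count bloom records with a complete set of derived climate predictors.
model_data %>%
  group_by(location) %>%
  summarise(
    n_bloom_records = n(),
    n_with_climate  = sum(!is.na(cumGDD_dormancy) & !is.na(winter_chill) & !is.na(spring_warmth)),
    .groups = "drop"
  )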

13. XGBoost Modeling and Prediction

An XGBoost regression model is trained using selected predictors (dormancy release metrics, cumGDD, winter chill, spring warmth, and location). The model is tuned using cross-validation. Finally, new predictors from 2025’s weather data are used to forecast the bloom day-of-year, which is then converted to a calendar date.

Model Building

model_recipe <- recipe(bloom_doy ~ dormancy_doy + cumGDD_dormancy + winter_chill + spring_warmth + location, data = model_data) %>%
  step_dummy(all_nominal_predictors()) %>%   # one-hot encode location
  # Possibly center/scale numeric predictors
  step_center(all_predictors()) %>%
  step_scale(all_predictors())


# XGBoost specification
# We'll tune some hyperparameters:
boost_spec <- boost_tree(
  trees = tune(),       # number of trees
  learn_rate = tune(),  # learning rate (shrinkage)
  mtry = tune(),        # # of features sampled at each split
  min_n = tune()        # min data points in a node
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

bloom_wf <- workflow() %>%
  add_model(boost_spec) %>%
  add_recipe(model_recipe)

Cross-Validation and Tuning

set.seed(1001)
bloom_folds <- vfold_cv(model_data, v = 5)

# We'll create parameter ranges
tune_params <- parameters(
  finalize(mtry(), x = model_data %>% select(-bloom_doy)),
  trees(range = c(100, 2000)),
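  # NOTE: dials::learn_rate() is parameterized on the log10 scale by default,
  # so this range corresponds to learning rates of roughly 10^0.001 to 10^0.3
  # (about 1.0 to 2.0), which is why the tuned values below are near 1; to
  # search 0.001-0.3 on the original scale, pass trans = NULL.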
  learn_rate(range = c(0.001, 0.3)),
  min_n(range = c(2, 10))
)

# Random grid with 20 candidates (named to avoid shadowing tune::tune_grid())
candidate_grid <- grid_random(tune_params, size = 20)

tuning_res <- tune_grid(
  bloom_wf,
  resamples = bloom_folds,
  grid = candidate_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

# Show the best results
show_best(tuning_res, metric = "rmse", n = 5)
## # A tibble: 5 × 10
##    mtry trees min_n learn_rate .metric .estimator  mean     n std_err .config   
##   <int> <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
## 1    10   123     9       1.00 rmse    standard    6.68     5  0.149  Preproces…
## 2    10  1946     7       1.18 rmse    standard    6.71     5  0.140  Preproces…
## 3     6  1780     6       1.12 rmse    standard    6.73     5  0.156  Preproces…
## 4     1  1967     2       1.13 rmse    standard    6.74     5  0.141  Preproces…
## 5    10  1717     3       1.11 rmse    standard    6.80     5  0.0928 Preproces…
# Choose the best params by RMSE
best_params <- select_best(tuning_res, metric = "rmse")
best_params
## # A tibble: 1 × 5
##    mtry trees min_n learn_rate .config              
##   <int> <int> <int>      <dbl> <chr>                
## 1    10   123     9       1.00 Preprocessor1_Model15
# Finalize the workflow & fit
final_wf <- finalize_workflow(bloom_wf, best_params)
# Fit on the entire historical dataset
final_fit <- final_wf %>%
  fit(data = model_data)
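
The cross-validated performance of the selected configuration can be inspected before refitting; a sketch:

# Cross-validated metrics for the chosen hyperparameter configuration.
tuning_res %>%
  collect_metrics() %>%
  filter(.config == best_params$.config)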

14. Retrieving 2025 Weather Data for Prediction

For forecasting, daily weather data for 2025 is retrieved from archived AccuWeather pages for each of the five locations. The helper function get_weather_table() extracts temperature data from the web, and data for Kyoto, Liestal, New York City, Washington D.C., and Vancouver are processed. Temperatures are converted from Fahrenheit to Celsius, and all data are combined into one data frame.

get_weather_table <- function(url)
  read_html(url) %>% 
  html_nodes("div.monthly-calendar") %>% 
  html_text2() %>%
  str_replace("N/A", "N/A N/A") %>%
  str_remove_all("°|Hist. Avg. ") %>%
  str_split(" ", simplify = TRUE) %>%
  parse_number() %>%
  matrix(ncol = 3, 
         byrow = TRUE,
         dimnames = list(NULL, c("day", "tmax", "tmin"))) %>%
  as_tibble() %>%
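  # The scraped calendar grid includes spill-over days from adjacent months;
  # the points where the day number decreases mark the month boundaries, and
  # only the rows belonging to the target month are kept below.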
  filter(
    row_number() %in%
      (which(diff(day) < 0) %>% (function(x) if(length(x) == 1) seq(1, x[1], 1) else seq(x[1] + 1, x[2], 1))))
kyoto <-
  tibble(
    base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/jp/kyoto-shi/2-224436_1_al/",
    month = month.name[1:4],
    year = 2025,
    url = str_c(base_url, tolower(month), "-weather/2-224436_1_al?year=", year)) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
            year = parse_number(format(date, "%Y")),
            TMAX = tmax,
            TMIN = tmin,
            TAVG = (TMAX + TMIN) / 2)

liestal <-
  tibble(
    base_url = "https://web.archive.org/web/20250219/https://www.accuweather.com/en/ch/liestal/311994/",
    month = month.name[1:4],
    year = 2025,
    url = str_c(base_url, tolower(month), "-weather/311994?year=", year)) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
            year = parse_number(format(date, "%Y")),
            TMAX = tmax,
            TMIN = tmin,
            TAVG = (TMAX + TMIN) / 2)

newyork <-
  tibble(
    base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/us/new-york/10021/",
    month = month.name[1:4],
    year = 2025,
    url = str_c(base_url, tolower(month), "-weather/349727?year=", year)) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
            year = parse_number(format(date, "%Y")),
            TMAX = tmax,
            TMIN = tmin,
            TAVG = (TMAX + TMIN) / 2)

washington <-
  tibble(
    base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/us/washington/20006/",
    month = month.name[1:4],
    year = 2025,
    url = str_c(base_url, tolower(month), "-weather/18-327659_1_al?year=", year)) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
            year = parse_number(format(date, "%Y")),
            TMAX = tmax,
            TMIN = tmin,
            TAVG = (TMAX + TMIN) / 2)

vancouver <-
  tibble(
    base_url = "https://web.archive.org/web/20250219/https://www.accuweather.com/en/us/vancouver/98661/",
    month = month.name[1:4],
    year = 2025,
    url = str_c(base_url, tolower(month), "-weather/331419?year=", year)) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
            year = parse_number(format(date, "%Y")),
            TMAX = tmax,
            TMIN = tmin,
            TAVG = (TMAX + TMIN) / 2)
# Add a location column to each:
washington_df   <- washington %>% mutate(location = "washingtondc")
kyoto_df   <- kyoto %>% mutate(location = "kyoto")
liestal_df <- liestal %>% mutate(location = "liestal")
newyorkcity_df   <- newyork %>% mutate(location = "newyorkcity")
vancouver_df <- vancouver %>% mutate(location = "vancouver")

# Combine all data frames into one:
combined_df <- bind_rows(washington_df, kyoto_df, liestal_df, newyorkcity_df, vancouver_df)
combined_df <- combined_df %>%
  mutate(TMAX = (TMAX - 32) * 5/9,
         TMIN = (TMIN - 32) * 5/9,
         TAVG = (TAVG - 32) * 5/9)
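
A quick sanity check on the scraped series (a sketch) confirms that each location contributes one row per day from January 1 through April 30 and that the converted temperatures fall in a plausible Celsius range:

# One row per location-day and plausible Celsius values after conversion.
combined_df %>%
  group_by(location) %>%
  summarise(n_days   = n(),
            min_TAVG = min(TAVG, na.rm = TRUE),
            max_TAVG = max(TAVG, na.rm = TRUE),
            .groups = "drop")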

15. Preparing 2025 Daily Climate Data

This section processes the combined daily climate data for 2025. It ensures dates are properly formatted, calculates day-of-year, computes daily GDD (based on TAVG), and removes duplicate observations.

daily_2025 <- combined_df %>%
  mutate(
    date = as.Date(date),
    year = year(date),
    doy  = yday(date),
    # If TAVG is missing, compute it as the average of TMIN and TMAX.
    TAVG = if_else(is.na(TAVG), (TMAX + TMIN) / 2, TAVG),
    # Calculate daily GDD using a base temperature of 0°C.
    GDD  = pmax(TAVG - 0, 0)
  ) %>%
   distinct(location, date, .keep_all = TRUE) %>%
  arrange(location, date)

16. Computing 7-Day Rolling Average for 2025

A 7-day rolling average of TAVG is computed for each location to smooth daily fluctuations and identify periods of sustained warming critical for dormancy release determination.

daily_2025 <- daily_2025 %>%
  group_by(location) %>%
  arrange(date) %>%
  mutate(roll7_TAVG = rollmean(TAVG, k = 7, fill = NA, align = "right")) %>%
  ungroup()

17. Calculating Dormancy Release for 2025

For each location, dormancy release is calculated as the first day in 2025 when the 7-day rolling average of TAVG exceeds the calibrated threshold (10°C in this example). The day-of-year for dormancy release is also computed.

dormancy_threshold <- 10
dormancy_release_2025 <- daily_2025 %>%
  group_by(location, year) %>%
  arrange(date) %>%
  summarise(
    dormancy_release_date = min(date[!is.na(roll7_TAVG) & roll7_TAVG >= dormancy_threshold], na.rm = TRUE),
    dormancy_doy = yday(dormancy_release_date),
    .groups = "drop"
  )

18. Cumulative GDD up to Dormancy Release for 2025

For each location in 2025, the cumulative growing degree days (cumGDD) from the start of the year up to the dormancy release date are calculated. This metric quantifies the accumulated warmth prior to dormancy release.

cumGDD_dormancy_2025 <- daily_2025 %>%
  group_by(location, year) %>%
  inner_join(dormancy_release_2025, by = c("location", "year")) %>%
  filter(date <= dormancy_release_date) %>%
  summarise(
    cumGDD_dormancy = sum(GDD, na.rm = TRUE),
    .groups = "drop"
  )

19. Computing Winter Chill for 2025

This section calculates winter chill for each location in 2025 by counting the days with TAVG below the chill threshold (7°C). The intended winter window runs from October 1 of the previous year to February 28 of 2025, but because the scraped 2025 series begins on January 1, the count here effectively covers January 1 through February 28, 2025.

chill_threshold <- 7
daily_2025 <- daily_2025 %>%
  mutate(chill = if_else(TAVG < chill_threshold, 1, 0))

winter_chill_2025 <- daily_2025 %>%
  mutate(year = year(date)) %>%
  group_by(location, year) %>%
  filter(date >= as.Date(paste0(year - 1, "-10-01")) & date <= as.Date(paste0(year, "-02-28"))) %>%
  summarise(
    winter_chill = sum(chill, na.rm = TRUE),
    .groups = "drop"
  )

20. Calculating Spring Warming for 2025

Spring warmth is computed as the cumulative GDD from the dormancy release date through March 31, capturing the forcing effect on bud development after dormancy is broken.

spring_warming_2025 <- daily_2025 %>%
  mutate(year = year(date)) %>%
  inner_join(dormancy_release_2025, by = c("location", "year")) %>%
  filter(date >= dormancy_release_date,
         date <= as.Date(paste0(year, "-03-31"))) %>%
  group_by(location, year, dormancy_release_date, dormancy_doy) %>%
  summarise(spring_warmth = sum(GDD, na.rm = TRUE),
            .groups = "drop")

21. Combining All Predictors for 2025

All predictors derived from the 2025 daily climate data—cumulative GDD up to dormancy release, winter chill, and spring warmth—are merged into a single dataset (future_data) for use in the final prediction model.

future_data <- cumGDD_dormancy_2025 %>%
  left_join(winter_chill_2025, by = c("location", "year")) %>%
  left_join(spring_warming_2025, by = c("location", "year"))

22. Predicting 2025 Bloom Dates

Using a pre-trained XGBoost model (final_fit), predictions for the bloom day-of-year in 2025 are generated based on the future predictors. The predicted DOY is then converted into an actual calendar date (assuming January 1 is day 1). ChatGPT generated the complete prediction code.

predictions_2025 <- future_data %>%
    bind_cols(predict(final_fit, new_data = future_data)) %>%
    rename(pred_doy = .pred) %>%
    mutate(pred_date = as.Date("2025-01-01") + (pred_doy - 1)) %>%
    select(location, pred_date)
print(predictions_2025)
## # A tibble: 5 × 2
##   location     pred_date 
##   <chr>        <date>    
## 1 kyoto        2025-04-04
## 2 liestal      2025-03-30
## 3 newyorkcity  2025-04-08
## 4 vancouver    2025-03-25
## 5 washingtondc 2025-03-27