Accurate prediction of cherry blossom peak bloom is essential for ecological research, cultural planning, and tourism, particularly under changing climate conditions. This study introduces an advanced machine learning framework that integrates daily climate data with robust feature engineering to forecast cherry blossom timing at five distinct locations: Washington D.C., Kyoto, Liestal, New York City, and Vancouver. Data from NOAA and AccuWeather web sources served as the foundation for deriving key phenological predictors from daily temperature observations, including daily average (TAVG), minimum (TMIN), and maximum (TMAX) temperatures. A 7-day rolling average of TAVG was computed to smooth day-to-day fluctuations and identify the dormancy release date—the first day when the rolling average exceeds a calibrated threshold. From this metric, the dormancy day-of-year, cumulative growing degree days (cumGDD) from January 1 until dormancy release, and accumulated winter chill (the number of days with mean temperatures below a set threshold during a defined winter period) were determined. In addition, spring warmth was quantified as the cumulative GDD from the dormancy release date through the end of March to capture the forcing effect on bud development.
Historical bloom observations, expressed as bloom day-of-year, were merged with these predictors to construct a comprehensive dataset. An XGBoost regression model was then employed, with hyperparameters optimized via repeated cross-validation to capture complex non-linear relationships between climate drivers and bloom timing. The final model forecasts the 2025 bloom day-of-year for each location, with the predicted values subsequently converted to calendar dates.
Notably, all aspects of this study, including the conceptual framework, methodological design, and code implementation, were generated by ChatGPT, and the full workflow is documented here so the analysis can be reproduced end to end. This multi-location, daily climate-based framework offers insight into how climate variability shapes cherry blossom timing.
This section sets up the R Markdown environment by configuring chunk options (to control echoing, warnings, and figure alignment) and then loads all required libraries. These packages include tools for data manipulation, date handling, web scraping, and model building.
library(tidyverse)
library(zoo)
library(dplyr)
library(lubridate)
library(tidymodels) # for the advanced modeling workflow
library(rvest)
library(xgboost)
The following code loads the cherry blossom data from five CSV files corresponding to different locations. The datasets are combined into a single data frame for further analysis.
setwd("peak-bloom-prediction-main/data/")
cherry <- read.csv("washingtondc.csv") |>
bind_rows(read.csv("liestal.csv")) |>
bind_rows(read.csv("kyoto.csv")) |>
bind_rows(read.csv("vancouver.csv")) |>
bind_rows(read.csv("nyc.csv"))
library(httr2)
library(jsonlite)
NOAA_API_BASE_URL <- "https://www.ncei.noaa.gov/cdo-web/api/v2/data"
# Define the station IDs for the specified locations
stations <- c(
"washingtondc" = "GHCND:USW00013743",
"vancouver" = "GHCND:CA001108395",
"newyorkcity" = "GHCND:USW00014732",
"liestal" = "GHCND:SZ000001940",
"kyoto" = "GHCND:JA000047759")
The next code block defines helper functions (nested_to_tibble and get_daily_avg_temp) to retrieve daily temperature data from NOAA and reshape the nested list returned by the API into a tidy tibble format.
nested_to_tibble <- function (x) {
# Determine the variable names in the response
variable_names <- map(x, names) |>
unlist(use.names = FALSE) |>
unique()
names(variable_names) <- variable_names
# Reshape the response from a nested list into a table
map(variable_names, \(i) {
map(x, \(y) {
if (is.null(y[[i]])) {
NA_character_
} else {
y[[i]]
}
}) |>
unlist(use.names = FALSE)
}) |>
as_tibble()
}
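As a quick illustration (with made-up records, not actual API output), the helper fills absent fields with NA; note that when any record lacks a field, the affected column is coerced to character, which is why a numeric conversion is applied before pivoting later on.
# Made-up example records mimicking the API's nested structure; the second record
# has no 'value' field, so that column is filled with NA (and coerced to character).
nested_to_tibble(list(
  list(date = "2024-01-01T00:00:00", datatype = "TAVG", value = 4.2),
  list(date = "2024-01-01T00:00:00", datatype = "TMIN")
))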
get_daily_avg_temp <- function(station_id, start_date, end_date,
api_key , base_url, window_size = 300) {
windows <- seq(as_date(start_date),
as_date(end_date) + days(window_size + 1),
by = sprintf("%d days", window_size))
batches <- map2(windows[-length(windows)], windows[-1] - days(1), \(from, to) {
if (from > Sys.Date()) {
return(NULL)
}
response <- tryCatch(
request(base_url) |>
req_headers(token = api_key) |>
req_url_query(
datasetid = "GHCND",
stationid = station_id,
datatypeid = "TAVG,TMAX,TMIN",
startdate = from,
enddate = min(as_date(to), Sys.Date()),
units = "metric",
limit = 1000
) |>
req_retry(max_tries = 10) |>
req_perform() |>
resp_body_json(),
httr2_http = \(cnd) {
rlang::warn(sprintf("Failed to retrieve data for station %s in time window %s--%s",
station_id, from, to),
parent = cnd)
NULL
})
})
map(batches, \(x) nested_to_tibble(x$results)) |>
list_rbind() |>
mutate(date = as_date(date))
}
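For reference, a single-station call might look like the following (an illustrative sketch; dc_recent is just an example name, and the request assumes the token defined in the setup above):
# Illustrative call: fetch recent Washington, D.C. temperatures for a short window.
dc_recent <- get_daily_avg_temp(
  station_id = stations[["washingtondc"]],
  start_date = "2024-01-01",
  end_date = "2024-03-31",
  api_key = NOAA_WEB_API_TOKEN,
  base_url = NOAA_API_BASE_URL
)
head(dc_recent)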
This code retrieves historical temperature data for each location using the get_daily_avg_temp() function and then reshapes the data into a wide format.
historic_temperatures <- cherry |>
group_by(location) |>
summarize(start_date = sprintf('%d-01-01', pmax(1970, min(year)) - 1)) |>
left_join(tibble(location = names(stations),
station_id = stations),
by = 'location') |>
group_by(location) |>
group_modify(\(x, gr) {
get_daily_avg_temp(station_id = x$station_id,
start_date = x$start_date,
end_date = Sys.Date(),
api_key = NOAA_WEB_API_TOKEN,
base_url = NOAA_API_BASE_URL)
})
climate_wide <- historic_temperatures %>%
  # nested_to_tibble() can return values as character when any record lacks a field,
  # so convert explicitly; drop the station/attributes columns so each datatype
  # pivots onto a single row per location and date.
  mutate(value = as.numeric(value)) %>%
  select(location, date, datatype, value) %>%
  pivot_wider(
    names_from = datatype,
    values_from = value
  )
This section processes the reshaped historical climate data to compute daily variables such as TAVG (with imputation), growing degree days (GDD), and organizes the data by location and date.
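Formally (matching the code below), the daily growing degree days use a base temperature $T_{\mathrm{base}} = 0\,^{\circ}\mathrm{C}$, and the cumulative GDD through day $D$ is their running sum:
$$\mathrm{GDD}_d = \max(\mathrm{TAVG}_d - T_{\mathrm{base}},\ 0), \qquad \mathrm{cumGDD}_D = \sum_{d=1}^{D} \mathrm{GDD}_d.$$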
daily_data <- climate_wide %>%
mutate(
date = as.Date(date),
year = year(date),
doy = yday(date),
# If TAVG is missing, compute it as the average of TMIN and TMAX.
TAVG = if_else(is.na(TAVG), (TMIN + TMAX) / 2, TAVG),
# Calculate daily GDD using a base temperature of 0°C.
GDD = pmax(TAVG - 0, 0)
) %>%
distinct(location, date, .keep_all = TRUE) %>%
arrange(location, date)
To smooth daily noise and capture sustained warming trends, a 7-day rolling average of TAVG is computed for each location.
daily_data <- daily_data %>%
group_by(location) %>%
arrange(date) %>%
mutate(roll7_TAVG = rollmean(TAVG, k = 7, fill = NA, align = "right")) %>%
ungroup()
This section defines a threshold for dormancy release (set to 10°C here, to be calibrated) and then computes, for each location and year, the first date when the 7-day rolling average of TAVG exceeds this threshold.
dormancy_threshold <- 10
dormancy_release <- daily_data %>%
group_by(location, year) %>%
arrange(date) %>%
summarise(
dormancy_release_date = min(date[!is.na(roll7_TAVG) & roll7_TAVG >= dormancy_threshold], na.rm = TRUE),
dormancy_doy = yday(dormancy_release_date),
.groups = "drop"
)
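The 10°C value above is a placeholder. As one possible calibration strategy (a sketch, not part of the original pipeline), candidate thresholds can be scanned and scored by how strongly the resulting dormancy day-of-year correlates with the observed bloom day-of-year:
# Hypothetical calibration sketch: scan candidate thresholds and compare how well
# the implied dormancy day-of-year tracks observed bloom timing.
candidate_thresholds <- seq(5, 15, by = 1)
threshold_scores <- map_dfr(candidate_thresholds, \(th) {
  release <- daily_data %>%
    group_by(location, year) %>%
    summarise(
      dormancy_doy = suppressWarnings(min(doy[!is.na(roll7_TAVG) & roll7_TAVG >= th])),
      .groups = "drop"
    ) %>%
    filter(is.finite(dormancy_doy))
  scored <- cherry %>%
    inner_join(release, by = c("location", "year"))
  tibble(threshold = th,
         cor_with_bloom = cor(scored$dormancy_doy, scored$bloom_doy, use = "complete.obs"))
})
threshold_scores %>% arrange(desc(cor_with_bloom))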
For each location and year, the cumulative GDD from the start of the year until the dormancy release date is computed.
cumGDD_dormancy <- daily_data %>%
group_by(location, year) %>%
inner_join(dormancy_release, by = c("location", "year")) %>%
filter(date <= dormancy_release_date) %>%
summarise(
cumGDD_dormancy = sum(GDD, na.rm = TRUE),
.groups = "drop"
)
This section calculates winter chill as the number of days with TAVG below a threshold (set here to 7°C). The filter nominally spans October 1 of the previous year through February 28 of the current year, but because the data are grouped by calendar year, only the January 1 to February 28 portion of each bloom year actually contributes to the count.
chill_threshold <- 7
daily_data <- daily_data %>%
mutate(chill = if_else(TAVG < chill_threshold, 1, 0))
winter_chill <- daily_data %>%
mutate(year = year(date)) %>%
group_by(location, year) %>%
filter(date >= as.Date(paste0(year - 1, "-10-01")) & date <= as.Date(paste0(year, "-02-28"))) %>%
summarise(
winter_chill = sum(chill, na.rm = TRUE),
.groups = "drop"
)
Spring warmth is computed as the cumulative GDD from the dormancy release date through March 31. This metric captures the forcing effect on bud development after dormancy is broken.
spring_warming <- daily_data %>%
mutate(year = year(date)) %>%
inner_join(dormancy_release, by = c("location", "year")) %>%
# Filter to days from dormancy release up to March 31 of that year.
filter(date >= dormancy_release_date,
date <= as.Date(paste0(year, "-03-31"))) %>%
group_by(location, year, dormancy_release_date, dormancy_doy) %>%
summarise(spring_warmth = sum(GDD, na.rm = TRUE),
.groups = "drop")
All derived predictors—cumulative GDD up to dormancy, winter chill, and spring warmth—are merged with the cherry blossom dataset to form the final model data.
model_data <- cherry %>%
left_join(cumGDD_dormancy, by = c("location", "year")) %>%
left_join(winter_chill, by = c("location", "year")) %>%
left_join(spring_warming, by = c("location", "year"))
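NOAA coverage does not extend back to every historical bloom year, so some rows in model_data will have missing predictors; a quick diagnostic (an added check, not in the original code) shows how many bloom years lack each feature:
# Added diagnostic: count missing values per engineered predictor, by location.
model_data %>%
  group_by(location) %>%
  summarise(across(c(dormancy_doy, cumGDD_dormancy, winter_chill, spring_warmth),
                   ~ sum(is.na(.x))),
            n_years = n())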
An XGBoost regression model is trained using selected predictors (dormancy release metrics, cumGDD, winter chill, spring warmth, and location). The model is tuned using cross-validation. Finally, new predictors from 2025’s weather data are used to forecast the bloom day-of-year, which is then converted to a calendar date.
model_recipe <- recipe(bloom_doy ~ dormancy_doy + cumGDD_dormancy + winter_chill + spring_warmth + location, data = model_data) %>%
step_dummy(all_nominal_predictors()) %>% # one-hot encode location
# Possibly center/scale numeric predictors
step_center(all_predictors()) %>%
step_scale(all_predictors())
# XGBoost specification
# We'll tune some hyperparameters:
boost_spec <- boost_tree(
trees = tune(), # number of trees
learn_rate = tune(), # learning rate (shrinkage)
mtry = tune(), # # of features sampled at each split
min_n = tune() # min data points in a node
) %>%
set_engine("xgboost") %>%
set_mode("regression")
bloom_wf <- workflow() %>%
add_model(boost_spec) %>%
add_recipe(model_recipe)
set.seed(1001)
bloom_folds <- vfold_cv(model_data, v = 5)
# We'll create parameter ranges
tune_params <- parameters(
finalize(mtry(), x = model_data %>% select(-bloom_doy)),
trees(range = c(100, 2000)),
learn_rate(range = c(0.001, 0.3)),
min_n(range = c(2, 10))
)
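# Note (editorial assumption about intent): dials::learn_rate() is defined on a
# log10 scale, so the range c(0.001, 0.3) above is interpreted in log10 units and
# samples shrinkage values of roughly 10^0.001 to 10^0.3 (about 1.0 to 2.0), which
# is what the tuning results below show. To search learning rates between 0.001
# and 0.3 themselves, the range would be supplied on the log10 scale instead:
#   learn_rate(range = c(log10(0.001), log10(0.3)))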
# Random grid with 20 candidates (named xgb_grid to avoid clashing with tune::tune_grid())
xgb_grid <- grid_random(tune_params, size = 20)
tuning_res <- tune_grid(
  bloom_wf,
  resamples = bloom_folds,
  grid = xgb_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)
# Show the best results
show_best(tuning_res, metric = "rmse", n = 5)
## # A tibble: 5 × 10
## mtry trees min_n learn_rate .metric .estimator mean n std_err .config
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 123 9 1.00 rmse standard 6.68 5 0.149 Preproces…
## 2 10 1946 7 1.18 rmse standard 6.71 5 0.140 Preproces…
## 3 6 1780 6 1.12 rmse standard 6.73 5 0.156 Preproces…
## 4 1 1967 2 1.13 rmse standard 6.74 5 0.141 Preproces…
## 5 10 1717 3 1.11 rmse standard 6.80 5 0.0928 Preproces…
# Choose the best params by RMSE
best_params <- select_best(tuning_res, metric = "rmse")
best_params
## # A tibble: 1 × 5
## mtry trees min_n learn_rate .config
## <int> <int> <int> <dbl> <chr>
## 1 10 123 9 1.00 Preprocessor1_Model15
# Finalize the workflow & fit
final_wf <- finalize_workflow(bloom_wf, best_params)
# Fit on the entire historical dataset
final_fit <- final_wf %>%
fit(data = model_data)
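Before forecasting, it can be useful to estimate how the finalized configuration performs out of sample. A minimal sketch, reusing the same cross-validation folds (this evaluation step is assumed, not part of the original code):
# Assumed evaluation step: resampled accuracy of the finalized workflow.
final_resamples <- final_wf %>%
  fit_resamples(resamples = bloom_folds, metrics = metric_set(rmse, mae))
collect_metrics(final_resamples)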
For forecasting, daily weather data for 2025 are retrieved from archived AccuWeather pages for each of the five locations. The helper function get_weather_table() extracts daily high and low temperatures from each monthly calendar page; data for Kyoto, Liestal, New York City, Washington D.C., and Vancouver are processed, combined into one data frame, and then converted from Fahrenheit to Celsius.
get_weather_table <- function(url) {
  read_html(url) %>%
    html_nodes("div.monthly-calendar") %>%
    html_text2() %>%
    # Duplicate the first "N/A" so a missing reading still parses into two tokens.
    str_replace("N/A", "N/A N/A") %>%
    # Strip degree symbols and the "Hist. Avg." labels before parsing numbers.
    str_remove_all("°|Hist. Avg. ") %>%
    str_split(" ", simplify = TRUE) %>%
    parse_number() %>%
    # Reshape the token stream into day / high / low triplets.
    matrix(ncol = 3,
           byrow = TRUE,
           dimnames = list(NULL, c("day", "tmax", "tmin"))) %>%
    as_tibble() %>%
    # The calendar grid also shows days from the adjacent months; day numbers reset
    # at each month boundary (diff(day) < 0), so keep only the rows belonging to the
    # requested month.
    filter(
      row_number() %in%
        (which(diff(day) < 0) %>% (function(x) if (length(x) == 1) seq(1, x[1], 1) else seq(x[1] + 1, x[2], 1))))
}
kyoto <-
tibble(
base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/jp/kyoto-shi/2-224436_1_al/",
month = month.name[1:4],
year = 2025,
url = str_c(base_url, tolower(month), "-weather/2-224436_1_al?year=", year)) %>%
mutate(temp = map(url, get_weather_table)) %>%
pull(temp) %>%
reduce(bind_rows) %>%
transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
year = parse_number(format(date, "%Y")),
TMAX = tmax,
TMIN = tmin ,
TAVG = (TMAX + TMIN) / 2)
liestal <-
tibble(
base_url = "https://web.archive.org/web/20250219/https://www.accuweather.com/en/ch/liestal/311994/",
month = month.name[1:4],
year = 2025,
url = str_c(base_url, tolower(month), "-weather/311994?year=", year)) %>%
mutate(temp = map(url, get_weather_table)) %>%
pull(temp) %>%
reduce(bind_rows) %>%
transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
year = parse_number(format(date, "%Y")),
TMAX = tmax,
TMIN = tmin ,
TAVG = (TMAX + TMIN) / 2)
newyork <-
tibble(
base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/us/new-york/10021/",
month = month.name[1:4],
year = 2025,
url = str_c(base_url, tolower(month), "-weather/349727?year=", year)) %>%
mutate(temp = map(url, get_weather_table)) %>%
pull(temp) %>%
reduce(bind_rows) %>%
transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
year = parse_number(format(date, "%Y")),
TMAX = tmax,
TMIN = tmin ,
TAVG = (TMAX + TMIN) / 2)
washington <-
tibble(
base_url = "https://web.archive.org/web/20250213/https://www.accuweather.com/en/us/washington/20006/",
month = month.name[1:4],
year = 2025,
url = str_c(base_url, tolower(month), "-weather/18-327659_1_al?year=", year)) %>%
mutate(temp = map(url, get_weather_table)) %>%
pull(temp) %>%
reduce(bind_rows) %>%
transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
year = parse_number(format(date, "%Y")),
TMAX = tmax,
TMIN = tmin ,
TAVG = (TMAX + TMIN) / 2)
vancouver <-
tibble(
base_url = "https://web.archive.org/web/20250219/https://www.accuweather.com/en/us/vancouver/98661/",
month = month.name[1:4],
year = 2025,
url = str_c(base_url, tolower(month), "-weather/331419?year=", year)) %>%
mutate(temp = map(url, get_weather_table)) %>%
pull(temp) %>%
reduce(bind_rows) %>%
transmute(date = seq(as.Date("2025-01-01"), as.Date("2025-04-30"), 1),
year = parse_number(format(date, "%Y")),
TMAX = tmax,
TMIN = tmin ,
TAVG = (TMAX + TMIN) / 2)
# Add a location column to each:
washington_df <- washington %>% mutate(location = "washingtondc")
kyoto_df <- kyoto %>% mutate(location = "kyoto")
liestal_df <- liestal %>% mutate(location = "liestal")
newyorkcity_df <- newyork %>% mutate(location = "newyorkcity")
vancouver_df <- vancouver %>% mutate(location = "vancouver")
# Combine all data frames into one:
combined_df <- bind_rows(washington_df, kyoto_df, liestal_df, newyorkcity_df, vancouver_df)
combined_df <- combined_df %>%
mutate(TMAX = (TMAX - 32) * 5/9,
TMIN = (TMIN - 32) * 5/9,
TAVG = (TAVG - 32) * 5/9)
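Web scraping is brittle, so a quick sanity check on the converted temperatures (an assumed step, not in the original code) helps catch parsing failures before feature engineering:
# Assumed sanity check: converted daily averages should fall in a plausible
# winter/spring range (roughly -20 to 30 °C) for all five cities.
combined_df %>%
  group_by(location) %>%
  summarise(n_days = n(),
            min_tavg = min(TAVG, na.rm = TRUE),
            max_tavg = max(TAVG, na.rm = TRUE))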
This section processes the combined daily climate data for 2025. It ensures dates are properly formatted, calculates day-of-year, computes daily GDD (based on TAVG), and removes duplicate observations.
daily_2025 <- combined_df %>%
  mutate(
    date = as.Date(date),
    year = year(date),
    doy = yday(date),
    # If TAVG is missing, compute it as the average of TMIN and TMAX.
    TAVG = if_else(is.na(TAVG), (TMAX + TMIN) / 2, TAVG),
    # Calculate GDD: if TAVG exceeds the 0°C base temperature, use the excess; otherwise 0.
    GDD = pmax(TAVG - 0, 0)
  ) %>%
  distinct(location, date, .keep_all = TRUE) %>%
  arrange(location, date)
A 7-day rolling average of TAVG is computed for each location to smooth daily fluctuations and identify periods of sustained warming critical for dormancy release determination.
daily_2025 <- daily_2025 %>%
group_by(location) %>%
arrange(date) %>%
mutate(roll7_TAVG = rollmean(TAVG, k = 7, fill = NA, align = "right")) %>%
ungroup()
For each location, dormancy release is calculated as the first day in 2025 when the 7-day rolling average of TAVG exceeds the calibrated threshold (10°C in this example). The day-of-year for dormancy release is also computed.
dormancy_threshold <- 10
dormancy_release_2025 <- daily_2025 %>%
group_by(location, year) %>%
arrange(date) %>%
summarise(
dormancy_release_date = min(date[!is.na(roll7_TAVG) & roll7_TAVG >= dormancy_threshold], na.rm = TRUE),
dormancy_doy = yday(dormancy_release_date),
.groups = "drop"
)
For each location in 2025, the cumulative growing degree days (cumGDD) from the start of the year up to the dormancy release date are calculated. This metric quantifies the accumulated warmth prior to dormancy release.
cumGDD_dormancy_2025 <- daily_2025 %>%
  group_by(location, year) %>%
  # Join against the 2025 dormancy-release table (not the historical one).
  inner_join(dormancy_release_2025, by = c("location", "year")) %>%
  filter(date <= dormancy_release_date) %>%
  summarise(
    cumGDD_dormancy = sum(GDD, na.rm = TRUE),
    .groups = "drop"
  )
This section calculates winter chill for each location in 2025 by counting days with TAVG below the chill threshold (7°C). As with the historical data, grouping by calendar year (and the fact that the scraped 2025 data begin on January 1) means the count effectively covers January 1 through February 28, 2025.
chill_threshold <- 7
daily_2025 <- daily_2025 %>%
mutate(chill = if_else(TAVG < chill_threshold, 1, 0))
winter_chill_2025 <- daily_2025 %>%
mutate(year = year(date)) %>%
group_by(location, year) %>%
filter(date >= as.Date(paste0(year - 1, "-10-01")) & date <= as.Date(paste0(year, "-02-28"))) %>%
summarise(
winter_chill = sum(chill, na.rm = TRUE),
.groups = "drop"
)
Spring warmth is computed as the cumulative GDD from the dormancy release date through March 31, capturing the forcing effect on bud development after dormancy is broken.
spring_warming_2025 <- daily_2025 %>%
mutate(year = year(date)) %>%
inner_join(dormancy_release_2025, by = c("location", "year")) %>%
filter(date >= dormancy_release_date,
date <= as.Date(paste0(year, "-03-31"))) %>%
group_by(location, year, dormancy_release_date, dormancy_doy) %>%
summarise(spring_warmth = sum(GDD, na.rm = TRUE),
.groups = "drop")
All predictors derived from the 2025 daily climate data—cumulative GDD up to dormancy release, winter chill, and spring warmth—are merged into a single dataset (future_data) for use in the final prediction model.
future_data <- cumGDD_dormancy_2025 %>%
left_join(winter_chill_2025, by = c("location", "year")) %>%
left_join(spring_warming_2025, by = c("location", "year"))
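Before predicting, it is worth confirming that each location contributes one complete row of 2025 predictors, in particular that a dormancy release date was actually reached within the scraped window (an assumed check, not part of the original code):
# Assumed check: every location should have finite, non-missing 2025 predictors.
future_data %>%
  mutate(complete = is.finite(dormancy_doy) &
           !is.na(cumGDD_dormancy) & !is.na(winter_chill) & !is.na(spring_warmth))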
Using a pre-trained XGBoost model (final_fit), predictions for the bloom day-of-year in 2025 are generated based on the future predictors. The predicted DOY is then converted into an actual calendar date (assuming January 1 is day 1). ChatGPT generated the complete prediction code.
predictions_2025 <- future_data %>%
  bind_cols(predict(final_fit, new_data = future_data)) %>%
  rename(pred_doy = .pred) %>%
  mutate(pred_date = as.Date("2025-01-01") + (pred_doy - 1)) %>%
  select(location, pred_date)
print(predictions_2025)
## # A tibble: 5 × 2
## location pred_date
## <chr> <date>
## 1 kyoto 2025-04-04
## 2 liestal 2025-03-30
## 3 newyorkcity 2025-04-08
## 4 vancouver 2025-03-25
## 5 washingtondc 2025-03-27