Data#

Independently of the model, data will be needed either to create or to validate it. The amount of data required is determined by a trade-off between complexity, affordability, and accuracy, and by the type of model. Data-driven models will likely require a broad range and large volume of data, while mechanistic models may only need data for a defined set of variables.

1.1 Data Sources#

1.2 Loading data#

Even if you are new to Python, you are probably aware of some basic data structures like lists and dictionaries, and maybe even NumPy arrays.

We will now introduce a new data structure: the DataFrame.

DataFrames#

A DataFrame is a tabular data structure that stores information in rows and columns, where each column has a name and can hold a different type of data. Both the rows and columns can have labels, making it easier to access and manipulate specific parts of the data (very colloquially, think of a DataFrame as a mix between a dictionary and a spreadsheet).

In Python, we will use pandas.DataFrame (quick introduction).
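For example, here is a minimal sketch of building a DataFrame from a dictionary, where each key becomes a column name (the station names and values are made up for illustration):

import pandas as pd

# Hypothetical data: each dictionary key becomes a column
df = pd.DataFrame({'station': ['A', 'B', 'C'],
                   'temperature [C]': [-5.2, -7.1, -10.4]})
print(df)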

DataFrames are particularly useful because we can easily manipulate data using a wide variety of built-in methods. Additionally, pandas.DataFrame can easily handle data with timestamps, meaning that we can assign a date and time to each observation and manipulate these variables without worrying about time zones, leap years, or other time-related issues. For example, you can subtract one date from another to find the number of days between them, access data for specific years or months, iterate through years, and so on.

In addition to rows and columns, DataFrames have an .index. The index is a series of labels associated with each row; by default, the index assigns an integer, starting from 0, to each row. When working with time series it may be convenient to use a DatetimeIndex instead, which assigns a timestamp label to each row.
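As a minimal sketch of both ideas, with made-up dates and values, we can replace the default integer index with a DatetimeIndex and then use it for date arithmetic and label-based access:

import pandas as pd

# Hypothetical daily series
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]},
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

print(df.index.year)              # the year of each timestamp
print(df.index[-1] - df.index[0]) # subtracting dates gives a Timedelta (2 days)
print(df.loc['2020-01-02'])       # label-based access using a date string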

Task 1:

The files corresponding to the sources mentioned above are available in the repository in the folder data/raw_files. Choose a file, load its contents into a pandas.DataFrame, set the index to a datetime object, and plot the time series.

# Exercise 1.

import pandas as pd
import matplotlib.pyplot as plt
import iceclassic as ice

# Load the Berkeley temperature file (tab-separated, 23 header rows to skip)
#Temp=pd.read_csv('../../data/raw_files/Berkeley temp.txt',skiprows=23,sep='\t',index_col=0)
file1=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/Berkeley%20temp.txt')
Temp=pd.read_csv(file1,skiprows=23,sep='\t',index_col=0)

# Convert the index (dates stored as YYYYMMDD) to a DatetimeIndex
Temp.index=pd.to_datetime(Temp.index,format='%Y%m%d')

plt.figure(figsize=(20,5))
plt.plot(Temp)
plt.title("Temperature Data")
plt.xlabel("Date")
plt.ylabel("Temperature (C)")
plt.show()
[Figure: plot of the temperature time series]

Task 2:

Choose another file, load it, and merge it with the previously loaded DataFrame. Make sure that the resulting DataFrame contains two columns and a datetime index.

# Exercise 2.

# Load the PDO file (semicolon-separated, 1 header row to skip)
#PDO=pd.read_csv('../../data/raw_files/PDO.csv',skiprows=1,index_col=0,sep=';')
file2=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/PDO.csv')
PDO=pd.read_csv(file2,skiprows=1,index_col=0,sep=';')
# The PDO dates are stored as YYYYMM
PDO.index=pd.to_datetime(PDO.index,format='%Y%m')

# Merge on the datetime index, keeping all timestamps from both DataFrames
merged=pd.merge(Temp,PDO,left_index=True,right_index=True,how='outer')
merged.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 46500 entries, 1854-01-01 to 2020-02-01
Data columns (total 2 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   # TAVG [degree C] Air Surface Temperature  46018 non-null  float64
 1   Value                                      1994 non-null   float64
dtypes: float64(2)
memory usage: 1.1 MB
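Note that how='outer' keeps every timestamp that appears in either DataFrame, which is why the merged DataFrame has more rows than either input alone; wherever a date exists in only one of the two sources, the other column is filled with NaN.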

1.3 Basic DataFrame manipulation#

The function pd.read_csv can load most data files. However, dealing with files with varying separators, delimiters, data types, and formats can be time-consuming.
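As a sketch of the kinds of arguments involved, here we parse a small in-memory file (a made-up stand-in for a messy data file) that uses semicolon separators, comma decimals, and a comment line that must be skipped:

import io
import pandas as pd

# In-memory stand-in for a messy data file
raw = io.StringIO(
    '# source: hypothetical station\n'
    'date;value\n'
    '2020-01-01;1,5\n'
    '2020-01-02;2,25\n')

df = pd.read_csv(raw,
                 skiprows=1,       # skip the comment line
                 sep=';',          # semicolon field separator
                 decimal=',',      # comma used as decimal separator
                 index_col=0,      # first column as the index
                 parse_dates=True) # parse the index as dates
print(df)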

To avoid loading and merging each DataFrame individually, all data sources have been merged into a single text file that can be easily loaded. The file can be found in data/Time_series_DATA.txt.

The file has more than 20 variables (columns), some of which have daily observations spanning more than a century. To contextualize these variables, the following interactive map shows the weather stations where they were measured.

Note: plot_interactive_map() needs a set of HydroBASINS shapefiles (one per detail level) that have not yet been uploaded to the repository, so the call below is commented out.

#ice.plot_interactive_map(plot_only_nearby_basin=False)

Task 3:

Read the documentation of explore_contents() from the iceclassic package and use it to explore the contents of Time_series_DATA.txt.

Decide which variables might not be relevant to the problem. Explain.

# Load the merged data file (tab-separated, 162 header rows to skip)
#Data=pd.read_csv("../../data/Time_series_DATA.txt",skiprows=149,index_col=0,sep='\t')
file3=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/time_serie_data.txt')
Data=pd.read_csv(file3,skiprows=162,index_col=0,sep='\t')
Data.index = pd.to_datetime(Data.index, format="%Y-%m-%d")
ice.explore_contents(Data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 39309 entries, 1901-02-01 to 2024-02-06
Data columns (total 28 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Regional: Air temperature [C]                      38563 non-null  float64
 1   Days since start of year                           38563 non-null  float64
 2   Days until break up                                38563 non-null  float64
 3   Nenana: Rainfall [mm]                              29547 non-null  float64
 4   Nenana: Snowfall [mm]                              19945 non-null  float64
 5   Nenana: Snow depth [mm]                            15984 non-null  float64
 6   Nenana: Mean water temperature [C]                 2418 non-null   float64
 7   Nenana: Mean Discharge [m3/s]                      22562 non-null  float64
 8   Nenana: Air temperature [C]                        31171 non-null  float64
 9   Fairbanks: Average wind speed [m/s]                9797 non-null   float64
 10  Fairbanks: Rainfall [mm]                           29586 non-null  float64
 11  Fairbanks: Snowfall [mm]                           29586 non-null  float64
 12  Fairbanks: Snow depth [mm]                         29555 non-null  float64
 13  Fairbanks: Air Temperature [C]                     29587 non-null  float64
 14  IceThickness [cm]                                  461 non-null    float64
 15  Regional: Solar Surface Irradiance [W/m2]          86 non-null     float64
 16  Regional: Cloud coverage [%]                       1463 non-null   float64
 17  Global: ENSO-Southern oscillation index            876 non-null    float64
 18  Gulkana Temperature [C]                            19146 non-null  float64
 19  Gulkana Precipitation [mm]                         18546 non-null  float64
 20  Gulkana: Glacier-wide winter mass balance [m.w.e]  58 non-null     float64
 21  Gulkana: Glacier-wide summer mass balance [m.w.e]  58 non-null     float64
 22  Global: Pacific decadal oscillation index          1346 non-null   float64
 23  Global: Artic oscillation index                    889 non-null    float64
 24  Nenana: Gage Height [m]                            4666 non-null   float64
 25  IceThickness gradient [cm/day]: Forward            426 non-null    float64
 26  ceThickness gradient [cm/day]: Backward            426 non-null    float64
 27  ceThickness gradient [cm/day]: Central             391 non-null    float64
dtypes: float64(28)
memory usage: 8.7 MB
[Output figures: overview plots of the dataset contents]

Task 4:

The file contains data for four distinct temperature time series. Use compare_columns to visually compare them, then use .drop() to eliminate the columns that might be redundant.

temperature_columns=['Regional: Air temperature [C]','Nenana: Air temperature [C]','Fairbanks: Air Temperature [C]','Gulkana Temperature [C]']
ice.compare_columns(Data,temperature_columns)
# Drop the Nenana and Fairbanks series, keeping the Regional and Gulkana ones
Data=Data.drop(columns=temperature_columns[1:3])
[Figure: visual comparison of the four temperature time series]

The contents of a DataFrame can be grouped into subsets by means of simple indexing.

There are three main ways to use indexing to create a subset of a DataFrame (a short sketch follows the list):

  • df.loc[]: uses the labels of rows (index) and columns (column names) to select the subset

  • df.iloc[]: uses integer positions to select the subset

  • df[]: uses a column name to select that column. Additionally, we can pass a list/array of boolean values to use as a mask
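A minimal sketch of the three approaches, using made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

print(df.loc['y', 'a'])  # label-based: row 'y', column 'a'
print(df.iloc[0, 1])     # position-based: first row, second column
print(df['b'])           # select a column by name
print(df[df['a'] > 1])   # boolean mask: rows where column 'a' is greater than 1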

Task 5:

Create a new DataFrame with a subset containing only the data from 1950 onwards.

# Boolean mask on the year attribute of the DatetimeIndex
Data_2 = Data[(Data.index.year >= 1950)]
Data_2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 27065 entries, 1950-01-01 to 2024-02-06
Data columns (total 26 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Regional: Air temperature [C]                      26510 non-null  float64
 1   Days since start of year                           26510 non-null  float64
 2   Days until break up                                26510 non-null  float64
 3   Nenana: Rainfall [mm]                              22266 non-null  float64
 4   Nenana: Snowfall [mm]                              13322 non-null  float64
 5   Nenana: Snow depth [mm]                            12965 non-null  float64
 6   Nenana: Mean water temperature [C]                 2418 non-null   float64
 7   Nenana: Mean Discharge [m3/s]                      22562 non-null  float64
 8   Fairbanks: Average wind speed [m/s]                9797 non-null   float64
 9   Fairbanks: Rainfall [mm]                           22250 non-null  float64
 10  Fairbanks: Snowfall [mm]                           22250 non-null  float64
 11  Fairbanks: Snow depth [mm]                         22250 non-null  float64
 12  IceThickness [cm]                                  461 non-null    float64
 13  Regional: Solar Surface Irradiance [W/m2]          86 non-null     float64
 14  Regional: Cloud coverage [%]                       876 non-null    float64
 15  Global: ENSO-Southern oscillation index            876 non-null    float64
 16  Gulkana Temperature [C]                            19146 non-null  float64
 17  Gulkana Precipitation [mm]                         18546 non-null  float64
 18  Gulkana: Glacier-wide winter mass balance [m.w.e]  58 non-null     float64
 19  Gulkana: Glacier-wide summer mass balance [m.w.e]  58 non-null     float64
 20  Global: Pacific decadal oscillation index          889 non-null    float64
 21  Global: Artic oscillation index                    889 non-null    float64
 22  Nenana: Gage Height [m]                            4666 non-null   float64
 23  IceThickness gradient [cm/day]: Forward            426 non-null    float64
 24  ceThickness gradient [cm/day]: Backward            426 non-null    float64
 25  ceThickness gradient [cm/day]: Central             391 non-null    float64
dtypes: float64(26)
memory usage: 5.6 MB

Task 6:

Create a new DataFrame which contains only the columns ['Regional: Air temperature [C]','Nenana: Rainfall [mm]'], for the years 2008-2010, from Jan-01 to Mar-21.

Filter the original DataFrame using masks.

cols=['Regional: Air temperature [C]','Nenana: Rainfall [mm]']
years=[2008,2009,2010]
date_1='01/01'
date_2='03/21'

year_mask = Data.index.year.isin(years) # (Data.index.year >= min(years)) & (Data.index.year <= max(years)) is also an option
# String comparison works here because strftime('%m/%d') gives zero-padded, lexically sortable labels
date_mask = (Data.index.strftime('%m/%d') >= date_1) & (Data.index.strftime('%m/%d') <= date_2)
filtered_df = Data.loc[year_mask & date_mask, cols]

plt.figure(figsize=(20,5))
plt.plot(filtered_df.index,filtered_df,alpha=0.4,marker='o')
plt.xlabel("Date")
plt.ylabel("Values")
plt.title("Filtered DataFrame")
plt.legend(filtered_df.columns)
plt.show()
[Figure: plot of the filtered DataFrame]

1.4 Interactive Plot#

The DataFrame has decades of observations, which causes the x-axis (dates) of a static plot to lose detail. We could make the figure larger or plot a subset of the DataFrame; alternatively, we can use plot_columns_interactive() to create a plot where we can scroll and zoom in on specific dates.

Task 7:

Read the documentation for plot_columns_interactive() from the iceclassic package and create an interactive plot.

Note: plot_columns_interactive() internally uses a file with the break-up dates, which is not yet available at the GitHub path, so the call below is commented out. Alternatively, the break-up dates can be extracted from the DataFrame with break_up_dates=Data.index[Data['Days until break up']==0].

column_groups = {
    'Group 1': ['Regional: Air temperature [C]','Gulkana Temperature [C]'],
    'Group 2': ['Nenana: Snow depth [mm]'],
    'Group 3': ['Nenana: Mean Discharge [m3/s]']}

# Plot the specified columns with default y_domains and focus on a specific date
#ice.plot_columns_interactive(Data, column_groups, title="Break up times & Global Variables at Tenana River-Nenana, AK with break-up dates")