Data#

Independently of the model, data will be needed either to create or to validate it. The amount of data required is determined by a trade-off between complexity, affordability, and accuracy, and by the type of model. Data-driven models will likely require a broad range and large volume of data, while mechanistic models may only need data for a defined set of variables.

1.1 Data Sources#

1.2 Loading data#

Even if you are new to Python, you are probably aware of some basic data structures like lists and dictionaries, and maybe even NumPy arrays.

We will now introduce a new data structure: the DataFrame.

DataFrames#

A DataFrame is a tabular data structure that stores information in rows and columns, where each column has a name and can hold a different type of data. Both the rows and columns can have labels, making it easier to access and manipulate specific parts of the data (very colloquially, think of a DataFrame as a mix between a dictionary and a spreadsheet).

In Python, we will use pandas.DataFrame (quick introduction).
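For example, here is a minimal sketch of building a DataFrame from a dictionary, where each key becomes a column name (the station names and values are made up for illustration):

import pandas as pd

# Hypothetical data: each dictionary key becomes a column
df = pd.DataFrame({'station': ['A', 'B', 'C'],
                   'temperature [C]': [-5.2, -7.1, -10.4]})
print(df)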

DataFrames are particularly useful because we can easily manipulate data using a wide variety of built-in methods. Additionally, pandas.DataFrame can easily handle data with timestamps, meaning that we can assign a date and time to each observation and manipulate these variables without worrying about time zones, leap years, or other time-related issues. For example, you can subtract one date from another to find the number of days between them, access data for specific years or months, iterate through years, and so on.

In addition to rows and columns, DataFrames have an .index. The index is a series of labels associated with each row; by default, the index assigns an integer, starting from 0, to each row. When working with time series it may be convenient to use a DatetimeIndex instead, which assigns a timestamp label to each row.
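As a minimal sketch of both ideas, with made-up dates and values, we can replace the default integer index with a DatetimeIndex and then use it for date arithmetic and label-based access:

import pandas as pd

# Hypothetical daily series
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]},
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

print(df.index.year)              # the year of each timestamp
print(df.index[-1] - df.index[0]) # subtracting dates gives a Timedelta (2 days)
print(df.loc['2020-01-02'])       # label-based access using a date string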

Task 1:

The files corresponding to the sources mentioned above are available in the repository in the folder data/raw_files. Choose a file, load its contents into a pandas.DataFrame, set the index to a datetime object, and plot the time series.

# Exercise 1.

import pandas as pd
import matplotlib.pyplot as plt
import iceclassic as ice

# Load the Berkeley temperature file (tab-separated, 23 header rows to skip)
#Temp=pd.read_csv('../../data/raw_files/Berkeley temp.txt',skiprows=23,sep='\t',index_col=0)
file1=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/Berkeley%20temp.txt')
Temp=pd.read_csv(file1,skiprows=23,sep='\t',index_col=0)

# Convert the index (dates stored as YYYYMMDD) to a DatetimeIndex
Temp.index=pd.to_datetime(Temp.index,format='%Y%m%d')

plt.figure(figsize=(20,5))
plt.plot(Temp)
plt.title("Temperature Data")
plt.xlabel("Date")
plt.ylabel("Temperature (C)")
plt.show()
[Figure: plot of the temperature time series]

Task 2:

Choose another file, load it, and merge it with the previously loaded DataFrame. Make sure that the resulting DataFrame contains two columns and a datetime index.

# Exercise 2.

# Load the PDO file (semicolon-separated, 1 header row to skip)
#PDO=pd.read_csv('../../data/raw_files/PDO.csv',skiprows=1,index_col=0,sep=';')
file2=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/PDO.csv')
PDO=pd.read_csv(file2,skiprows=1,index_col=0,sep=';')
# The PDO dates are stored as YYYYMM
PDO.index=pd.to_datetime(PDO.index,format='%Y%m')

# Merge on the datetime index, keeping all timestamps from both DataFrames
merged=pd.merge(Temp,PDO,left_index=True,right_index=True,how='outer')
merged.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 46500 entries, 1854-01-01 to 2020-02-01
Data columns (total 2 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   # TAVG [degree C] Air Surface Temperature  46018 non-null  float64
 1   Value                                      1994 non-null   float64
dtypes: float64(2)
memory usage: 1.1 MB
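Note that how='outer' keeps every timestamp that appears in either DataFrame, which is why the merged DataFrame has more rows than either input alone; wherever a date exists in only one of the two sources, the other column is filled with NaN.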

1.3 Basic DataFrame manipulation#

The function pd.read_csv can load most data files. However, dealing with files with varying separators, delimiters, data types, and formats can be time-consuming.
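As a sketch of the kinds of arguments involved, here we parse a small in-memory file (a made-up stand-in for a messy data file) that uses semicolon separators, comma decimals, and a comment line that must be skipped:

import io
import pandas as pd

# In-memory stand-in for a messy data file
raw = io.StringIO(
    '# source: hypothetical station\n'
    'date;value\n'
    '2020-01-01;1,5\n'
    '2020-01-02;2,25\n')

df = pd.read_csv(raw,
                 skiprows=1,       # skip the comment line
                 sep=';',          # semicolon field separator
                 decimal=',',      # comma used as decimal separator
                 index_col=0,      # first column as the index
                 parse_dates=True) # parse the index as dates
print(df)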

To avoid loading and merging each DataFrame individually, all data sources have been merged into a single text file that can be easily loaded. The file can be found in data/Time_series_DATA.txt.

The file has more than 20 variables (columns), some of which have daily observations spanning more than a century. To contextualize these variables, the following interactive map shows the weather stations where they were measured.

Note: plot_interactive_map() needs a set of HydroBASINS shapefiles (one per detail level) that have not yet been uploaded to the repository, so the call below is commented out.

#ice.plot_interactive_map(plot_only_nearby_basin=False)

Task 3:

Read the documentation of explore_contents() from the iceclassic package and use it to explore the contents of Time_series_DATA.txt.

Decide which variables might not be relevant to the problem. Explain.

# Load the merged data file (tab-separated, 162 header rows to skip)
#Data=pd.read_csv("../../data/Time_series_DATA.txt",skiprows=149,index_col=0,sep='\t')
file3=ice.import_data_browser('https://raw.githubusercontent.com/iceclassic/mude/main/book/data_files/time_serie_data.txt')
Data=pd.read_csv(file3,skiprows=162,index_col=0,sep='\t')
Data.index = pd.to_datetime(Data.index, format="%Y-%m-%d")
ice.explore_contents(Data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 39309 entries, 1901-02-01 to 2024-02-06
Data columns (total 28 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Regional: Air temperature [C]                      38563 non-null  float64
 1   Days since start of year                           38563 non-null  float64
 2   Days until break up                                38563 non-null  float64
 3   Nenana: Rainfall [mm]                              29547 non-null  float64
 4   Nenana: Snowfall [mm]                              19945 non-null  float64
 5   Nenana: Snow depth [mm]                            15984 non-null  float64
 6   Nenana: Mean water temperature [C]                 2418 non-null   float64
 7   Nenana: Mean Discharge [m3/s]                      22562 non-null  float64
 8   Nenana: Air temperature [C]                        31171 non-null  float64
 9   Fairbanks: Average wind speed [m/s]                9797 non-null   float64
 10  Fairbanks: Rainfall [mm]                           29586 non-null  float64
 11  Fairbanks: Snowfall [mm]                           29586 non-null  float64
 12  Fairbanks: Snow depth [mm]                         29555 non-null  float64
 13  Fairbanks: Air Temperature [C]                     29587 non-null  float64
 14  IceThickness [cm]                                  461 non-null    float64
 15  Regional: Solar Surface Irradiance [W/m2]          86 non-null     float64
 16  Regional: Cloud coverage [%]                       1463 non-null   float64
 17  Global: ENSO-Southern oscillation index            876 non-null    float64
 18  Gulkana Temperature [C]                            19146 non-null  float64
 19  Gulkana Precipitation [mm]                         18546 non-null  float64
 20  Gulkana: Glacier-wide winter mass balance [m.w.e]  58 non-null     float64
 21  Gulkana: Glacier-wide summer mass balance [m.w.e]  58 non-null     float64
 22  Global: Pacific decadal oscillation index          1346 non-null   float64
 23  Global: Artic oscillation index                    889 non-null    float64
 24  Nenana: Gage Height [m]                            4666 non-null   float64
 25  IceThickness gradient [cm/day]: Forward            426 non-null    float64
 26  ceThickness gradient [cm/day]: Backward            426 non-null    float64
 27  ceThickness gradient [cm/day]: Central             391 non-null    float64
dtypes: float64(28)
memory usage: 8.7 MB
[Output figures: overview plots of the dataset contents]

Task 4:

The file contains data for four distinct temperature time series. Use compare_columns to visually compare them, then use .drop() to eliminate the columns that might be redundant.

temperature_columns=['Regional: Air temperature [C]','Nenana: Air temperature [C]','Fairbanks: Air Temperature [C]','Gulkana Temperature [C]']
ice.compare_columns(Data,temperature_columns)
# Drop the Nenana and Fairbanks series, keeping the Regional and Gulkana ones
Data=Data.drop(columns=temperature_columns[1:3])
[Figure: visual comparison of the four temperature time series]

The contents of a DataFrame can be grouped into subsets by means of simple indexing.

There are three main ways to use indexing to create a subset of a DataFrame (a short sketch follows the list):

  • df.loc[]: uses the labels of rows (index) and columns (column names) to select the subset

  • df.iloc[]: uses integer positions to select the subset

  • df[]: uses a column name to select that column. Additionally, we can pass a list/array of boolean values to use as a mask
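A minimal sketch of the three approaches, using made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

print(df.loc['y', 'a'])  # label-based: row 'y', column 'a'
print(df.iloc[0, 1])     # position-based: first row, second column
print(df['b'])           # select a column by name
print(df[df['a'] > 1])   # boolean mask: rows where column 'a' is greater than 1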

Task 5:

Create a new DataFrame with a subset containing only the data from 1950 onwards.

# Boolean mask on the year attribute of the DatetimeIndex
Data_2 = Data[(Data.index.year >= 1950)]
Data_2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 27065 entries, 1950-01-01 to 2024-02-06
Data columns (total 26 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Regional: Air temperature [C]                      26510 non-null  float64
 1   Days since start of year                           26510 non-null  float64
 2   Days until break up                                26510 non-null  float64
 3   Nenana: Rainfall [mm]                              22266 non-null  float64
 4   Nenana: Snowfall [mm]                              13322 non-null  float64
 5   Nenana: Snow depth [mm]                            12965 non-null  float64
 6   Nenana: Mean water temperature [C]                 2418 non-null   float64
 7   Nenana: Mean Discharge [m3/s]                      22562 non-null  float64
 8   Fairbanks: Average wind speed [m/s]                9797 non-null   float64
 9   Fairbanks: Rainfall [mm]                           22250 non-null  float64
 10  Fairbanks: Snowfall [mm]                           22250 non-null  float64
 11  Fairbanks: Snow depth [mm]                         22250 non-null  float64
 12  IceThickness [cm]                                  461 non-null    float64
 13  Regional: Solar Surface Irradiance [W/m2]          86 non-null     float64
 14  Regional: Cloud coverage [%]                       876 non-null    float64
 15  Global: ENSO-Southern oscillation index            876 non-null    float64
 16  Gulkana Temperature [C]                            19146 non-null  float64
 17  Gulkana Precipitation [mm]                         18546 non-null  float64
 18  Gulkana: Glacier-wide winter mass balance [m.w.e]  58 non-null     float64
 19  Gulkana: Glacier-wide summer mass balance [m.w.e]  58 non-null     float64
 20  Global: Pacific decadal oscillation index          889 non-null    float64
 21  Global: Artic oscillation index                    889 non-null    float64
 22  Nenana: Gage Height [m]                            4666 non-null   float64
 23  IceThickness gradient [cm/day]: Forward            426 non-null    float64
 24  ceThickness gradient [cm/day]: Backward            426 non-null    float64
 25  ceThickness gradient [cm/day]: Central             391 non-null    float64
dtypes: float64(26)
memory usage: 5.6 MB

Task 6:

Create a new DataFrame which contains only the columns ['Regional: Air temperature [C]','Nenana: Rainfall [mm]'], for the years 2008-2010, from Jan-01 to Mar-21.

Filter the original DataFrame using masks.

cols=['Regional: Air temperature [C]','Nenana: Rainfall [mm]']
years=[2008,2009,2010]
date_1='01/01'
date_2='03/21'

year_mask = Data.index.year.isin(years) # (Data.index.year >= min(years)) & (Data.index.year <= max(years)) is also an option
# String comparison works here because strftime('%m/%d') gives zero-padded, lexically sortable labels
date_mask = (Data.index.strftime('%m/%d') >= date_1) & (Data.index.strftime('%m/%d') <= date_2)
filtered_df = Data.loc[year_mask & date_mask, cols]

plt.figure(figsize=(20,5))
plt.plot(filtered_df.index,filtered_df,alpha=0.4,marker='o')
plt.xlabel("Date")
plt.ylabel("Values")
plt.title("Filtered DataFrame")
plt.legend(filtered_df.columns)
plt.show()
[Figure: plot of the filtered DataFrame]

1.4 Interactive Plot#

The DataFrame has decades of observations, which causes the x-axis (dates) of a static plot to lose detail. We could make the figure larger or plot a subset of the DataFrame; alternatively, we can use plot_columns_interactive() to create a plot where we can scroll and zoom in on specific dates.

Task 7:

Read the documentation for plot_columns_interactive() from the iceclassic package and create an interactive plot.

Note: plot_columns_interactive() internally uses a file with the break-up dates, which is not yet available at the GitHub path, so the call below is commented out. Alternatively, the break-up dates can be extracted from the DataFrame with break_up_dates=Data.index[Data['Days until break up']==0].

column_groups = {
    'Group 1': ['Regional: Air temperature [C]','Gulkana Temperature [C]'],
    'Group 2': ['Nenana: Snow depth [mm]'],
    'Group 3': ['Nenana: Mean Discharge [m3/s]']}

# Plot the specified columns with default y_domains and focus on a specific date
#ice.plot_columns_interactive(Data, column_groups, title="Break up times & Global Variables at Tenana River-Nenana, AK with break-up dates")