CrocoLake

CrocoLake contains observations from different databases, converted and stored in a uniform schema in parquet format. This documentation describes both the database schema and how each dataset is brought into CrocoLake.

CrocoLake comes in two versions:

  • PHY: contains physical measurements (temperature, salinity, pressure);

  • BGC: contains physical and biogeochemical measurements (e.g. temperature, dissolved oxygen, chlorophyll).

CrocoLake’s conventions

The naming convention is largely based on Argo’s convention.

Variables

Variables and their units are listed below with their presence in PHY and/or BGC CrocoLake versions.

| Variable name | Long name | Units | dtype | PHY | BGC |
|---|---|---|---|---|---|
| DB_NAME | Database name (e.g. Argo, GLODAP, etc.) | | string | ✔ | ✔ |
| LATITUDE | Latitude | degree_north | float64 | ✔ | ✔ |
| LONGITUDE | Longitude | degree_east | float64 | ✔ | ✔ |
| PRES | Pressure | dbar | float32 | ✔ | ✔ |
| JULD | Timestamp | days since 1950-01-01 00:00:00 UTC | timestamp[ns] | ✔ | ✔ |
| TEMP | Temperature | degree_Celsius | float32 | ✔ | ✔ |
| PSAL | Practical salinity | psu | float32 | ✔ | ✔ |
| DOXY | Dissolved oxygen | micromole/kg | float32 | ✖️ | ✔ |
| BBP470 | 470nm particle backscattering | m-1 | float32 | ✖️ | ✔ |
| BBP532 | 532nm particle backscattering | m-1 | float32 | ✖️ | ✔ |
| BBP700 | 700nm particle backscattering | m-1 | float32 | ✖️ | ✔ |
| TURBIDITY | Turbidity | ntu | float32 | ✖️ | ✔ |
| CP660 | 660nm particle beam attenuation | m-1 | float32 | ✖️ | ✔ |
| CHLA | Chlorophyll-A | mg/m3 | float32 | ✖️ | ✔ |
| CDOM | Concentration of coloured dissolved organic matter in sea water | ppb | float32 | ✖️ | ✔ |
| NITRATE | Nitrate NO3 | micromole/kg | float32 | ✖️ | ✔ |
| BISULFIDE | Bisulfide | micromole/kg | float32 | ✖️ | ✔ |
| PH_IN_SITU_TOTAL | pH | dimensionless | float32 | ✖️ | ✔ |
| DOWN_IRRADIANCE380 | 380nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✔ |
| DOWN_IRRADIANCE412 | 412nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✔ |
| DOWN_IRRADIANCE443 | 443nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✔ |
| DOWN_IRRADIANCE490 | 490nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✔ |
| DOWN_IRRADIANCE555 | 555nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✔ |
| UP_RADIANCE380 | 380nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✔ |
| UP_RADIANCE412 | 412nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✔ |
| UP_RADIANCE443 | 443nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✔ |
| UP_RADIANCE490 | 490nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✔ |
| UP_RADIANCE555 | 555nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✔ |
| DOWNWELLING_PAR | Downwelling photosynthetic available radiation | microMoleQuanta/m^2/sec | float32 | ✖️ | ✔ |
| CFC11 | CFC-11 Trichlorofluoromethane | picomole/kg | float32 | ✖️ | ✔ |
| CFC12 | CFC-12 Dichlorodifluoromethane | picomole/kg | float32 | ✖️ | ✔ |
| CFC113 | CFC-113 trichlorotrifluoroethane | picomole/kg | float32 | ✖️ | ✔ |
| SILICATE | Silicate | micromole/kg | float32 | ✖️ | ✔ |
| PHOSPHATE | Phosphate | micromole/kg | float32 | ✖️ | ✔ |
| TCO2 | TCO2 (dissolved inorganic carbon) | micromole/kg | float32 | ✖️ | ✔ |
| TOT_ALKALINITY | Total alkalinity | micromole/kg | float32 | ✖️ | ✔ |
| CCL4 | Carbon tetrachloride (CCl4) | picomole/kg | float32 | ✖️ | ✔ |
| SF6 | Sulfur hexafluoride (SF6) | femtomole/kg | float32 | ✖️ | ✔ |
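
JULD is listed with the Argo-style reference "days since 1950-01-01 00:00:00 UTC" but is stored as a timestamp. If you need to compare it with sources that still carry the numeric day count, the conversion can be sketched as below. The helper names `juld_days_to_datetime` and `datetime_to_juld_days` are hypothetical and not part of CrocoLake; only the reference epoch is taken from the table.

```python
from datetime import datetime, timedelta, timezone

# Reference epoch taken from the JULD units listed in the table.
JULD_EPOCH = datetime(1950, 1, 1, tzinfo=timezone.utc)


def juld_days_to_datetime(days: float) -> datetime:
    """Convert fractional days since 1950-01-01 00:00:00 UTC to a UTC datetime."""
    return JULD_EPOCH + timedelta(days=days)


def datetime_to_juld_days(ts: datetime) -> float:
    """Convert a timezone-aware datetime back to fractional days since the epoch."""
    return (ts - JULD_EPOCH) / timedelta(days=1)


# 25567 days after the epoch is 2020-01-01; the extra half day gives noon UTC.
example = juld_days_to_datetime(25567.5)
print(example.isoformat())
```

Using timezone-aware datetimes keeps the UTC reference explicit and avoids silent shifts if local-time values are mixed in.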

Important notes:

  • Each measured parameter <PARAM> has a corresponding quality-control flag and error variable, called <PARAM>_QC (uint8) and <PARAM>_ERROR (float32) respectively. This applies to all variables except LATITUDE, LONGITUDE, JULD, and DB_NAME.

  • For the physical version, the variable DATA_MODE (string) reports the recording mode for Argo’s measurements (‘R’ for real time, ‘A’ for adjusted, ‘D’ for delayed). The intent is to favor the best measurements available, but users should proceed with care when using real-time data for scientific analysis (see here and also here).

  • For the same purpose, the BGC version contains the <PARAM>_DATA_MODE variable for each measured parameter (except LATITUDE, LONGITUDE, JULD).

  • The database is stored as a table, where each row contains measurements at a point in the <DB_NAME, LATITUDE, LONGITUDE, JULD, PRES> space. If a variable was not measured at a point, it is generally set as missing using pandas’ NA. This allows missing data to be treated consistently across data types when CrocoLake is generated. When reading CrocoLake, however, each language and package handles missing data in its own way, so it is recommended that you familiarize yourself with the tools you are using and know how they compute statistics when missing data are present (relevant discussions for pandas here, and for dask here and here).

  • The data is generally stored using pyarrow as the backend, so the variables’ data types are pyarrow’s implementations. This should be of little (if any) relevance to most users, as common data analysis packages like pandas and dask are compatible with pyarrow’s data types (and can convert them to numpy dtypes if needed). Matlab and Julia also appear to handle them without issues.
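
To make the note on missing data concrete, here is a minimal sketch of how pandas treats NA values in statistics. The rows are invented for illustration, and pandas’ nullable `Float32` extension dtype is used as a stand-in for the pyarrow-backed columns, so behavior with the actual files may differ in detail.

```python
import pandas as pd

# Hypothetical rows shaped like the CrocoLake table; values are invented.
df = pd.DataFrame(
    {
        "DB_NAME": ["Argo", "Argo", "GLODAP"],
        "PRES": pd.array([5.0, 10.0, 5.0], dtype="Float32"),
        "TEMP": pd.array([20.0, pd.NA, 18.0], dtype="Float32"),
        "DOXY": pd.array([pd.NA, pd.NA, 210.0], dtype="Float32"),
    }
)

# By default pandas skips missing values when computing statistics.
mean_skip = df["TEMP"].mean()

# With skipna=False a single missing value makes the result missing.
mean_strict = df["TEMP"].mean(skipna=False)

# count() reports only the non-missing entries.
n_temp = int(df["TEMP"].count())

# Keep only rows where both variables were actually measured.
complete = df.dropna(subset=["TEMP", "DOXY"])

print(mean_skip, mean_strict, n_temp, len(complete))
```

Other tools may default to the stricter behavior, which is why checking the defaults before computing averages over sparse variables is worthwhile.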

Quality-control

CrocoLake only contains quality-controlled measurements, allowing you to focus on the analysis steps.

Some notes:

  • The quality-control flag <PARAM>_QC generally carries over the ‘good data’ value from the original source; when the flag is missing in the original source, it is set to 1 if the source data is deemed good (bad data are discarded).

  • For how quality-control is performed on each dataset, see the dedicated page.

  • Not all measurements have error estimates <PARAM>_ERROR, even if they are deemed of good quality. In these cases, the corresponding <PARAM>_ERROR value is represented as missing.
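
As an illustration of the <PARAM>_QC and <PARAM>_ERROR convention, the sketch below builds the companion column names for a parameter and separates rows that carry an error estimate from those that do not. The helper `companion_columns` and all values are hypothetical; the flag values shown are placeholders, and the meaning of each flag depends on the original source.

```python
import pandas as pd


def companion_columns(param: str) -> tuple[str, str]:
    """Return the QC-flag and error column names for a measured parameter."""
    return f"{param}_QC", f"{param}_ERROR"


# Hypothetical rows; values and flags are invented for illustration.
df = pd.DataFrame(
    {
        "TEMP": pd.array([20.0, 19.5, 18.0], dtype="Float32"),
        "TEMP_QC": pd.array([1, 1, 2], dtype="UInt8"),
        "TEMP_ERROR": pd.array([0.002, pd.NA, 0.002], dtype="Float32"),
    }
)

qc_col, err_col = companion_columns("TEMP")

# Count how many rows carry flag value 1.
n_flag_one = int((df[qc_col] == 1).sum())

# Rows with an error estimate, and rows where it is represented as missing.
with_error = df[df[err_col].notna()]
without_error = df[df[err_col].isna()]

print(qc_col, err_col, n_flag_one, len(with_error), len(without_error))
```

Rows in `without_error` are still quality-controlled measurements; they simply have no error estimate attached.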