CrocoLake
CrocoLake contains observations from different databases, converted and stored in a uniform schema in parquet format. This documentation describes both the database schema and how each dataset is brought into CrocoLake.
CrocoLake comes in two versions:
PHY: contains physical measurements (temperature, salinity, pressure);
BGC: contains physical and biogeochemical measurements (e.g. temperature, dissolved oxygen, chlorophyll).
Download links
CrocoLake is refreshed every weekend. You can download the most recent versions at these links:
CrocoLake-PHY (for all tools except Matlab; approx. 17 GB)
CrocoLake-BGC (for all tools except Matlab; approx. 6 GB)
CrocoLake-PHY (for Matlab; approx. 17 GB)
CrocoLake-BGC (for Matlab; approx. 6 GB)
The Matlab version is identical to the other versions, except that a few metadata files are removed because they are incompatible with Matlab’s parser.
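As a rough sketch of how a downloaded copy might be read in Python, the snippet below builds a row filter and forwards it to pandas' `read_parquet`. The path `path/to/CrocoLake-PHY`, the helper names `build_box_filters` and `load_subset`, and the chosen bounding box are assumptions made for illustration; only the column names are taken from the variables listed in this documentation.

```python
def build_box_filters(lat_min, lat_max, lon_min, lon_max):
    """Return parquet row filters selecting a latitude/longitude box."""
    # A flat list of (column, operator, value) tuples is combined with AND.
    return [
        ("LATITUDE", ">=", lat_min),
        ("LATITUDE", "<=", lat_max),
        ("LONGITUDE", ">=", lon_min),
        ("LONGITUDE", "<=", lon_max),
    ]


def load_subset(path, columns, filters):
    """Read only the requested columns and rows from a parquet dataset."""
    import pandas as pd  # imported lazily so the filter helper has no dependencies

    # `filters` is handed to the pyarrow engine, which can skip non-matching data.
    return pd.read_parquet(path, engine="pyarrow", columns=columns, filters=filters)


filters = build_box_filters(30.0, 45.0, -75.0, -50.0)
columns = ["DB_NAME", "LATITUDE", "LONGITUDE", "JULD", "PRES", "TEMP", "PSAL"]

# Hypothetical location of the unpacked download; adjust to your system.
# df = load_subset("path/to/CrocoLake-PHY", columns, filters)
```

Selecting columns and rows at read time keeps memory use down, which matters given the size of the full download.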
CrocoLake’s conventions
The naming convention is largely based on Argo’s convention.
Variables
Variables and their units are listed below, along with whether each is present in the PHY and/or BGC version of CrocoLake.
| Variable name | Long name | Units | dtype | PHY | BGC |
|---|---|---|---|---|---|
| DB_NAME | Database name (e.g. Argo, GLODAP, etc.) | | string | | |
| LATITUDE | Latitude | degree_north | float64 | ✅ | ✅ |
| LONGITUDE | Longitude | degree_east | float64 | ✅ | ✅ |
| PRES | Pressure | dbar | float32 | ✅ | ✅ |
| JULD | Timestamp | days since 1950-01-01 00:00:00 UTC | timestamp[ns] | ✅ | ✅ |
| TEMP | Temperature | degree_Celsius | float32 | ✅ | ✅ |
| PSAL | Practical salinity | psu | float32 | ✅ | ✅ |
| DOXY | Dissolved oxygen | micromole/kg | float32 | ✖️ | ✅ |
| BBP470 | 470nm particle backscattering | m-1 | float32 | ✖️ | ✅ |
| BBP532 | 532nm particle backscattering | m-1 | float32 | ✖️ | ✅ |
| BBP700 | 700nm particle backscattering | m-1 | float32 | ✖️ | ✅ |
| TURBIDITY | Turbidity | ntu | float32 | ✖️ | ✅ |
| CP660 | 660nm particle beam attenuation | m-1 | float32 | ✖️ | ✅ |
| CHLA | Chlorophyll-A | mg/m3 | float32 | ✖️ | ✅ |
| CDOM | Concentration of coloured dissolved organic matter in sea water | ppb | float32 | ✖️ | ✅ |
| NITRATE | Nitrate NO3 | micromole/kg | float32 | ✖️ | ✅ |
| BISULFIDE | Bisulfide | micromole/kg | float32 | ✖️ | ✅ |
| PH_IN_SITU_TOTAL | pH | dimensionless | float32 | ✖️ | ✅ |
| DOWN_IRRADIANCE380 | 380nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✅ |
| DOWN_IRRADIANCE412 | 412nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✅ |
| DOWN_IRRADIANCE443 | 443nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✅ |
| DOWN_IRRADIANCE490 | 490nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✅ |
| DOWN_IRRADIANCE555 | 555nm downwelling irradiance | W/m^2/nm | float32 | ✖️ | ✅ |
| UP_RADIANCE380 | 380nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✅ |
| UP_RADIANCE412 | 412nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✅ |
| UP_RADIANCE443 | 443nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✅ |
| UP_RADIANCE490 | 490nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✅ |
| UP_RADIANCE555 | 555nm upwelling radiance | W/m^2/nm | float32 | ✖️ | ✅ |
| DOWNWELLING_PAR | Downwelling photosynthetic available radiation | microMoleQuanta/m^2/sec | float32 | ✖️ | ✅ |
| CFC11 | CFC-11 Trichlorofluoromethane | picomole/kg | float32 | ✖️ | ✅ |
| CFC12 | CFC-12 Dichlorodifluoromethane | picomole/kg | float32 | ✖️ | ✅ |
| CFC113 | CFC-113 trichlorotrifluoroethane | picomole/kg | float32 | ✖️ | ✅ |
| SILICATE | Silicate | micromole/kg | float32 | ✖️ | ✅ |
| PHOSPHATE | Phosphate | micromole/kg | float32 | ✖️ | ✅ |
| TCO2 | TCO2 (dissolved inorganic carbon) | micromole/kg | float32 | ✖️ | ✅ |
| TOT_ALKALINITY | Total alkalinity | micromole/kg | float32 | ✖️ | ✅ |
| CCL4 | Carbon tetrachloride (CCl4) | picomole/kg | float32 | ✖️ | ✅ |
| SF6 | Sulfur hexafluoride (SF6) | femtomole/kg | float32 | ✖️ | ✅ |
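The JULD units use Argo's reference date (1950-01-01 00:00:00 UTC). If you need to move between fractional days since that date and a timestamp, for instance when comparing against a source file, a minimal standard-library sketch could look like the following. The function names are illustrative and not part of CrocoLake.

```python
from datetime import datetime, timedelta, timezone

# Reference date taken from the JULD units in the table above.
JULD_REFERENCE = datetime(1950, 1, 1, tzinfo=timezone.utc)


def juld_days_to_datetime(days):
    """Convert fractional days since 1950-01-01 UTC to a timezone-aware datetime."""
    return JULD_REFERENCE + timedelta(days=days)


def datetime_to_juld_days(when):
    """Convert a timezone-aware datetime to fractional days since 1950-01-01 UTC."""
    return (when - JULD_REFERENCE) / timedelta(days=1)


print(juld_days_to_datetime(0.5))  # 1950-01-01 12:00:00+00:00
```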
Important notes:
Each measured parameter <PARAM> also has a corresponding quality-control flag and error variable, named <PARAM>_QC (uint8) and <PARAM>_ERROR (float32), respectively. This applies to all variables except LATITUDE, LONGITUDE, JULD, and DB_NAME.
In the PHY version, the variable DATA_MODE (string) reports the recording mode of Argo measurements (‘R’ for real time, ‘A’ for adjusted, ‘D’ for delayed). This choice prioritizes including the best measurements available, although you should proceed with care when using real-time data for scientific analysis (see here and also here).
For the same purpose, the BGC version contains the <PARAM>_DATA_MODE variable for each measured parameter (except LATITUDE, LONGITUDE, JULD).
The database is stored as a table, where each row contains measurements at a point in the <DB_NAME, LATITUDE, LONGITUDE, JULD, PRES> space. If a variable was not measured at a point, it is generally set as missing using pandas’ NA value (pd.NA). This allows a consistent treatment of missing data across data types when generating CrocoLake. When reading CrocoLake, however, each language and package deals with missing data in its own way, so it is recommended that you familiarize yourself with the tools you are using and know how they compute statistics when missing data are included (relevant discussions for pandas here, and for dask here and here).
The data is generally stored using pyarrow as the backend, so the variables’ data types are pyarrow’s implementations. This should be of little (if any) relevance to most users, as common data analysis packages like pandas and dask are compatible with pyarrow’s data types (and can convert them to numpy dtypes if needed). Matlab and Julia also appear to handle them without issues.
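To make the naming rules in the notes above concrete, the sketch below derives the companion column names expected for a parameter. The helper `companion_columns` is hypothetical, and its exclusion set simply mirrors the exceptions listed in the notes; it is worth checking the result against the schema of the files you actually downloaded.

```python
# Columns that, per the notes above, have no companion variables.
NO_COMPANIONS = {"LATITUDE", "LONGITUDE", "JULD", "DB_NAME"}


def companion_columns(param, version="PHY"):
    """List the companion columns expected for `param` in the given version."""
    if param in NO_COMPANIONS:
        return []
    names = [f"{param}_QC", f"{param}_ERROR"]
    if version == "BGC":
        # The BGC version stores one data-mode column per measured parameter;
        # the PHY version has a single DATA_MODE column instead.
        names.append(f"{param}_DATA_MODE")
    return names


print(companion_columns("TEMP"))         # ['TEMP_QC', 'TEMP_ERROR']
print(companion_columns("DOXY", "BGC"))  # ['DOXY_QC', 'DOXY_ERROR', 'DOXY_DATA_MODE']
print(companion_columns("JULD", "BGC"))  # []
```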
Quality-control
CrocoLake only contains quality-controlled measurements, allowing you to focus on the analysis steps.
Some notes:
The quality-control flag <PARAM>_QC generally carries over the ‘good data’ value from the original source; when the flag is missing in the original source, it is set to 1 if the source data is deemed good (bad data are discarded).
For how quality-control is performed on each dataset, see the dedicated page.
Not all measurements have error estimates <PARAM>_ERROR, even if they are deemed of good quality. In these cases, the corresponding <PARAM>_ERROR value is represented as missing.
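Since missing values appear both for unmeasured variables and for absent error estimates, it can help to check how your tools treat them before computing statistics. The following is a small sketch with pandas on made-up values (not real CrocoLake data), using pandas' nullable `Float32` dtype as a stand-in for the stored types.

```python
import pandas as pd

# Made-up values for illustration only; pd.NA marks missing entries.
df = pd.DataFrame(
    {
        "TEMP": pd.array([10.0, 12.0, pd.NA], dtype="Float32"),
        "TEMP_ERROR": pd.array([0.5, pd.NA, pd.NA], dtype="Float32"),
    }
)

# By default pandas skips missing values when computing statistics.
mean_skipping_na = df["TEMP"].mean()

# With skipna=False a single missing value makes the result missing.
mean_with_na = df["TEMP"].mean(skipna=False)

# Count measurements that are present but come without an error estimate.
n_without_error = (df["TEMP"].notna() & df["TEMP_ERROR"].isna()).sum()

print(mean_skipping_na, mean_with_na, n_without_error)
```

Other packages and languages may choose different defaults, so the same check is worth repeating in whichever tool you use.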