1. Climate Informatics

Claire Monteleoni, Department of Computer Science, George Washington University


1.11 Data Challenges and Opportunities in Climate Informatics

1.11.1 Issues with Cross-Class Comparisons

There is often a need to compare across different classes of data, whether to provide ground truth for a satellite retrieval, to evaluate a climate model prediction, or to calibrate a proxy measurement. But because of the different characteristics of the data, comparing 'apples to apples' can be difficult.

One of the recurring issues is the difference between internal variability (or weather) and climate responses tied to a specific external forcing. The internal variability is a function of the chaotic dynamics of the atmosphere and cannot be predicted over time periods longer than ten days or so. This variability, which can exist on all time scales, is present in climate models as well, but because of the sensitive dependence on initial conditions, each individual simulation will contain a different realization of the internal variability. Climate changes are then effectively defined as the ensemble mean response (i.e., after averaging out any internal variability). Thus any single realization (such as the real-world record) must be thought of as a forced signal (driven by external drivers) combined with a stochastic weather component.
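This decomposition can be illustrated with a toy ensemble in Python. All numbers below are synthetic: a linear trend stands in for the forced response, and AR(1) red noise stands in for internal variability; nothing here reflects any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: 20 simulations of a 100-year annual temperature
# anomaly series, each sharing the same forced trend but carrying a
# different realization of internal variability (modeled as red noise).
n_members, n_years = 20, 100
years = np.arange(n_years)
forced = 0.01 * years  # imposed external forcing: 1 K per century

def red_noise(n, phi=0.6, sigma=0.15, rng=rng):
    """AR(1) noise as a stand-in for internal (chaotic) variability."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

ensemble = np.array([forced + red_noise(n_years) for _ in range(n_members)])

# The forced response is estimated as the ensemble mean; each member's
# residual is that member's realization of internal variability.
forced_estimate = ensemble.mean(axis=0)
internal = ensemble - forced_estimate

# Averaging over members shrinks the noise roughly as 1/sqrt(n_members),
# so the ensemble mean tracks the imposed trend far better than any
# single member (just as any single realization, including the real
# world, mixes forced signal with weather).
err_member = np.abs(ensemble[0] - forced).mean()
err_mean = np.abs(forced_estimate - forced).mean()
print(err_mean < err_member)
```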

The internal variability increases in relative magnitude as the time or spatial scale decreases. Comparisons of the specific time evolution of the climate system therefore need either to take the variability into account or to use specific techniques to minimize the difference from the real world. For instance, 'nudged' simulations use observed winds from the reanalyses to keep the weather in the model loosely tied to the observations. Simulations using the observed ocean temperatures as a boundary condition can do a good job of synchronizing the impacts of ocean variability on the atmospheric fields. Another way to minimize the impact of internal variability is to look for property correlations that focus on specific processes, which, though they may occur at different points in time or space, can nonetheless be compared across models and observations.

Another issue is that model output does not necessarily represent the exact topography or conditions associated with an in-situ observation. The average height of a specific grid box might not correspond to the height of a mountain-based observing platform, or the resolved shape of the coastline might change the distance of a station to the shore by 200 km or so. These issues can be alleviated to some extent if comparisons are focused on large-scale gridded data. Another technique is to 'downscale' the model output to specific locations, either statistically (based on observed correlations of a local record with larger-scale features of the circulation) or dynamically (using an embedded regional climate model, RCM). These methods have the potential to correct for biases in the large-scale model, but many practical issues remain in assessing by how much they do so.
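Statistical downscaling in its simplest form is a regression of the local record on a large-scale predictor over a training period, then applied to model output. The sketch below uses entirely synthetic data (a standardized grid-box series and a station record constructed from it), so the fitted coefficients are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a coarse model grid-box temperature series (the
# large-scale predictor) and a co-located station record that tracks it
# with a local offset, amplitude, and noise.
n = 500
grid_box = rng.normal(0.0, 1.0, n)  # standardized large-scale field
station = 2.0 + 0.8 * grid_box + rng.normal(0.0, 0.3, n)

# Fit the local record against the large-scale feature on a training
# period, then predict local conditions over a held-out period.
train = slice(0, 400)
A = np.vstack([np.ones(400), grid_box[train]]).T
coef, *_ = np.linalg.lstsq(A, station[train], rcond=None)
intercept, slope = coef

predicted = intercept + slope * grid_box[400:]
rmse = np.sqrt(np.mean((predicted - station[400:]) ** 2))
print(round(slope, 2), round(rmse, 2))
```

A dynamical alternative would replace the regression with an embedded regional model; the statistical route shown here is cheap but inherits any bias in the large-scale predictor.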

Finally, observations are a function of a specific observing methodology, which encompasses technology, practice, and opportunity. These factors can impart a bias or skewness to the observation relative to what the real world may nominally be doing. Examples in satellite remote sensing are common: a low-cloud record from a satellite will only be able to see low clouds when there are no high clouds above, for instance. Similarly, a satellite record of 'mid-tropospheric' temperatures might actually be a weighted integral of temperatures from the surface to the stratosphere. A paleo-climate record may be of a quantity that, while related to temperature or precipitation, is a complex function of both, weighted towards a specific season. In all these cases, it is often advisable to create a 'forward model' of the observational process itself to post-process the raw simulation output into more commensurate diagnostics.
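A minimal sketch of such a forward model for the mid-tropospheric temperature example: apply a vertical weighting function to the model's temperature profile and compare the resulting weighted integral, rather than any single-level temperature, to the satellite record. The profile and weights below are toy values, not those of any real instrument.

```python
import numpy as np

# Toy model output: temperature on pressure levels from the surface
# (1000 hPa) to the stratosphere (100 hPa), cooling linearly with height.
pressure = np.linspace(1000.0, 100.0, 19)          # hPa
temperature = 288.0 - 0.065 * (1000.0 - pressure)  # K, illustrative lapse

# Illustrative weighting function peaking in the mid-troposphere
# (~500 hPa); a real channel's weights would come from radiative
# transfer calculations.
weights = np.exp(-((pressure - 500.0) / 200.0) ** 2)
weights /= weights.sum()

# The simulated 'retrieval' is the weighted vertical integral -- this is
# the quantity commensurate with the satellite record.
simulated_retrieval = np.sum(weights * temperature)
print(round(simulated_retrieval, 1))
```

Note that the result sits close to the 500 hPa temperature but is not identical to it, because the weighting function samples the whole column.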

1.11.2 Climate System Complexity

A further issue arises in creating statistical models of the climate system because both the real world and dynamical models have a large number of different physical variables. Even simplified models can have hundreds of variables, and while not all of them are essential to determining the state of the system, one variable is frequently not sufficient. Land, air, and ocean processes all have different dominant time scales, and thus different components are essential at different scales. Some physical understanding is thus necessary to make the proper variable/data choices, even with analysis schemes that extract structure from large datasets. Furthermore, these systems are chaotic, i.e., initial conditions that are indistinguishable from each other in any given observing system will diverge greatly from each other on some short timescale. Thus extracting useful predictions requires more than creating more accurate models: one needs to determine which aspects are predictable and which are not.
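The divergence of indistinguishable initial conditions can be demonstrated with the Lorenz-63 system, a classic toy model of chaotic atmospheric dynamics (the crude forward-Euler integrator below is for illustration only):

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

# Two initial conditions that any plausible observing system would call
# identical: they differ by 1e-8 in a single component.
a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-8, 0.0, 0.0])

max_sep = 0.0
for _ in range(3000):  # integrate 30 model time units
    a = lorenz_step(a)
    b = lorenz_step(b)
    max_sep = max(max_sep, float(np.linalg.norm(a - b)))

# The tiny initial difference grows exponentially until the trajectories
# are as far apart as the attractor allows.
print(max_sep > 1.0)
```

The practical consequence is the one stated above: beyond some lead time, only statistical properties of the trajectory (the attractor), not its specific path, are predictable.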

1.11.3 Challenge: Cloud-based Reproducible Climate Data Analysis


Science requires reproducible results: science is a body of work where the community strives to ensure that results do not stem from the unique abilities and circumstances of one particular person or group. Traditionally this has been done in large part by publishing papers, but the scale of modern climate modeling and data analysis efforts has far outstripped the ability of a journal article to convey enough information to allow reproducibility. This is an issue both of size and of complexity: model results are much larger than can be conveyed in a few pages, and both models and analysis procedures are too complex to be adequately described there.

The sheer size of general circulation model (GCM) and satellite datasets is also outstripping our traditional data storage and distribution methods: frequently only a few variables from a model's output are saved and distributed at high resolution, and the remaining model output is heavily averaged to generate datasets that are sufficiently small.

One promising approach to addressing these problems is cloud-based reproducible climate data analysis. Having both the data and the analyses resident in the computational cloud allows the details of the computation to be hidden from the user, so, for example, data-intensive portions of the computation could be executed close to where the data resides. But these analyses must be reproducible, which brings not only technical challenges of archiving and finding/describing/publishing analysis procedures, but also institutional challenges of ensuring that the large datasets that form the basis of these analyses remain accessible.

Data Scale. The size of datasets is rapidly outstripping our ability to store and serve the data. We have difficulty storing even a single copy of the complete model results, and making complete copies of those results and distributing them for analysis is a large undertaking that limits analysis to the few places with data storage facilities of that scale. Analysis done by the host prior to distribution, such as averaging, reduces the size to something more manageable, but currently those reductions are chosen far in advance, and there are many other useful analyses.

A cloud-based analysis framework would allow such reductions to be chosen and still executed on machines with fast access to the data.
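The kind of host-side reduction described above can be sketched as a block average over a gridded field; array shapes and coarsening factors here are purely illustrative.

```python
import numpy as np

# Illustrative high-resolution field: (time, lat, lon) at monthly,
# 1-degree resolution for a decade.
field = np.random.default_rng(2).normal(size=(120, 180, 360))

def block_average(data, f_lat=4, f_lon=4):
    """Average non-overlapping f_lat x f_lon spatial blocks of a
    (time, lat, lon) array; grid dimensions must divide evenly."""
    t, ny, nx = data.shape
    return data.reshape(t, ny // f_lat, f_lat, nx // f_lon, f_lon).mean(axis=(2, 4))

reduced = block_average(field)
print(reduced.shape, field.nbytes // reduced.nbytes)  # → (120, 45, 90) 16
```

A 4x4 spatial average yields a 16-fold size reduction, but any analysis needing the original resolution (extremes, sharp gradients, coastlines) is lost, which is exactly why fixing the reductions in advance is limiting.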



Reproducibility and Provenance Graphs. A cloud-based analysis framework would have to generate reproducible, documented results, i.e., we would need not only the ability to rerun a calculation and know that it would generate the same results, but also to know precisely what analysis had been done. This could be achieved in part by having standardized analysis schemes, so that one could be sure precisely what was calculated in a given data filter; equally important is systematically tracking the full provenance of the calculation. This provenance graph, showing the full network of data filters and initial, intermediate, and final results, would provide the basis of both reproducibility and communication of results: the provenance graphs provide the information necessary to rerun a calculation and get the same results; they also provide the basis of the full documentation of the results. This full network would have to have layers of abstraction so that the reader could start with an overall picture and then proceed to more detailed versions as needed.
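A minimal sketch of such provenance tracking, with entirely illustrative names (no existing framework is implied): each analysis step records its named operation, its parameters, and the hashes of its inputs, which is enough both to rerun the calculation and to document what was computed from what.

```python
import hashlib
import json

def content_hash(obj):
    """Deterministic short hash of a JSON-serializable description."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

class Provenance:
    """Toy provenance graph: nodes are source data or analysis steps,
    keyed by a hash of their full description."""

    def __init__(self):
        self.nodes = {}  # hash -> {"op", "params", "inputs", "value"}

    def source(self, name, value):
        h = content_hash({"op": "source", "name": name})
        self.nodes[h] = {"op": "source", "params": {"name": name},
                         "inputs": [], "value": value}
        return h

    def apply(self, op_name, func, inputs, **params):
        result = func(*[self.nodes[h]["value"] for h in inputs], **params)
        h = content_hash({"op": op_name, "params": params, "inputs": inputs})
        self.nodes[h] = {"op": op_name, "params": params,
                         "inputs": inputs, "value": result}
        return h

# A tiny analysis chain: raw series -> anomalies -> mean anomaly.
prov = Provenance()
raw = prov.source("toy_station_series", [21.0, 22.5, 19.0, 20.5])
anom = prov.apply("anomaly", lambda x: [v - sum(x) / len(x) for v in x], [raw])
mean = prov.apply("mean", lambda x: sum(x) / len(x), [anom])

# Each node documents exactly what was computed and from which inputs:
print(prov.nodes[mean]["op"], prov.nodes[mean]["inputs"] == [anom])  # → mean True
```

Layers of abstraction, as described above, would then amount to collapsing subgraphs of this network into single summary nodes for the reader.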
1.12 Conclusion

The goal of this chapter was to inspire future work in the nascent field of Climate Informatics. We hope that this chapter will inspire work not only on some of the challenge problems proposed here but also on new problems. A profusion of climate data of various types is available, providing a rich and fertile playground for future data mining and machine learning research. Even exploratory data analysis could prove useful for accelerating discovery in Climate Informatics. To that end, following the First International Workshop on Climate Informatics, we have prepared a Climate Informatics wiki with links to climate data along with descriptions, challenge problems, and tutorials on machine learning techniques [14]. A myriad of collaborations are possible at the intersection of climate science and data mining, machine learning, and statistics. We hope our work will encourage progress on a range of emerging problems in Climate Informatics.




Acknowledgements

The First International Workshop on Climate Informatics served as an inspiration for this chapter, and some of these topics were discussed there. The workshop sponsors were: LDEO/GISS Climate Center, Columbia University; Information Science and Technology Center, Los Alamos National Laboratory; NEC Laboratories America; Department of Statistics, Columbia University; Yahoo! Labs; The New York Academy of Sciences.

KS was supported in part by NSF Grant 1029711.

MKT and MBB are supported by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (NOAA, NA05OAR4311004). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its sub-agencies.

AB was supported in part by NSF grants IIS-1029711, IIS-0916750, and IIS-0812183, and NSF CAREER award IIS-0953274.

ARG’s research reported here has been financially supported by the Oak Ridge National Laboratory and Northeastern University grants, as well as the National Science Foundation award 1029166, in addition to funding from the US Department of Energy and the Department of Science and Technology of the Government of India.

The work of JES was supported in part by NSF grant ATM0902436 and by NOAA grants NA07OAR4310060 and NA10OAR4320137.

MT would like to acknowledge the NSF grant ARC 0909388.


References

  1. C. M. Ammann, F. Joos, D. S. Schimel, B. L. Otto-Bliesner, and R. A. Tomas. Solar influence on climate during the past millennium: Results from transient simulations with the NCAR Climate System Model. Proc. U. S. Natl. Acad. Sci., 104(10):3713-3718, 2007.

  2. K. J. Anchukaitis, M. N. Evans, A. Kaplan, E. A. Vaganov, M. K. Hughes, and H. D. Grissino-Mayer. Forward modeling of regional scale tree-ring patterns in the southeastern United States and the recent influence of summer drought. Geophys. Res. Lett., 33:L04705, 2006. doi:10.1029/2005GL025050.

  3. A. G. Barnston and T. M. Smith. Specification and prediction of global surface temperature and precipitation from global SST using CCA. J. Climate, 9:2660–2697, 1996.

  4. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

  5. Christopher S. Bretherton, Catherine Smith, and John M. Wallace. An intercomparison of methods for finding coupled patterns in climate data. J. Climate, 5:541–560, 1992.

  6. Brohan, P., J.J. Kennedy, I. Harris, S.F.B. Tett, and P.D. Jones. Uncertainty estimates in regional and global observed temperature changes: A new dataset from 1850. J. Geophys. Res. 111, D12106, 2006.

  7. Buckley, B.M., K.J. Anchukaitis, D. Penny, et al. Climate as a contributing factor in the demise of Angkor, Cambodia. Proc. Nat. Acad. Sci. USA 107, 6748-6752, 2010.

  8. S. J. Camargo and A. G. Barnston. Experimental seasonal dynamical forecasts of tropical cyclone activity at IRI. Wea. Forecasting, 24:472–491, 2009.

  9. S. J. Camargo, A. W. Robertson, A. G. Barnston, and M. Ghil. Clustering of eastern North Pacific tropical cyclone tracks: ENSO and MJO effects. Geochem. Geophys. and Geosys., 9:Q06V05, 2008. doi:10.1029/2007GC001861.

  10. M.A. Cane, S.E. Zebiak, and S.C. Dolan. Experimental forecasts of El Niño. Nature, 321:827–832, 1986.

  11. Cesa-Bianchi, N. and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

  12. V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal of Optimization, 21(2), 2011.

  13. B. Christiansen, T. Schmith, and P. Thejll. A surrogate ensemble study of climate reconstruction methods: Stochasticity and robustness. J. Climate, 22(4):951-976, 2009.

  14. Climate Informatics wiki: http://sites.google.com/site/1stclimateinformatics/materials

  15. Cook, E.R., R. Seager, M.A. Cane, and D.W. Stahle. North American drought: Reconstructions, causes, and consequences. Earth Science Reviews 81, 93-134, 2007.

  16. Dee, D.P., S.M. Uppala, A.J. Simmons, et al. The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Quart. J. Roy. Meteorol. Soc. 137, 553-597, 2011.

  17. T. DelSole and M. K. Tippett. Average Predictability Time: Part I. Theory. J. Atmos. Sci., 66:1188-1204, 2009.

  18. T. DelSole, M. K. Tippett, and J. Shukla. A significant component of unforced multidecadal variability in the recent acceleration of global warming. J. Climate, 24:909-926, 2011.

  19. J. F. Donges, Y. Zou, N. Marwan, and J. Kurths. The backbone of the climate network. EPL (Europhysics Letters), 87(4):48007, 2009.

  20. R. Donner, S. Barbosa, J. Kurths, and N. Marwan. Understanding the Earth as a Complex System – recent advances in data analysis and modeling in Earth sciences. European Physical Journal Special Topics, 174:1-9, 2009.

  21. M. N. Evans, A. Kaplan, and M. A. Cane. Pacific sea surface temperature field reconstruction from coral δ18O data using reduced space objective analysis. Paleoceanography, 17, 2002.

  22. M. N. Evans, B. K. Reichert, A. Kaplan, K. J. Anchukaitis, E. A. Vaganov, M. K. Hughes, and M. A. Cane. A forward modeling approach to paleoclimatic interpretation of tree-ring data. J. Geophys. Res., 111(G3), 2006.

  23. J. A. Foley, M. T. Coe, M. Scheffer, and G. Wang. Regime Shifts in the Sahara and Sahel: Interactions between Ecological and Climatic Systems in Northern Africa. Ecosystems, 6:524-532, 2003.

  24. Foster, G., J.D. Annan, G.A. Schmidt, and M.E. Mann. Comment on "Heat capacity, time constant, and sensitivity of Earth's climate system" by S.E. Schwartz. J. Geophys. Res. 113, D15102, 2008.

  25. J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Preprint, 2010.

  26. Getoor, L. and B. Taskar (eds). Introduction to Statistical Relational Learning. MIT Press, 2007.

  27. Gifford, C.M. Collective Machine Learning: Team Learning and Classification in Multi-Agent Systems. PhD Dissertation, University of Kansas, 2009.

  28. F. Giorgi and N. Diffenbaugh. Developing regional climate change scenarios for use in assessment of effects on human health and disease. Clim. Res., 36:141-151, 2008.

  29. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Washington D.C., third edition, 1996.

  30. J F González-Rouco, H. Beltrami, E. Zorita, and H. Von Storch. Simulation and inversion of borehole temperature profiles in surrogate climates: Spatial distribution and surface coupling. Geophys. Res. Lett., 33(1), 2006.

  31. W.M. Gray. Atlantic seasonal hurricane frequency. Part I: El Niño and 30-mb quasi-biennial oscillation influences. Mon. Wea. Rev., 112:1649–1688, 1984.

  32. Arthur M. Greene, Andrew W. Robertson, Padhraic Smyth, and Scott Triglia. Downscaling forecasts of Indian monsoon rainfall using a nonhomogeneous hidden Markov model. Quart. J. Royal Meteor. Soc., 137:347–359, 2011.

  33. Hansen, J., R. Ruedy, M. Sato, and K. Lo. Global surface temperature change. Rev. Geophys. 48, RG4004, 2010.

  34. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

  35. Hegerl, G.C., T.J. Crowley, M. Allen, et al. Detection of human influence on a new, validated 1500-year temperature reconstruction. J. Climate 20, 650-666, 2007.

  36. Hegerl, G.C., F.W. Zwiers, P. Braconnot, et al. Understanding and attributing climate change. Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, S. Solomon, et al. (eds), Cambridge University Press, 2007.

  37. M. Hoerling, J. Hurrell, J. Eischeid, and A. Phillips. Detection and Attribution of Twentieth-Century Northern and Southern African Rainfall Change. Journal of Climate, 19(16):3989-4008, August 2006.

  38. Solomon M. Hsiang, Kyle C. Meng, and Mark A. Cane. Civil conflicts are associated with the global climate. Nature, 476:438–441, 2011.

  39. IDAG (International ad hoc Detection and Attribution Group). Detecting and attributing external influences on the climate system: A review of recent advances. J. Clim. 18, 1291-1314, 2005.

  40. IPCC (Intergovernmental Panel on Climate Change). Expert Meeting on Assessing and Combining Multi Model Climate Projections: Good Practice Guidance Paper on Assessing and Combining Multi Model Climate Projections, R. Knutti, et al., 2010.

  41. Jones, P.D., K.R. Briffa, T.J. Osborn, et al. High-resolution palaeoclimatology of the last millennium: a review of current status and future prospects. The Holocene 19, 3-49, 2009.

  42. Kaplan A., M.A. Cane, and Y. Kushnir. Reduced space approach to the optimal analysis interpolation of historical marine observations: Accomplishments, difficulties, and prospects. In Advances in the Applications of Marine Climatology: The Dynamic Part of the WMO Guide to the Applications of Marine Climatology, pages 199-216, Geneva, Switzerland, 2003. World Meteorological Organization.

  43. J. Kawale, S. Liess, A. Kumar, et al. Data guided discovery of dynamic dipoles. In Proceedings of the NASA Conference on Intelligent Data Understanding, 2011.

  44. Keenlyside, N.S., M. Latif, J. Jungclaus, L. Kornblueh, and E. Roeckner. Advancing decadal-scale climate prediction in the North Atlantic Sector. Nature 453, 84-88, 2008.

  45. Kennedy, J.J., N.A. Rayner, R.O. Smith, D.E. Parker, and M. Saunby. Reassessing biases and other uncertainties in sea surface temperature observations measured in situ since 1850: 1. Measurement sampling uncertainties. J. Geophys. Res. 116, D14103, 2011.

  46. Knutti, R., G.A. Meehl, M.R. Allen, and D.A. Stainforth. Constraining climate sensitivity from the seasonal cycle in surface temperature. J. Clim. 19, 4224-4233, 2006.

  47. Koller, D. and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

  48. Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.

  49. R Sari Kovats, Menno J Bouma, Shakoor Hajat, Eve Worrall, and Andy Haines. El Niño and health. The Lancet, 362:1481–1489, 2003.

  50. V. M. Krasnopolsky and M. S. Fox-Rabinovitz. Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19(2):122–134, 2006.

  51. Lee, T.C.K., F.W. Zwiers, and M. Tsao. Evaluation of proxy-based millennial reconstruction methods. Climate Dyn. 31, 263-281, 2008.

  52. B. Li, D.W. Nychka, and C.M. Ammann. The value of multiproxy reconstruction of past climate. J. Am. Stat. Assoc., 105:883–895, 2010.

  53. Carlos H. R. Lima, Upmanu Lall, Tony Jebara, and Anthony G. Barnston. Statistical prediction of ENSO from subsurface sea temperature using a nonlinear dimensionality reduction. J. Climate, 22:4501–4519, 2009.

  54. Lu, Z. and T.K. Leen. Semi-supervised Learning with Penalized Probabilistic Clustering. In Advances in Neural Information Processing Systems, MIT Press, 2005.

  55. M. E. Mann, R. S. Bradley, and M. K. Hughes. Northern hemisphere temperatures during the past millennium: Inferences, uncertainties, and limitations. Geophys. Res. Lett., 26:759-762, 1999.

  56. M. E. Mann, S. Rutherford, E. Wahl, and C. Ammann. Testing the fidelity of methods used in proxy-based reconstructions of past climate. J. Climate, 18:4097-4107, 2005.

  57. M. E. Mann, S. Rutherford, E. Wahl, and C. Ammann. Robustness of proxy-based climate field reconstruction methods. J. Geophys. Res., 112(D12109), 2007.

  58. Mann, M.E., Z. Zhang, M.K. Hughes, et al. Proxy-based reconstructions of hemispheric and global surface temperature variations over the past two millennia. Proc. Nat. Acad. Sci. USA 105, 13252-13257, 2008.

  59. Mann, M.E., Z. Zhang, S. Rutherford, et al. Global signatures and dynamical origins of the Little Ice Age and Medieval Climate Anomaly. Science 326, 1256-1260, 2009.

  60. Mearns, L.O., W.J. Gutowski, R. Jones, et al. A regional climate change assessment program for North America. EOS 90, 311-312, 2009.

  61. Meehl, G.A., T.F. Stocker, W.D. Collins, et al. Global climate projections. Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, S. Solomon, et al. (eds), Cambridge University Press, 2007.

  62. Menne, M.J., C.N. Williams Jr., and M.A. Palecki. On the reliability of the U.S. surface temperature record. J. Geophys. Res. 115, D11108, 2010.

  63. Monteleoni, C., G.A. Schmidt, S. Saroha, and E. Asplund. Tracking climate models. Statistical Analysis and Data Mining 4, 372-392, 2011.

  64. Murphy, J.M., B.B. Booth, M. Collins, et al. A methodology for probabilistic predictions of regional climate change from perturbed physics ensembles. Phil. Trans. Roy. Soc. A 365, 2053-2075, 2007.

  65. G. T. Narisma, J. A. Foley, R. Licker, and N. Ramankutty. Abrupt changes in rainfall during the twentieth century. Geophysical Research Letters, 34:L06710, March 2007.

  66. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. arXiv, 2010. http://arxiv.org/abs/1010.2731v1.

  67. Owhadi, H., J.C. Scovel, T. Sullivan, M. McKerns, and M. Ortiz. Optimal Uncertainty Quantification. SIAM Review, 2011 (submitted).

  68. Roger D. Peng, Jennifer F. Bobb, Claudia Tebaldi, Larry McDaniel, Michelle L. Bell, and Francesca Dominici. Toward a quantitative estimate of future heat wave mortality under global climate change. Environ Health Perspect, 119, 2010.

  69. Powell, W.B., and P. Frazier. Optimal Learning. In Tutorials in Operations Research: State-of-the-art decision making tools in the Information Age. Hanover, MD, 2008.
