Solving computation challenges in crop yield forecasting

Updated: Jun 25

Allard de Wit, Rob Knapen, Eliya Buyukkaya, Sander Janssen.

Wageningen Environmental Research, June 2021

Regional crop monitoring and yield forecasting relies heavily on meteorological data, crop simulation modelling and statistics to forecast the expected crop yield and production in Europe and elsewhere in the world. New data sources from satellite observations and meteorological modelling offer opportunities to improve the predictive capabilities and to apply the system for new purposes such as field-level yield estimates. However, the computational challenges of building such applications are large. The CYBELE project aims to bring the power of high-performance computing (HPC) and machine learning to agriculture to improve existing applications and realize new ones.

Speeding up the current EU system

The current EU system for crop yield forecasting, operated by the Joint Research Centre (JRC), is heavily based on a relational database model. Although this approach is crucial for maintaining the integrity and consistency of the data, it also becomes a bottleneck for processing crop simulations: the highly normalized data structure does not allow fast loading into the data structures that are used for efficient distributed processing on compute clusters. Therefore, within the CYBELE project, we extended the relational database model with an approach that de-normalizes the static data in the relational model and stores it as JSON records in a NoSQL database (e.g. MongoDB). The resulting collection of JSON documents can then be loaded efficiently into Apache Spark data frames on the HPC.
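The de-normalization step described above can be illustrated with a minimal sketch. It is not the project's actual code: an in-memory SQLite database stands in for the relational database, and the table names, column names and values (sim_unit, crop, soil, tsum1, depth_cm) are hypothetical. The point is only that each simulation unit ends up as one self-contained JSON document, so no joins are needed when the data is loaded later.

```python
import json
import sqlite3


def make_demo_db():
    """Create a tiny normalized schema (hypothetical tables and made-up values)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE crop (crop_id INTEGER PRIMARY KEY, name TEXT, tsum1 REAL);
        CREATE TABLE soil (soil_id INTEGER PRIMARY KEY, name TEXT, depth_cm REAL);
        CREATE TABLE sim_unit (unit_id INTEGER PRIMARY KEY, crop_id INTEGER, soil_id INTEGER);
        INSERT INTO crop VALUES (1, 'wheat', 1000.0), (2, 'maize', 800.0);
        INSERT INTO soil VALUES (10, 'sandy', 120.0), (11, 'clay', 80.0);
        INSERT INTO sim_unit VALUES (100, 1, 10), (101, 2, 11);
        """
    )
    return conn


def build_documents(conn):
    """Join the normalized tables into one nested dict per simulation unit."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    docs = []
    units = conn.execute(
        "SELECT unit_id, crop_id, soil_id FROM sim_unit ORDER BY unit_id"
    ).fetchall()
    for unit in units:
        crop = conn.execute(
            "SELECT name, tsum1 FROM crop WHERE crop_id = ?", (unit["crop_id"],)
        ).fetchone()
        soil = conn.execute(
            "SELECT name, depth_cm FROM soil WHERE soil_id = ?", (unit["soil_id"],)
        ).fetchone()
        # Embed the related records instead of keeping foreign keys.
        docs.append({"_id": unit["unit_id"], "crop": dict(crop), "soil": dict(soil)})
    return docs


docs = build_documents(make_demo_db())
# One JSON string per document, ready to be written to a document store.
json_lines = [json.dumps(doc, sort_keys=True) for doc in docs]
```

Each entry in json_lines is a complete input record for one simulation unit. In a real setup these documents would be written to the NoSQL database rather than kept in a Python list.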

Besides solving the loading of the data into a distributed infrastructure, we developed an implementation of the WOFOST model that combines high numerical performance with Spark integration, allowing it to be “applied” to the Spark data frame. In this approach, the WOFOST model takes a row of the Spark data frame, unpacks its contents and uses them as input for running the simulation. Given the inherently distributed nature of Spark, this means that WOFOST simulations can be distributed across all available cluster nodes. Outputs from the WOFOST simulations are collected back into a Spark data frame, which allows further post-processing or exporting (e.g. into a NoSQL database) for future analysis.
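The "one row in, one result out" pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the WOFOST Spark implementation: a standard-library thread pool stands in for Spark, a toy degree-day accumulation stands in for WOFOST, and the row layout (unit_id, crop, weather) and all values are hypothetical. In PySpark, a function of this shape could be mapped over the rows of a data frame, for example through DataFrame.rdd.map.

```python
from concurrent.futures import ThreadPoolExecutor


def run_toy_model(row):
    """Unpack one input row and run a toy degree-day model (a stand-in, NOT WOFOST)."""
    tbase = row["crop"]["tbase"]        # base temperature (hypothetical parameter)
    tsum_target = row["crop"]["tsum"]   # temperature sum needed for maturity
    tsum = 0.0
    maturity_day = None
    for day, temp in enumerate(row["weather"]["tmean"], start=1):
        tsum += max(0.0, temp - tbase)  # only degrees above the base count
        if maturity_day is None and tsum >= tsum_target:
            maturity_day = day
    return {"unit_id": row["unit_id"], "tsum": tsum, "maturity_day": maturity_day}


# Two made-up rows; each carries everything its simulation needs.
rows = [
    {"unit_id": 1, "crop": {"tbase": 0.0, "tsum": 30.0},
     "weather": {"tmean": [10.0, 12.0, 8.0, 15.0]}},
    {"unit_id": 2, "crop": {"tbase": 5.0, "tsum": 100.0},
     "weather": {"tmean": [10.0, 12.0, 8.0, 15.0]}},
]

# The pool distributes rows over workers; map() returns results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_toy_model, rows))
```

Because every row is self-contained, the simulations do not depend on each other, so they can run on any worker in any order. That independence is what makes this kind of workload straightforward to distribute.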

Performance gains

Together with UBITech we tested the performance of our system on several configurations, ranging from high-performance laptops to a small Kubernetes cluster, to ensure that the system functions robustly on different architectures. Moreover, we compared it with the current approach used by JRC MARS, which is based on loading directly from a relational database and manually splitting tasks across crop simulation processes. The table below demonstrates that the approach we developed is already 6 times faster in raw processing speed than the current JRC MARS approach when run on equivalent hardware (green cells). This includes the export of JSON data from the relational database. On more modern hardware (Core i9 processor) a further 5x increase in speed is obtained.

Another important aspect of our approach is the ability to scale across multiple nodes. This was tested using a Spark cluster running on top of Kubernetes (figure 1). The results demonstrate that the system scales seamlessly over multiple nodes in the cluster, which supports our approach. In this particular test, the use of a sharded MongoDB database did not increase performance, but we suspect that the test set was still too small for sharding to be beneficial.

Figure 1. Scaling behaviour of WOFOST Spark, including the option to use a sharded MongoDB.


The results from our demonstrator show that our approach successfully increases performance for processing crop simulations for operational yield forecasting. The system already delivers a 6-fold increase in raw performance and can scale across compute nodes by leveraging the underlying Spark system. Since most of the data in the MongoDB database is static, only updates to the weather data are required in an operational setting, which makes the system easy to manage and operationalize. Further research is now focused on studying the scaling behaviour and finding efficient ways to post-process the model outputs.