Assuring high quality of data in CYBELE
CYBELE aims to handle numerous datasets of different volumes, types and formats coming from many different sources, such as agriculture, aquaculture, livestock farming and meteorology. To obtain meaningful, value-adding results from the analysis of these datasets, high data quality is essential. In this context, the datasets collected by CYBELE undergo various processes before their storage in the CYBELE repositories for analytical and other purposes. Together, these processes constitute the Data Check-In phase of the CYBELE platform.
More specifically, a dataset is first ingested, either in real time or in batch form, in one of several ways: it may, for example, be collected through a web link or an API, or be uploaded as a file by its data provider. The attributes of the dataset are then mapped to the Common Semantic Model, which constitutes a common language among the different datasets residing in CYBELE. Moreover, smart linking with other datasets in the CYBELE platform may be applied indirectly through the model in order to create new, further enriched datasets.
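The attribute-mapping step above can be sketched as a simple renaming of a provider's fields to common-model terms. All attribute names below are illustrative assumptions, not the actual vocabulary of the CYBELE Common Semantic Model:

```python
# Hypothetical mapping from a provider's column names to common-model terms
ATTRIBUTE_MAP = {
    "temp_c": "airTemperature",
    "hum_pct": "relativeHumidity",
    "ts": "observationTime",
}

def map_to_common_model(record: dict) -> dict:
    """Rename known attributes; keep unknown ones under their original name."""
    return {ATTRIBUTE_MAP.get(key, key): value for key, value in record.items()}

raw = {"temp_c": 21.5, "hum_pct": 64, "ts": "2020-05-01T12:00:00Z", "site": "farm-3"}
mapped = map_to_common_model(raw)
# Records from different providers now share attribute names, which is what
# makes smart linking across datasets possible.
```

Once two datasets expose the same common-model names, joining or linking them reduces to matching on those shared attributes.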
As a dataset may be subject to data privacy legislation, it shall also pass through anonymization checks, when necessary, in order to protect any sensitive or personal data. Many different anonymization methods can be applied, depending on the needs of the data provider, such as generalization, pseudonymization, permutation, perturbation and differential privacy.
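Two of the methods named above can be sketched in a few lines: pseudonymization (replacing a direct identifier with a salted hash) and generalization (coarsening an exact value into a range). The record fields and the salt are invented for illustration:

```python
import hashlib

def pseudonymize(identifier: str, salt: str = "cybele-demo-salt") -> str:
    """Pseudonymization: replace a direct identifier with a salted hash."""
    return hashlib.sha256((salt + identifier).encode()).hexdigest()[:16]

def generalize_age(age: int) -> str:
    """Generalization: coarsen an exact age into a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

record = {"farmer_id": "GR-12345", "age": 47, "yield_t_ha": 6.2}
anonymized = {
    "farmer_id": pseudonymize(record["farmer_id"]),  # stable but not reversible by eye
    "age": generalize_age(record["age"]),            # "40-49" instead of 47
    "yield_t_ha": record["yield_t_ha"],              # non-personal value kept as-is
}
```

Note that pseudonymization keeps records linkable (the same input always yields the same token), which is often what analytical workflows need, while generalization trades precision for privacy.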
Afterwards, the dataset goes through quality checks, where it is validated against a set of rules. If these checks reveal errors, such as missing or incorrect values, the dataset undergoes a curation process according to the provider’s preferences. Curation methods that address such errors include data replacement, data reduction, data normalization, missing data imputation, duplicate data elimination and data parsing; they improve data quality in terms of completeness, consistency, conformity, accuracy, integrity and timeliness.
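A minimal sketch of this validate-then-curate loop, assuming rule and default definitions supplied by the provider, combining two of the curation methods listed above (duplicate data elimination and missing data imputation):

```python
def validate(record: dict, rules: dict) -> list:
    """Return the names of fields that violate the validation rules."""
    return [field for field, check in rules.items() if not check(record.get(field))]

def curate(records: list, rules: dict, defaults: dict) -> list:
    """Drop exact duplicates and impute invalid/missing values with defaults."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:                               # duplicate data elimination
            continue
        seen.add(key)
        for field in validate(rec, rules):
            rec = {**rec, field: defaults[field]}     # missing data imputation
        cleaned.append(rec)
    return cleaned

# Example rule: soil moisture must be a number between 0 and 100 percent
rules = {"moisture": lambda v: isinstance(v, (int, float)) and 0 <= v <= 100}
defaults = {"moisture": 0.0}
data = [{"moisture": 55}, {"moisture": 55}, {"moisture": None}]
print(curate(data, rules, defaults))  # -> [{'moisture': 55}, {'moisture': 0.0}]
```

In practice the rules and default values would come from the provider's preferences, and a failed check could equally trigger data replacement or reduction instead of imputation.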
The data provider then assigns metadata to the dataset, covering its licensing details and the corresponding data access policy. In addition, the dataset is enriched with additional attributes that will facilitate the subsequent analytical processes.
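A metadata record of this kind could be modelled as follows; the field names and example values are hypothetical, not the actual CYBELE metadata schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    dataset_id: str
    license: str              # licensing details, e.g. "CC-BY-4.0"
    access_policy: str        # e.g. "private", "consortium", "public"
    extra_attributes: dict = field(default_factory=dict)  # enrichment attributes

meta = DatasetMetadata(
    dataset_id="aqua-feeding-2020",
    license="CC-BY-4.0",
    access_policy="consortium",
    extra_attributes={"crop": "wheat", "country": "GR"},
)
```

Keeping licensing and access policy alongside the enrichment attributes lets the platform enforce who may query the dataset while the extra attributes support later analytics.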
Finally, the collected dataset is registered in the CYBELE platform, ready for analytical use with its data quality assured by all the previous steps. As the Data Check-In phase covers the whole data lifecycle, any future modifications to the data and/or metadata of a dataset ingested in CYBELE will trigger the necessary Data Check-In processes in order to accommodate these changes and manage the versioning of the dataset, thus ensuring its veracity and timeliness.
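One simple way to sketch this modification-triggered versioning, assuming an in-memory registry keyed by dataset identifier (the real platform would persist this):

```python
def register_version(registry: dict, dataset_id: str, payload: dict) -> int:
    """Append a new version entry when the data or metadata has changed."""
    versions = registry.setdefault(dataset_id, [])
    if versions and versions[-1] == payload:
        return len(versions)      # nothing changed: keep the current version
    versions.append(payload)      # a modification triggers a new version
    return len(versions)          # version number of the latest entry

registry = {}
register_version(registry, "ds1", {"rows": 100})       # initial check-in -> v1
v = register_version(registry, "ds1", {"rows": 120})   # modified data  -> v2
```

Retaining every version rather than overwriting in place is what preserves the dataset's veracity: any analysis can state exactly which version it consumed.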