Brief description of subject: This internship project tackles the problem of efficiently inferring schemas from huge JSON collections. To this end we aim to use Spark, a recent system for general-purpose, large-scale data processing. Spark runs programs written in Scala, a programming language particularly well suited to the symbolic manipulations performed by schema inference algorithms. Spark also outperforms Hadoop MapReduce in many contexts, and we expect this to hold in our setting. Particular attention will be dedicated to inferring multiple schemas at different levels of precision, and to letting the user choose the preferred precision level interactively while exploring the data sets.
Link to details: http://www.lamsade.dauphine.fr/~colazzo/stages/json.pdf
Duration: 5-6 months
Led by: Dario Colazzo
Web page: http://www.lamsade.dauphine.fr/~colazzo/
Laboratory/Host Organisation: LAMSADE
Remarks: Co-leader: Carlo Sartiani, Università della Basilicata (Italy)