@kiru not on my actual TODO list but maybe next year for qa catalogue: #K10plus is only a part of the ~225 million MARC records in K10plus Zentral: https://verbundwiki.gbv.de/display/VZG/K10plus-Zentral
@nichtich Speaking of which, it would be great to have an English Wikipedia article about K10plus. I was looking the other day and the information about it seemed rather scattered.
https://phabricator.wikimedia.org/T336298#8890900
@nemobis Ok, I wrote an English Wikipedia article about #k10plus https://en.wikipedia.org/wiki/K10plus
@nichtich Wow, I was not aware of it. 225 million is quite a large number; it would be worth applying parallelisation with Spark.
@kiru Does QA Catalogue run with any parallelisation at all? The CPU cores in my VM were not saturated but I have not looked deeper.
@nichtich By default, no. Some years ago I worked with it intensively, but there have been lots of changes in the code since then, so now I am not sure whether it still works. Here are the details: http://pkiraly.github.io/2018/01/18/marc21-in-spark/. The description mentions Hadoop and Spark, but Hadoop is not necessary.
I don't know if Spark is required; this probably depends on the analysis task. I opened an issue on parallel execution in general: https://github.com/pkiraly/qa-catalogue/issues/278