Web-based innovation indicators may provide new insights into firm-level innovation activities. However, little is yet known about the accuracy and relevance of web-based information for measuring innovation. In this study, we use data on 4,487 firms from the Mannheim Innovation Panel (MIP) 2019, the German contribution to the European Community Innovation Survey (CIS), to analyze which website characteristics perform well as predictors of innovation activity at the firm level. Website characteristics are extracted using several data mining methods and used as features in different Random Forest classification models, which are compared against each other. Our results show that the most relevant website characteristics are textual content, the use of the English language, the number of subpages and the number of characters on a website. Furthermore, using several website characteristics jointly improves predictions of reported innovation activity by up to 20 percentage points compared with our baseline model. The results also indicate better performance in predicting product innovators and firms with innovation expenditures than in predicting process innovators.


Text as data, innovation indicators, machine learning