What drives SME formalization in Tanzania? An assessment using machine learning techniques

Identifying the motives for business formalization is important for policy-making and for the smooth coordination of entrepreneurs in Tanzania. This paper employs four machine learning (ML) models to investigate the effects of several contingent and institutional factors on SME formalization in Tanzania. Using the updated World Bank Enterprise Survey of 2013, this research relies on firm-level data for 698 firms and 37 explanatory variables. Feature importance analysis based on SHAP values indicates that firm location in a business city, sales revenue, number of full-time employees, firm experience in the sector, and internal and external finance are the most significant factors that positively affect SME formalization. This paper can serve as a reference for actors and stakeholders in business and entrepreneurship when designing policies to drive business formalization in Tanzania.


Introduction
In recent years, the formalization of small and medium enterprises (SMEs) has increased. Ishengoma (2018) reported an increase in the number of SMEs formalizing by registering with the respective government authorities. Government organs have also taken various initiatives to promote formalization. For instance, the government's property and business formalization program and the government's Vision 2025 aim to push the registration of businesses currently operating informally (Hamisi, 2021). A higher rate of business formalization indicates that business owners understand the importance of registering their businesses and the economic advantages that registration brings at the firm and national levels (Rand & Torm, 2012). Business formalization can be understood as the action of a firm graduating from being informal to being formal. In Tanzania, a firm or business is considered formally registered if it obtains a business license from the local authority and is registered as a taxpayer with the revenue authority. Depending on the nature and size of the firm, especially for medium and large firms, the law requires registration with the Business Registration and Licensing Agency (BRELA) as a sole proprietorship, company, partnership, or trust (Ishengoma, 2018). This differs from informal businesses, which operate in unknown areas, do not pay taxes or levies, and are therefore considered to be outside the government business system (Ishengoma & Kappel, 2007). Accordingly, following recent studies on formalization such as Cling et al. (2012) and Ishengoma (2018), this paper defines an informal SME as an entity functioning without being registered by government authorities and with no legal permission to run the business.
This paper investigates the features that may influence the formalization of small and medium enterprises in Tanzania. I examine contingent and institutional factors in predicting the formalization of SMEs using machine learning (ML) methods. To address the topic, the study uses the logit method as a baseline model and ML models such as Random Forest, Decision Tree, and eXtreme Gradient Boosting (XGBoost) to arrive at an optimal method for predicting SME formalization. ML can fit a functional form from the data and accommodate flexible relationships between predictors and outcome variables. I created a prediction model and, through the literature and the recursive feature elimination method, identified crucial variables that were considered predictors of an SME's decision to become formal or remain informal. I apply this ML approach to the enterprise survey data set from the World Bank library. The data set contains rich information on the registration of firms and on various contingent and institutional factors among firms in the manufacturing and service sectors in Tanzania. To the best of my knowledge, this is the first study that uses the ML approach and the World Bank Enterprise Survey to predict SME formalization in Tanzania.
The formalization of firms has been researched worldwide, and different conclusions have been reached. The study by Benhassine et al. (2018) in Benin showed that it is important to encourage informal firms to become formal. It also indicated that sales amounts have a negative influence on a firm's formalization. The same results were reported in the study by Rocha et al. (2018) in Brazil, which also highlighted that the costs of formalization negatively affect a firm's formalization. Tax rates are also identified in several studies as an important factor that influences firms' formalization (Piza, 2018; Rocha et al., 2018; Etim & Daramola, 2020; Campos et al., 2023). In Africa, the formalization agenda is attracting the interest of scholars because business formalization improves not only the economic development of a country but also the inclusiveness of the economy among the people and the government of the respective country. In this regard, some studies indicate factors that might push the registration of firms, especially SMEs, in Africa. For instance, in Tanzania, Ishengoma (2018) found that factors such as male owners aged between 31 and 50, having at least secondary education, and motivation positively affect SME formalization. On the other hand, firm age and location negatively affect the formalization decision in Tanzania. The study by Nelson & De Bruijn (2005) in Tanzania concluded that formalization cost, government procedures, unawareness of tax administration, and access to premises influence firms' formalization. Guma (2015) studied business formalization in Uganda and pointed out that male enterprise owners, tax policies and regulations, corruption, access to capital, political pressure, and instability tend to influence enterprise formalization decisions. In addition, García-Bolívar (2006) indicated that tax rates, corruption, government pressures, registration costs, and access to finance are the most important factors influencing business formalization. Moreover, Charman et al. (2013) studied enforced formalization in South Africa and found a positive relationship between a firm's formalization decision and both total sales and the manager's experience in liquor retailing. Furthermore, the study by Arimah (2001) in Nigeria identified subcontracting agreements, firm size (large size), and access to finance as important features for a firm to become formal.
Unlike the prior studies above, which use conventional statistical approaches, this paper adopts the machine learning (ML) approach to identify the most vital predictors that may influence SME formalization in Tanzania. The crucial difference between the two approaches is that, in a conventional statistical model, the researcher assumes the model from which the data were generated, and the unknown parameters of that model are then estimated from the data (Ley et al., 2022). Because conventional statistical models may lead to inaccurate results, the ML approach instead tunes the parameters during training to give the best prediction of the outcome variable from the features without assuming the model (Ley et al., 2022). ML algorithms can mitigate overfitting and multicollinearity problems and can also search for functions that are good predictors out of sample (Lee & Lee, 2021). For this reason, ML is a more suitable approach to explain the important factors influencing enterprise formalization in Tanzania using an enterprise survey from the World Bank. The research questions of this paper are: What motivates SME formalization in Tanzania? What are the most important factors that drive the formalization of enterprises in Tanzania? The paper contributes to the literature by applying ML to business formalization in order to accurately predict and identify the factors that influence SMEs to become formal during their operation. It also adds knowledge on the use of World Bank microdata to research different aspects of SMEs in different economies, especially the experience of SMEs in developing economies such as Tanzania.
The remaining sections of this paper are organized as follows. The literature review covers key terms of the paper, including SMEs, formalization in Tanzania, and contingent and institutional factors. The methodology section then describes the data source, descriptive statistics, variable definitions, the types of ML models used, and their motivation. The results section covers feature selection, model tuning, model performance, and interpretation. Finally, the paper concludes, offers recommendations, and highlights its limitations.

SMEs formalization in Tanzania
Economies all over the world have started concentrating more on the formalization of small and medium enterprises (SMEs). This is due to the globalization of trade and the efforts made by the private sector in many developing countries. In addition, the sector has become a substantial element of economic development and employment creation (Anderson, 2017). In Tanzania, the sector employs about four million people, accounting for 20 to 30 per cent of the total labor force and contributing between 35 and 45 per cent of the country's gross domestic product (Anderson, 2011).
Although the definition of SMEs differs from country to country, in Tanzania's context, micro, small, and medium enterprises are defined by URT (2003) in its SME Development Policy. According to the policy, a micro-enterprise is one with 1-4 employees and maximum capital of 5 million; a small enterprise has 5-49 employees and capital from 5 to 200 million; a medium enterprise has 50-99 employees and capital from 200 to 800 million; and a large enterprise has more than 100 employees and capital of over 800 million.
According to Ishengoma & Kappel (2007), the formalization of business in Tanzania occurs when a firm decides to graduate from informal to formal. An informal firm has been explained in terms of nature, size, legal status, location, nature of employees, resource endowment, and land. Such enterprises are informal because they operate outside the government laws and system, even though they produce and sell lawful products or services (Spring, 2009; Benhassine et al., 2018). Formalization, on the other hand, occurs when an enterprise acquires important certificates, such as a business license, from the relevant government organs. While operating, it should then continue to adhere to the government regulations regarding the business (Cling et al., 2012; Campos et al., 2023). This paper focuses on the factors that influence enterprises to graduate from the informal to the formal state of operation. It specifically aims to identify which category of factors, contingent or institutional, may drive enterprises to become formal.

Contingent Factors
The contingent factors referred to in this paper are anything an enterprise cannot accurately predict or plan (Sauser, 2009). According to the contingency theory literature, contingent factors such as technology, culture, the size of an enterprise, and the environment affect the design and functioning of firms. This paper considers, among others, the way firm size, competition, corruption, crime, theft, culture, environment, and technology affect the decision of SMEs to become formal.

Institutional Factors
Conversely, institutional factors are associated with institutional pressure enforced through normative and coercive mechanisms that influence revenue growth (Mbelwa, 2015). Institutional factors are associated with the social, political, regulatory, and cultural aspects that shape organizational behavior in form and process (Scapens, 2006). Institutional factors define the players in a given institutional context. For example, the actors involved in government procedures for business registration, tax administration, and the setting of tax rates to be paid by firms act as stipulated by an institutional framework that includes a set of formal and informal social, legal, and political rules and norms (Scott, 1987). Therefore, political and administrative or technical actors are powerful, and their behavior is highly influenced by the institutionalized budget framework (Mzenzi, 2013). Following the institutional theory literature, various institutional factors are tested using ML techniques to investigate their influence on SME formalization. These include pressure from government regulations, difficulties in registration, tax rates, tax administration, business licensing and permits, transport, access to finance, access to land, female ownership, ownership structure, and availability of a skilled workforce, to mention a few.

Research Methodology

Data Source
This paper uses enterprise survey data from the World Bank microdata library. The World Bank conducted this enterprise survey in different years, including 2006, 2009, and 2013 (https://doi.org/10.48529/rgvk-7f42). This data set, which the World Bank recently updated, supports research within the respective country and worldwide (World Bank, 2022). The data set covers firms from different categories such as textile, food, medical and chemical, hotel, transport, etc. The source also includes various contingent and institutional factors such as firm size, age, corruption, tax issues, registrations, etc. It also contains financial information such as working capital financing, total sales, and expenses in business operations. The enterprise survey covers small, medium, and large firms.
To make the study reliable and valid, I decided to use the enterprise survey of 2013, which took a larger sample of firms than the previous enterprise surveys. The survey was conducted on 698 firms in big cities of Tanzania, namely Dar es Salaam, Arusha, Mbeya, Mwanza, and Zanzibar.

Definition of Variables and Descriptive Statistics
The data set has several variables connected with SMEs in Tanzania. They come from different disciplines but are categorized as institutional or contingent. The assortment of variables used as core factors for predicting SME formalization was initially grounded in insights from the literature review. Additional variables of interest related to SME formalization were then added based on the author's knowledge. Finally, of the sixty-one variables selected by these criteria, I retained thirty-seven variables that were selected using the recursive feature elimination method, as discussed in the feature selection section. Table 1 below lists the final variables after selection, their definitions, and descriptive statistics. The overall descriptive results show that the percentage of working capital financed from internal funds, the per cent of total annual sales of the product or service, the year the firm was formally registered, the number of permanent, full-time employees at the end of the last fiscal year, and the percentage of the firm owned by the largest owner(s) have higher mean values than the remaining variables.

Definition of Sample
From the sixty-one variables that were initially available, I selected thirty-seven features using recursive feature elimination with five-fold cross-validation; these were employed in this study to investigate the features influencing SME formalization in Tanzania. The full data set contains 698 observations. I randomly divided the full sample into a training set of 80 percent for fitting the model and a test set of 20 percent for evaluating the model's predictions. The training set therefore has 558 observations, and the test set has 140 observations.
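The split sizes reported above can be reproduced with a minimal sketch. It uses only the Python standard library; the function name `split_indices`, the seed, and the rounding rule are illustrative assumptions, not the author's actual procedure.

```python
import random

def split_indices(n_obs, test_share=0.2, seed=0):
    """Randomly partition row indices 0..n_obs-1 into train and test lists."""
    indices = list(range(n_obs))
    # A seeded generator keeps the shuffle reproducible (the seed is arbitrary).
    random.Random(seed).shuffle(indices)
    # Assumption: the test-set size is the share rounded to the nearest integer.
    n_test = round(n_obs * test_share)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(698)
print(len(train_idx), len(test_idx))
```

With 698 observations this yields 558 training rows and 140 test rows, matching the counts in the text.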

Machine Learning Techniques
Different machine learning (ML) algorithms are discussed in this section. The paper uses ML techniques to improve the prediction of the formalization of firms in Tanzania. The baseline model is Logit regression, which represents taste variation related to observed characteristics (Bhat, 1998). Since Logit regression is a variant of the standard linear model, I further consider ML prediction models, which allow non-linear relations, capture complex effects of the covariates, and improve accuracy (Hastie et al., 2009).
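For reference, the standard form of the Logit baseline can be written as follows. The notation is an assumption introduced here for illustration: $y_i = 1$ denotes a formal firm, $x_i$ the vector of the firm's features, and $\beta_0, \beta$ the coefficients to be estimated.

```latex
\Pr(y_i = 1 \mid x_i) \;=\; \frac{1}{1 + \exp\!\left(-\left(\beta_0 + x_i^{\top}\beta\right)\right)}
```

The features enter only through the linear index $\beta_0 + x_i^{\top}\beta$, which is why this baseline cannot represent non-linear relations unless they are added by hand.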
Moreover, other ML models such as Random Forest, XGBoost, and Decision Tree are suitable because Logit and other linear regression models did not perform well when tuned on the data and hence had a problem with overfitting (Cortes & Vapnik, 1995). On top of that, the results of these tree-based algorithms are easy to understand and interpret, which is important in picking the best model (Shahram et al., 2021). The selected models are also more complex and can find more interesting patterns in the data, which leads to better performance. Finally, because the data set has many features, the complexity of the model also increases (Rundo et al., 2019); these models can handle many features using built-in feature selection methods.
Moreover, using the same features, I compared Logistic regression with non-linear models such as Random Forest, XGBoost, and Decision Tree to check each algorithm's performance and determine the technique that most accurately predicts formalization decisions among SMEs in Tanzania. Among these algorithms, Logit regression is known for its strength in building classification models in linear form. In contrast, the remaining algorithms have strengths in handling multicollinearity and overfitting. In addition, because some of the numeric variables in the data set have missing values, the non-linear algorithms, especially XGBoost, were used for their ability to handle missing values (Ding et al., 2018; Lee & Lee, 2021). I used the F1 and accuracy measures to evaluate which models outperform the others. I then used the best model to identify important features using SHAP values analysis. While F1 and accuracy are both popular performance metrics for classification models, the difference between the two is that F1 is the harmonic mean of precision and recall on the positive class, whereas accuracy measures the share of correctly classified observations across both the positive and negative classes (McKee & Weber, 2021). Additionally, accuracy is appropriate when the class distribution is balanced, whereas the F1 score is appropriate for imbalanced classes (Chicco & Jurman, 2020). For both measures, the higher the value, the better the model is at classifying observations into categories.
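The contrast between the two metrics can be illustrated with a short sketch. The labels below are made-up toy data, not values from the survey, and the helper name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 on the positive class) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy, made-up labels: 2 formal firms (1) and 8 informal firms (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_informal = [0] * 10                      # never predicts the positive class
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # finds one of the two formal firms

acc_a, f1_a = accuracy_and_f1(y_true, always_informal)
acc_b, f1_b = accuracy_and_f1(y_true, one_hit)
print(acc_a, f1_a)
print(acc_b, round(f1_b, 3))
```

In this imbalanced toy case, the classifier that never predicts formalization still reaches an accuracy of 0.8 while its F1 score is 0, which is the reason for reporting both measures.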

Feature Selection
Initially, there were 61 predictors in the data set. I selected 37 features with institutional and/or contingent behavior using an ML algorithm. To identify the variables providing the highest accuracy, I adopted recursive feature elimination with five-fold cross-validation and hyperparameter tuning for each model. The selected features include, among others: whether the entity is in a business city, the number of full-time employees of the firm, the percentage of working capital financed from internal funds, the percent of total annual sales of the product or service, and the year in which the firm was formally registered. They also include whether the firm communicates with clients and suppliers by e-mail, the years of experience of the top manager in the sector, whether the financial statements are certified by an external auditor, the percentage of working capital financed by local sources, and firm age. Moreover, in tuning the model, I computed prediction measures before and after feature selection. Table 2 below reports the performance results before and after feature selection.

Algorithm Selection
All the machine learning algorithms were developed using the training data, which were randomly split into five folds for cross-validation using stratified sampling. The models were trained on all but one of the folds, and performance was measured on the fold left out of the training process. The prediction accuracy was computed based on majority voting across the five fold runs, and the hyperparameters that maximized precision were used to build the final model.
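The stratified five-fold scheme described above can be sketched as follows. This is a simplified standard-library illustration rather than the author's implementation; the function name `stratified_folds` and the class counts (300 formal, 258 informal among the 558 training rows) are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign row indices to folds so each fold keeps roughly the overall class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so its members are spread evenly over folds.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds

# Hypothetical class counts for the 558 training rows (1 = formal, 0 = informal).
labels = [1] * 300 + [0] * 258
folds = stratified_folds(labels)

for k, held_out in enumerate(folds):
    # Train on every fold except fold k; evaluate on the held-out fold k.
    train_rows = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(k, len(train_rows), len(held_out))
```

Each row appears in exactly one held-out fold, so every observation in the training data is used for validation once.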

Comparison of Predictive Performance among Machine Learning Algorithms
I primarily consider formalization as a binary outcome variable for the main results, and the 37 selected variables are the main features considered in this paper. I use 80 percent of the data set as the training set to tune the ML models and test their out-of-sample performance on the remaining 20 percent. I then gauge the predictive performance of each ML model using the F1 and accuracy measures. Table 3 below reports the performance of the models. It shows that the logistic regression model produces an F1 score of 64.7 percent and an accuracy rate of 69.3 percent. The decision tree has an F1 score of 69.4 percent and an accuracy of 71.4 percent. XGBoost achieves 73.2 percent for the F1 score and 75.7 percent for accuracy. Overall, Random Forest has the highest predictive power for SME formalization, with an accuracy rate of 94.3 percent and an F1 score of 93.8 percent.

Interpretation
The model performance results show that the Random Forest model outperforms XGBoost, Decision Tree, and Logit regression, with higher accuracy and F1 values. This suggests that the Random Forest technique overcomes the overfitting issue and is the best model for classifying observations into classes. In this regard, it is more accurate in predicting business formalization than the other models. The logistic regression model performed poorly because it could not find more interesting patterns in complex data with many features. Additionally, the model could not fit data with non-linearities between the predictors and the outcome variable.
In terms of feature importance analysis, I ranked the features from the largest gain to the smallest gain in the Random Forest algorithm used to predict business formalization in Tanzania. The gain indicates the relative contribution of the corresponding feature to the model. A higher value of this measure suggests that the feature is more important for prediction than another feature. As per Figure 1 below, a dummy variable for whether the firm is located in a business city is the leading factor in SMEs' decision to formalize their businesses, with a positive impact value of 0.07. A possible reason is that business cities such as Dar es Salaam are where most SMEs are located, and the government authority has regulations for coordinating their business operations, which ultimately necessitate formalizing the business. It is also noted that the number of full-time employees motivates formalization. As expected, the firm's internal finance and sales also facilitate formalization. This can be due to the costs incurred during formalization, such as the business license, payment of the initial tax instalment, rental costs, etc. Other factors are crime and theft, the experience of the firm in the sector, external finance, year of establishment, and whether a certified accountant checks the final accounts. According to the SHAP values, features are ranked in descending order. The x-axis shows the sign of the effect on SME formalization, and the magnitude of the effect is given by the dot's color. That is, blue dots indicate a negative effect, whereas red dots indicate a positive effect.
Figure 2 above reveals that whether the firm is located in a business city is the most influential factor pushing the model to predict SME formalization in Tanzania. The other nine of the ten most important features are listed in rank order. These are the number of full-time employees, internal finance and sales amounts, crime and theft, the experience of the firm in the sector, external finance, whether a certified accountant checks the final accounts, ICT issues (whether a firm has its own website and whether it communicates with clients through e-mail), and firm age. On top of that, it can be observed from Figure 2 that the color of the dots changes either from blue (negative relationship) to red (positive relationship) or from red to blue. This indicates a non-linear relationship between the key factors and the outcome variable.
The results on enterprise location indicate a positive effect, contrary to the findings of Ishengoma (2018), which revealed a negative relationship between the formalization decision and firm location in Tanzania. However, previous studies by García-Bolívar (2006) and Arimah (2001) reached the same results as this paper: internal or external access to finance, firm size, and firm age are important factors in business formalization. Additionally, the results of this paper are consistent with those of Charman et al. (2013), who found a positive relationship between business formalization and both managers' experience in the sector and the firm's sales turnover during the year.

Conclusion
This study investigates the effect of various features on SME formalization in Tanzania. Using ML strategies, I have identified the variables that most strongly predict SME formalization. I then evaluated the extent to which each variable contributes to the predicted SME formalization. The ML results reveal that the formalization of businesses in Tanzania is highly connected with institutional factors such as firm location in a business city, sales revenue, number of full-time employees, the experience of the firm in the sector, and internal and external finance. Moreover, the contingent factors found to influence formalization are crime and theft, obstacles to licensing, and telecommunications.
The results of this paper suggest three potential recommendations. Firstly, because the results reveal the institutional factors that can facilitate an enterprise's formalization, the study recommends that firm owners and operators invest more time, skills, and competency to ensure smooth internal operations. In particular, they should find more sources of funds appropriate to the nature of the business and try to have skilled, if not competent, people to assist in operating the business. Secondly, since the location of the firm in a business city is the most influential factor for SMEs to register and become formal, the relevant government authorities, such as local governments and the President's Office responsible for regional administration and local government, should improve the business environment for SMEs by constructing important infrastructure so that regions that are not cities become comparable to cities. This will motivate SMEs to operate their businesses smoothly, increase their sales, and decide to become formal. Last but not least, stakeholders such as financial institutions should develop more strategies to finance SMEs; this will improve their sustainability and hence raise the likelihood of their becoming formal. Low liquidity or revenue discourages some SMEs from registering their firms. The institutional literature suggests that firms with high revenue volumes tend to become formal without hesitation and to accommodate the operating rules for registered firms, such as complying with tax rules, financial regulations, and other institutional procedures.
Although this paper adds to the existing body of knowledge, it did not search extensively over potential hyperparameters. Additionally, it used only five-fold cross-validation for all models. Using more than five folds may produce different results. This is an opportunity for further research in which a researcher can use more than five folds of cross-validation, more extensive hyperparameter tuning, and performance metrics other than accuracy and the F1 score to obtain a more precise prediction of business formalization. Nevertheless, the author is confident that the results are sufficient to explain the variables that predict SME formalization in Tanzania.

Figure 1: Feature Importance by Gain for the First 37 Features

On top of the feature importance analysis, I also conducted a SHapley Additive exPlanations (SHAP) values analysis to interpret the results from the Random Forest model. As indicated by Lundberg & Lee (2017), SHAP values explain a specific prediction as a sum of each feature's effects on a conditional expectation. Thus, SHAP has a global interpretability advantage by indicating how much each feature contributes to the target variable (Schalck & Schalck, 2021). From this perspective, Figure 2 below indicates the global importance of the features in predicting SMEs' formalization decisions.
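The additive decomposition described by Lundberg & Lee (2017) can be written out as below. The notation is an assumption made here for illustration: $f$ is the fitted model, $M$ the number of features (37 in this paper), $\phi_j(x)$ the SHAP value of feature $j$ for observation $x$, and $n$ the number of observations. The second line, the mean absolute SHAP value, is one common way to summarize global importance and is not necessarily the exact ranking rule used for Figure 2.

```latex
f(x) \;=\; \phi_0 + \sum_{j=1}^{M} \phi_j(x), \qquad \phi_0 = \mathbb{E}\!\left[f(X)\right]

I_j \;=\; \frac{1}{n} \sum_{i=1}^{n} \left|\phi_j(x_i)\right|
```

A positive $\phi_j(x)$ moves the prediction for that firm above the average prediction $\phi_0$, and a negative value moves it below.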

Figure 2: Importance of the Features Using SHAP Values Analysis

Table 2: Performance Results Before and After Feature Selection

Table 3: Performance Measures Between ML and Logit Models