I recently had the pleasure of giving a guest talk at the 2015 Empirical Research Methods in Informatics (ERMI) Summer School, hosted by Harald Störrle at the Technical University of Denmark. The purpose of my talk was to share some of the insights I had gathered as a PhD student conducting his first Systematic Literature Review (SLR). Or, in other words, to hopefully spare attendees the pain of making the same mistakes I had made. As SLRs are quickly gaining acceptance as a research method in Software Engineering, this might be a topic of more general interest – so I decided to make it the topic of my very first blog post.
If, like me in 2013, you have no clue what an SLR is, feel free to think of it as a regular literature review “with a twist”. The twist is that every step of the review (deciding on a motivation and scope, searching for publications, selecting publications to include, extracting relevant data from the selected publications, and reporting on the findings) is clearly described in a review protocol, a document one prepares before embarking on the review proper. An SLR has several advantages over a regular literature review, the chief of which is that it produces reliable, science-grade results, as opposed to an expert opinion. Other advantages are repeatability and a much higher confidence that all relevant literature is covered. Of course, these advantages come at a price: an SLR is usually much more time consuming than a regular literature review. The de facto standard guidelines for conducting an SLR in Software Engineering were published a few years ago by Kitchenham et al., and I strongly recommend reading them for a more thorough introduction to the topic.
SLR or SMS?
One of the first things to decide when embarking on an SLR is its scope – the precise topic under review. This was also the first sticking point for me, as I started out with the rather misguided idea that I could write an SLR on model transformation languages, the broad area of my PhD work. I estimate that there are currently over a hundred such languages in existence, with many hundreds or even thousands of relevant publications addressing them. Covering all of these languages in an SLR, which is generally understood to also address qualitative aspects (i.e. reading the publications in some detail), is very likely a tall order.
Pitfall #1: Adopting an excessively wide scope for the SLR.
Faced with this challenge, my decision was to convert the SLR into a Systematic Mapping Study (SMS). Unlike an SLR, a mapping study does not address qualitative aspects of the reviewed papers. As its name suggests, it simply aims to map a field of research by finding out what has been published and which gaps remain unaddressed. In this light, an SMS is more of an exploratory endeavor.
An alternative course of action is, of course, to keep the SLR methodology but narrow the scope of the review. In my example regarding model transformation languages, suitable narrower scopes could be debugging support for model transformations or formal verification of model transformations (by the way, someone should actually write these).
Where to search for primary studies
There are two options for performing a search for primary studies: manually searching relevant outlets (journals, conference proceedings), and automatically searching digital libraries (DLs) and indexing databases. I have only ever performed the second type of search, so I cannot give advice on the first.
There are many relevant DLs for Software Engineering research. Apart from publishers’ own libraries, indexing databases such as Inspec, Scopus, and Compendex contain comprehensive bibliographic data. Prior to starting an SLR, I had never heard of these. They are, however, not free to consult.
Tip #1: Consider including indexing databases in your automated search.
Your institution may provide access to a metasearch tool allowing a single search to be executed across many DLs. If such a tool is available, I recommend using it, as it will save a lot of duplicate effort and allow you to circumvent the quirks of individual digital libraries (and there are plenty of quirks, some of which are listed in what follows).
Tip #2: If you have access to one, use a metasearch tool.
The search string
Kitchenham et al. recommend formulating search strings in conjunctive normal form. This essentially means enumerating the search terms in a logical conjunction, while using disjunctions to specify synonyms for each term. As most DLs support these logical operators, this is a good way to systematically build a search string. However, every DL I have used places an upper limit on the number of terms that can be included in a search string. I found this limit to be around 20 terms per string, with slight variations between DLs. If your search string exceeds this limit, you will have to split it into shorter ones and execute several searches.
Tip #3: Long search strings might force you to perform more than one search.
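To make the conjunctive normal form concrete, here is a minimal Python sketch that assembles a search string from groups of synonyms and warns when a term limit is exceeded. The term groups and the limit of 20 are purely illustrative, not taken from my actual protocol.

```python
# Sketch: build a conjunctive-normal-form search string from synonym groups.
# The term groups below are illustrative; adapt them to your own protocol.
synonym_groups = [
    ["model transformation", "graph transformation", "program transformation"],
    ["debugging", "debugger", "fault localization"],
]

MAX_TERMS = 20  # rough per-string limit I encountered; varies between DLs


def to_cnf(groups):
    """OR the synonyms within each group, then AND the groups together."""
    clauses = ("(" + " OR ".join(f'"{t}"' for t in group) + ")"
               for group in groups)
    return " AND ".join(clauses)


term_count = sum(len(group) for group in synonym_groups)
if term_count > MAX_TERMS:
    print("Too many terms for one search string; split it into several searches.")

print(to_cnf(synonym_groups))
# Output (wrapped):
# ("model transformation" OR "graph transformation" OR "program transformation")
# AND ("debugging" OR "debugger" OR "fault localization")
```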
Using wildcards is a tempting way to shorten search strings. For instance, most DLs will match “transformation*” to any string starting with the prefix “transformation”. My experience, however, is that wildcards can considerably increase the number of false positive search results, so I try to avoid them.
One aspect to keep in mind is search term stemming, especially if it is performed by default. Stemming means reducing search terms to their root form, so that, for example, a search for the term “testing” will also retrieve matches for the terms “test” and “tested”. The metasearch tool provided by my university performs stemming by default, which adds a large number of false positive hits to my searches (to make matters worse, stemming cannot be turned off).
Pitfall #2: Beware of search term stemming – it might be performed by default.
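If you want a feel for what stemming does to your terms before running a search, NLTK’s Porter stemmer gives a rough approximation; the actual stemmer used by a given DL or metasearch tool is, of course, unknown to me.

```python
# Illustration: stemming collapses related terms onto one root, which is why
# a stemmed search for "testing" also matches "test", "tests", and "tested".
# NLTK's PorterStemmer is used here as a stand-in for whatever the DL does.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for term in ["test", "tests", "tested", "testing"]:
    print(f"{term} -> {stemmer.stem(term)}")
# All four terms reduce to "test".
```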
Exporting search results
Once you have performed your DL search, you will want to export the results in a convenient format (e.g. RIS, BibTeX, CSV) for further processing in a reference management or spreadsheet tool.
The length of search strings is not the only limit imposed by DLs. Every DL I have worked with enforces a cap on the number of search results that can be exported at a time, usually set at around 1000 results. The only way to export all results is to break the search down into several smaller ones.
Pitfall #3: If your DL search returns more than 1000 results, you will likely not be able to export them in one go.
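If you do have to split a search into several exports, stitching the partial result lists back together is easy to script. The sketch below merges partial CSV exports and removes duplicates; the file name pattern and the “DOI”/“Title” column names are assumptions that will differ from one DL to another.

```python
# Sketch: merge partial CSV exports (one file per split search) and drop
# duplicates by DOI, falling back to a normalized title. The file pattern
# and column names are assumptions; adapt them to your DL's export format.
import csv
import glob

seen, merged = set(), []

for path in sorted(glob.glob("export_part_*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row.get("DOI") or row.get("Title", "")).strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(row)

if merged:
    with open("merged_results.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=merged[0].keys())
        writer.writeheader()
        writer.writerows(merged)

print(f"{len(merged)} unique records merged")
```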
Even more notable is some DLs’ lack of support for bulk export of search results – I’m looking at you, ACM Digital Library. The lack of this feature makes life very hard, if not downright impossible, for researchers conducting SLRs. One workaround I have found is good old web scraping. Thankfully, tools such as Zotero and Mendeley provide browser plug-ins that do their best to extract lists of references from webpages (in my experience, the Zotero plug-in is more accurate). This method is, unfortunately, not bulletproof: some references may not be exported properly and will need further manual processing. Furthermore, the paging feature implemented by DLs means that you will have to go through each page of search results manually to execute the scraping process.
Tip #4: Reference scraping browser plug-ins can be used to export search results from DLs that don’t provide an export feature.
Study selection criteria
After exporting your search results, it’s time to apply the study selection criteria in order to decide which of the original results are truly relevant for your SLR. One aspect that I found confusing about this is the suggested use of both inclusion criteria and exclusion criteria. Isn’t inclusion the dual of exclusion? I ended up using both, while assigning them different roles. I first used exclusion criteria as a fast filter for eliminating irrelevant studies based on their title, abstract, and metadata. I then used inclusion criteria for making the final call on whether a paper is included or not, taking into account the full paper contents as well as any quality conditions specified in the protocol.
Tip #5: Decide if you want to use inclusion criteria, exclusion criteria, or both. If you use both, have a clear definition of their respective roles.
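As an illustration of the two roles, here is a minimal sketch of the selection step; the criteria and the sample record are invented for the example and are not the ones from my protocol.

```python
# Minimal sketch of the two-stage filter: exclusion criteria as a fast
# metadata filter, inclusion criteria as the final full-text decision.
# The criteria and the sample record are illustrative placeholders.

def fails_exclusion(study):
    """Fast filter using only title, abstract, and metadata."""
    text = (study["title"] + " " + study["abstract"]).lower()
    return study["year"] < 2000 or "model transformation" not in text

def meets_inclusion(study):
    """Final call, made after reading the full text."""
    return study["peer_reviewed"] and study["answers_research_question"]

candidates = [
    {"title": "Debugging Model Transformations", "abstract": "...",
     "year": 2012, "peer_reviewed": True, "answers_research_question": True},
]

shortlist = [s for s in candidates if not fails_exclusion(s)]
selected = [s for s in shortlist if meets_inclusion(s)]
print(f"{len(selected)} of {len(candidates)} studies selected")
```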
The more general suggestion I would make here is to apply the selection criteria in increasing order of the amount of time it takes to evaluate them. The goal is to eliminate irrelevant studies quickly and with as little effort as possible, while avoiding the elimination of relevant studies. To further speed things up, I recommend extracting the data of interest from a study immediately after deciding it should be included in the SLR. Coming back to it at a later time for data extraction will impose an additional time penalty, as you will have to read it again to refresh your memory.
Tip #6: Streamline study processing so that you avoid the time-consuming task of “refreshing your memory” regarding a primary study.
Filling in the gaps
Many of the SLRs I have read complement the systematic search process by manually adding primary studies known to be of interest that were not returned by the search, as well as by performing reference snowballing. One of the ERMI participants expressed his concern that this might undermine the value of the original search. My view is that filling in the (sometimes inevitable) gaps in the DL search results is beneficial, as long as the “gaps” don’t turn out to be larger than the search results themselves – that would indicate that an inaccurate search term was used.
Tip #7: Reference snowballing and even manually adding relevant studies to the SLR is not only acceptable, but recommended.
Quality assessment criteria
When assessing the quality of a primary study, it’s important to have a predefined list of quality assessment criteria to evaluate. I found that the criteria suggested by Kitchenham et al. are a good starting point, although their level of detail often exceeds the amount of study design information actually reported in Software Engineering papers. Conducting a pilot study will help you calibrate the quality assessment criteria to the level of study design detail presented in the studies of interest.
Tip #8: Calibrate your quality assessment criteria with the amount of study design details presented in the primary studies – in Software Engineering, this amount is unfortunately rather low.
Data visualization
It is often difficult to find the most expressive type of visualization for a given data set. At the same time, practically all SLRs report on the same kinds of data, such as the numbers of included and excluded studies and the numbers of studies in different categories. Adopting a set of visualization best practices could therefore be beneficial to readers. For instance, Sankey diagrams are a great way of visualizing the outcomes of a study selection process, and bubble charts offer an expressive visualization of study categories.
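As an example of the first suggestion, the snippet below draws a Sankey diagram of a selection process using Plotly; the counts and the choice of library are my own, purely for illustration.

```python
# Sketch: a Sankey diagram of the study selection process, drawn with Plotly.
# All counts are invented for illustration purposes.
import plotly.graph_objects as go

labels = ["Search results", "After exclusion criteria", "Excluded",
          "Included in the SLR", "Not included"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1],        # flows out of nodes 0 and 1
        target=[1, 2, 3, 4],
        value=[320, 880, 75, 245],  # 1200 hits -> 320 shortlisted -> 75 included
    ),
))
fig.update_layout(title_text="Study selection process (illustrative numbers)")
fig.show()
```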
When searching for an appropriate visualization, I often find myself consulting The Data Visualization Catalog, a useful online collection of common types of plots and graphs.
Tip #9: Choose your data visualization methods wisely, possibly consulting a visualization catalog beforehand.
Final thoughts
The nuggets of advice presented above may be obvious to those with even a moderate level of experience with SLRs. However, they would have collectively saved me several weeks of work had I known about them when starting out. I shared them here in the hope that someone reading this post will be able to save some time. After all, time seems to be one of the most valuable resources needed for an SLR.
As a side note, the SLR-turned-SMS on model transformation languages I started two years ago is now shelved, possibly indefinitely. In the end I found the scope simply too large. I have since focused on a narrower scope, which in my view is also more interesting – but that is a topic for another post.