2. Materials and Methods
In this section, we describe how the text mining process is implemented. First, we collect the data and preprocess them into the required format (Section 2.1). Then, we combine clustering with bibliometric analysis to identify the major academic branches of design research and summarize each branch (Section 2.2). Finally, we describe how a two-dimensional text mining approach, comprising bibliometric and network analysis, is used to detect research trends from different perspectives (Section 2.3).
2.1. Data Preparation
To delimit the research goal of this paper, we define design as follows: design may be a substantive reference to the aesthetic, functional, economic and sociopolitical dimensions of both the design object and the design process. The data are provided by Web of Science (WOS). The retrieval strategy is as follows: the subject is "design research" or "design studies"; the publication period is 2004 to 2015 (the data were retrieved between May 2016 and June 2016); the document type is "Article"; the language is "English"; and the category is refined to "engineering multidisciplinary or engineering manufacturing or engineering industrial". To better match the definition of design, we expand the set with papers from the main journals discussed in Quality perceptions of design journals: The design scholars' perspective. We exclude all papers from obviously unrelated journals, for example Designs, Codes and Cryptography. In total, 20,218 publications are obtained. We further divide the data into two parts: BASE data and Citation data. BASE data store the following items: title, authors, journal name, keywords, abstract and publishing date. These items are fed into the BASE table. Citation data (i.e., references cited by the articles included in the core dataset) store the following items: the cited article's title, author and journal, the total number of citing articles, and the number of citing articles per year during 2004–2015. Similarly, the Citation data are fed into the Citation table.
2.2. Identification of Major Academic Branches
- Preprocess the document set, including word segmentation and stop-word removal.
- Use Latent Dirichlet Allocation (LDA) to extract features and determine the optimal number of topics (features).
- Run the K-means algorithm for k = 2, 3, 4, 5, 6 and 7, then combine the Sum of Squared Error (SSE) and the inter-cluster distance to select the best clustering result. We use the distance between cluster centroids to measure inter-cluster distance. The SSE is calculated as in Equation (1).
\(\mathrm{SSE}=\displaystyle \sum_{i=1}^{k} \displaystyle \sum_{d_{j} \in C_{i}}\left\|d_{j}-\operatorname{cen}_{i}\right\|^{2}\)
(1)
where \(k\) is the number of clusters, \(C_i\) is cluster \(i\), \(d_j\) is the document \(j\) that belongs to cluster \(i\) and \(\operatorname{cen}_i\) is the centroid of cluster \(i\).
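The model-selection step above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes each document is already represented by a numeric feature vector (for example, its LDA topic proportions), uses a plain Lloyd-style K-means with a single random initialization, and reports the SSE of Equation (1) together with the smallest distance between cluster centroids. The function names and the toy vectors are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(d, centroids[i]))
                      for d in docs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def sse(docs, centroids, labels):
    """Equation (1): squared distance of each document to its own centroid."""
    return sum(sq_dist(d, centroids[lab]) for d, lab in zip(docs, labels))

def min_centroid_distance(centroids):
    """Smallest pairwise distance between cluster centroids."""
    return min(math.dist(a, b)
               for i, a in enumerate(centroids) for b in centroids[i + 1:])

# Invented document-topic vectors, for illustration only.
docs = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15], [0.10, 0.90],
        [0.20, 0.80], [0.15, 0.85], [0.50, 0.50], [0.55, 0.45]]

for k in range(2, 8):
    cents, labs = kmeans(docs, k)
    print(k, round(sse(docs, cents, labs), 4),
          round(min_centroid_distance(cents), 4))
```

Because SSE generally decreases as k grows, the two numbers are read together: a good k gives a low SSE while the centroids remain well separated. In practice several random restarts per k would make the comparison less sensitive to initialization.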
We briefly introduce LDA below. LDA is a hierarchical Bayesian model for corpora of documents that can be represented as bags of words. Its generative process assumes a set of topics: each document is sampled from a mixture of topics, and each topic is a discrete probability distribution that defines how likely each word is to appear in that topic. Because the number of topics affects how well the LDA model fits, we use perplexity, a common evaluation criterion, to determine the optimal number of topics. Normally, the lower the perplexity, the better the LDA model performs. Perplexity is calculated as follows,
\(\operatorname{perplexity}(D)=\exp \left(-\frac{\displaystyle \sum_{i=1}^{n} \log \left(p\left(d_{i}\right)\right)}{\displaystyle \sum_{i=1}^{n} N_{i}}\right)\)
where \(n\) is the number of documents, \(N_i\) represents the length of the document \(d_i\) and \(p(d_i)\) is the probability that the LDA model generates the document \(d_i\).
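As a small worked example of the perplexity formula, the sketch below computes it from per-document log-probabilities and document lengths. It assumes the values \(\log p(d_i)\) have already been obtained from a fitted model; the function name `perplexity` and the toy inputs are illustrative only.

```python
import math

def perplexity(log_probs, lengths):
    """exp(-sum_i log p(d_i) / sum_i N_i) for a set of documents."""
    if len(log_probs) != len(lengths):
        raise ValueError("one log-probability and one length per document")
    return math.exp(-sum(log_probs) / sum(lengths))

# Sanity check: a model that gives every word probability 1/8 has
# log p(d_i) = N_i * log(1/8), so its perplexity is 8 for any lengths.
lengths = [3, 5]
log_probs = [n * math.log(1 / 8) for n in lengths]
print(perplexity(log_probs, lengths))
```

Under this reading, perplexity behaves like an effective vocabulary size per word, which is why lower values indicate a better fit. Candidate numbers of topics can be compared by evaluating the same function on each fitted model.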
There are many methods to estimate the model parameters; we use the Gibbs sampling algorithm in this paper. The standard approach to "smoothing" the multinomial parameters is to assign positive probability to all vocabulary items, whether or not they are observed in the training set.
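The effect of smoothing can be illustrated with the usual point estimate of a topic's word distribution under a symmetric Dirichlet prior, \((n_{w}+\beta)/(n+V\beta)\), where \(n_w\) is the number of times word \(w\) is assigned to the topic, \(n\) is the total count for the topic and \(V\) is the vocabulary size. The prior \(\beta\), its value and the function name are assumptions made for this sketch; the paper does not report its hyperparameters.

```python
def smoothed_topic_word_probs(counts, vocab_size, beta=0.01):
    """Estimate p(word | topic) from assignment counts with Dirichlet smoothing.

    counts: dict mapping word id -> number of tokens of that word
            assigned to the topic (for example, at the end of Gibbs sampling).
    beta:   symmetric prior (assumed value); any beta > 0 gives unseen
            words a non-zero probability.
    """
    total = sum(counts.values())
    denom = total + vocab_size * beta
    return [(counts.get(w, 0) + beta) / denom for w in range(vocab_size)]

# Words 1 and 3 were never assigned to this topic, yet receive positive mass.
probs = smoothed_topic_word_probs({0: 3, 2: 1}, vocab_size=4, beta=0.5)
print(probs)
```

Without the `beta` term, any document containing an unseen word would receive probability zero, which would make the perplexity undefined.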
Based on the clustering results, we calculate traditional bibliometric indicators such as total outputs, total citations and citation-based impact to provide a structural overview of design research.
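A minimal sketch of how such indicators might be aggregated per academic branch is given below. It assumes each paper carries a cluster label and a total citation count; citations per paper is used here only as a simple stand-in for citation-based impact and is not necessarily the measure used in this study.

```python
from collections import defaultdict

def branch_indicators(papers):
    """papers: iterable of (branch_label, citation_count) pairs."""
    stats = defaultdict(lambda: {"outputs": 0, "citations": 0})
    for branch, cites in papers:
        stats[branch]["outputs"] += 1
        stats[branch]["citations"] += cites
    for s in stats.values():
        # Assumed impact proxy: mean citations per paper in the branch.
        s["citations_per_paper"] = s["citations"] / s["outputs"]
    return dict(stats)

# Invented records, for illustration only.
print(branch_indicators([(0, 10), (0, 0), (1, 5)]))
```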
2.3. Detection of the Research Trends of Design Research
In this subsection, we describe how a two-dimensional text mining approach, comprising bibliometric and network analysis, is used to detect research trends from different perspectives. The bibliometric analysis assesses trends in academic output and the development of the design research area. The network analysis identifies research trends in each academic branch of design research and the evolution of core research themes.
Bibliometric analysis involves total outputs and citation-based impact assessment. The number of research publications reflects the impact and usefulness of scientific research output. Additionally, the citation count is a good measure of the influence of a research paper: a high citation count indicates greater effectiveness, usefulness and productiveness. A previous study computed the values of the Citation Function (CF) and the Co-Citation Function (CCF) to identify changes in thematic clusters over time. The profiles of the two functions are similar, and CCF is more expressive. However, CCF requires a large amount of data collection and computation. Therefore, we apply an approach based on CF, defined as follows,
\(\mathrm{CF}_{t}=\displaystyle \sum_{i=1}^{N} C_{i, t}\left(\frac{C_{i}}{\max C}+\frac{P_{i}}{\max P}\right)\)
where \(N\) is the number of documents in the analyzed thematic cluster, \(C_{i,t}\) is the number of citations of document \(i\) in year \(t\), \(C_i\) is the total number of citations of document \(i\), \(\max C\) is the maximum number of citations among the analyzed documents, \(P_i\) is the period during which document \(i\) has been cited and \(\max P\) is the maximum period during which a document can be cited for the data under analysis. The value of CF represents the development level of a research domain and depends only on the number of citations. It makes developing research areas easy to identify and offers an alternative view of the subject's development.
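The CF computation for one thematic cluster and one year can be sketched as below. The record layout is an assumption: each document is a dictionary with its yearly citation counts (`by_year`) and its citation period \(P_i\) in years (`period`), and \(\max P\) is passed in explicitly. \(C_i\) is taken as the sum of the yearly counts.

```python
def citation_function(docs, year, max_period):
    """CF_t = sum_i C_{i,t} * (C_i / max C + P_i / max P) for one cluster.

    docs: list of {"by_year": {year: citations}, "period": P_i} (assumed layout).
    max_period: longest period a document could be cited in the data.
    """
    totals = [sum(d["by_year"].values()) for d in docs]  # C_i
    max_c = max(totals, default=0)
    if max_c == 0:
        return 0.0  # no citations at all in this cluster
    cf = 0.0
    for d, c_i in zip(docs, totals):
        c_it = d["by_year"].get(year, 0)  # citations received in year t
        cf += c_it * (c_i / max_c + d["period"] / max_period)
    return cf

# Invented cluster of two documents, for illustration only.
cluster = [
    {"by_year": {2010: 2, 2011: 2}, "period": 2},
    {"by_year": {2011: 1}, "period": 1},
]
print(citation_function(cluster, 2011, max_period=4))
```

Evaluating the function for each year of the study window gives the CF profile of a cluster over time. Each year's citations are weighted more heavily for documents that are both highly cited and cited over a long period.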
Network analysis is carried out to further analyze the relations among the core themes. Keywords carry important information about research trends. In this paper, high-frequency keywords are extracted from the abstracts and titles of the scientific literature in each academic branch. Networks based on the keyword co-occurrence matrix are then constructed with Ucinet 6.0 to visualize the internal dynamics of the major academic branches in the design research area.
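The construction of a keyword co-occurrence matrix can be sketched as follows. This is an illustration under stated assumptions: keywords have already been extracted per document, the `top_n` most frequent keywords are retained, and two keywords co-occur when they appear in the same document. The function name and the toy keyword lists are hypothetical, and the network visualization itself is left to a separate tool.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(doc_keywords, top_n):
    """Return (keywords, matrix) for the top_n most frequent keywords.

    matrix[i][j] is the number of documents containing both keyword i
    and keyword j; the diagonal is left at zero.
    """
    # Document frequency: count each keyword at most once per document.
    freq = Counter(kw for kws in doc_keywords for kw in dict.fromkeys(kws))
    top = [kw for kw, _ in freq.most_common(top_n)]
    index = {kw: i for i, kw in enumerate(top)}
    matrix = [[0] * len(top) for _ in top]
    for kws in doc_keywords:
        present = sorted({index[kw] for kw in kws if kw in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return top, matrix

# Invented keyword lists, for illustration only.
docs = [["sustainability", "product design"],
        ["sustainability", "product design", "innovation"],
        ["sustainability", "innovation"]]
keywords, matrix = cooccurrence(docs, top_n=3)
print(keywords)
for row in matrix:
    print(row)
```

The resulting symmetric matrix is the kind of input from which a keyword network is drawn: keywords become nodes and non-zero entries become weighted edges.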