Mining Wikipedia to Rank Rock Guitarists
Author: Muazzam A. Siddiqui
Journal: International Journal of Intelligent Systems and Applications (IJISA) @ijisa
Issue: vol. 7, no. 12, 2015
Free access
We present a method to find the most influential rock guitarists by applying Google's PageRank algorithm to information extracted from Wikipedia articles. The influence of a guitarist was estimated from the number of guitarists citing him/her as an influence and the influence of those guitarists in turn. We extracted this who-influenced-whom data from the Wikipedia biographies and converted it to a directed graph in which a node represented a guitarist and an edge between two nodes indicated the influence of one guitarist over the other. Next, we used the PageRank algorithm to rank the guitarists. The results are interesting and provide a quantitative foundation for the idea that most contemporary rock guitarists are influenced by early blues guitarists. Although no direct comparison exists, the list was validated against a number of other best-of lists available online and found to be mostly compatible.
Keywords: Wikipedia mining, PageRank for people, information extraction, text mining, music data mining
Short URL: https://sciup.org/15010776
IDR: 15010776
Published Online November 2015 in MECS
Music artists are ranked on a variety of criteria such as popularity, skill level and album sales. These ranks are important to the artists themselves, as they result in an increased fan base and popularity, and to the fans, who would like to see their favorite musicians in the top spots. Like other musicians, guitarists are ranked on their creativity, their skill at the instrument and their influence over other guitarists as well as the genre as a whole. A number of such best-of lists are available on the Internet. These lists are primarily generated through crowdsourcing, where fans vote for their favorite artists, and/or compiled by subject matter experts such as music journalists, critics or guitarists themselves. These lists have always been controversial and a source of argument among fans who do not find their favorite artist in the position they expected. In this paper we combined techniques from information extraction and graph mining to find the most influential rock guitarists. The influence of a guitarist was computed by considering the number of guitarists citing him/her as an influence and, in turn, their own influence. This information about influences is available in the biographical sketches of these guitarists on Wikipedia. The Wikipedia page for most guitarists lists the guitarists who influenced their playing. The information is usually available within the article in an unstructured form such as X cites X1, X2, …, Xn as influences. We extracted this information from the Wikipedia pages, identified the influencer and the influencee, and converted it to a directed graph where nodes represented guitarists and edges represented the influence relationship. The presented work makes two main contributions:
-
1. Using a quantitative method to find the most influential guitarist
-
2. Estimation of influence from the guitarist community itself, instead of fans
It should be noted that our method finds the most influential guitarists and not the best guitarist. The latter would require measurement of different performance indicators. Another important point to note is that the current work includes the guitarist articles in English Wikipedia only, but the techniques presented here can be easily modified to incorporate Wikipedia articles in other languages and other categories such as influential philosophers, musicians etc.
This paper is organized as follows. A review of related work is presented in section II. Section III describes the corpus creation process from Wikipedia. Extraction of influencee, influencer pairs is described in section IV. Section V briefly describes PageRank and its usage to rank guitarists. Results are presented in section VI.
-
II. Related Work
A number of magazines, music-related or otherwise, have published their own lists of the best guitarists. These include Rolling Stone, Time, Telegraph, Esquire, Guitar World, Revolver Mag etc. These lists are essentially generated manually using one or a combination of the following methods:
-
1. Music journalists rank the guitarists based upon their perceived influences
-
2. Users are asked to vote for their favorite guitarist
-
3. Guitarists are asked to vote for their favorite
-
A. The Lists
A brief overview of these lists is provided below. A comparison of results is provided in a later section of this paper.
-
1) Music Expert Compilation
The Telegraph compiled a list of the greatest guitarists of all time. No description of the method is provided, so we assume it was compiled by their staff [1]. Time magazine music critic Josh Tyrangiel compiled a list of the top 10 electric guitar players of all time [2]. Spin magazine staff compiled a list of the 100 greatest guitarists of all time [3]. The list created quite a controversy as it favored alternative rock guitarists of later times over the traditional names.
-
2) User Voting
Guitar World conducted a Readers Poll to find the 100 greatest guitarists of all time [4]. By their own admission, “the method was not perfectly scientific, with extra matchups some bizarre pairing and occasional omissions”. Gibson conducted their own poll to find the 50 greatest guitarists of all time [5]. To compensate for any omissions or other errors, they asked users to join the debate in the comments section of the article.
-
3) Guitarist Compilation
Rolling Stone magazine assembled a number of top guitarists and other experts and asked them to rank their favorites to generate a list of the 100 greatest guitarists of all time [6]. It is not clear, though, how the overall ranking was achieved. For example, the list has Jimi Hendrix as the greatest guitarist of all time, as ranked by Tom Morello of Rage Against the Machine. What is not described are the other contenders for the top spot or the other guitarists ranked by Mr. Morello.
-
B. Ranking People
There have been previous attempts to apply ranking algorithms such as PageRank or HITS to people. PageRank, combined with other measures, was used to rank people by their historical significance in Who’s Bigger [7]. The primary data source was Wikipedia, and the significance of a person was computed from five criteria applied to his/her Wikipedia article. Two of them were derived from PageRank, while the other three were the number of article views, the number of edits and the length of the article. The work was criticized for relying solely on Wikipedia to determine a person’s historical significance and for the cultural biases inherent in Wikipedia. To study the latter and the organization of concepts in Wikipedia, [8] used the PageRank and HITS algorithms. Using these algorithms, they performed a network analysis of Wikipedia, using its link structure to estimate the relevance of each article. The results were provided for the most relevant entries overall, followed by countries and cities, people and events. In another study, PageRank, along with two other algorithms, was used by [9] to investigate the interaction of cultures and top people in history. Their study revealed both the cultural dependence of local figures and the existence of global historical figures across different language editions. One of the results of their study was a list of the top 100 historical figures, based upon their appearance in different Wikipedia language editions. Recently, [10] used PageRank to rank cricket teams. The data were represented as a directed graph with a team as a node and an edge representing a match between two teams, with the losing team pointing towards the winning team. Besides ranking entities, data mining algorithms have been used in the entertainment industry to predict movie success [11], [12].
Our work differs from the previous attempts in two respects. First, it takes a quantitative approach; second, the influence is computed within the guitarist community and not among users. While the lists prepared by Rolling Stone and Guitar World also claim the latter, their approach is more qualitative than quantitative. Our approach relies on Wikipedia to identify the influences. Whether Wikipedia is a reliable source for this information is a question not addressed in this paper.
-
III. Corpus Creation
We employed a corpus-based approach to identify the influences of a given guitarist. This who-influenced-whom data was extracted from parts of a document that matched specific predefined patterns. The original source of our corpus was the English Wikipedia. At the time of corpus creation, the Wikipedia dump was slightly over 10 GB in compressed form and contained more than 3.6 million articles. We used the WikiExtractor Python script [13] developed by the Media lab to extract and clean text from the dump. The script does not require the dump file to be uncompressed first. The extracted articles are stored in multiple files of equal size; the size needs to be provided as a parameter to the script. We selected 4 MB as the size, which resulted in 2,429 files for the entire Wikipedia dump. We will refer to this collection as M. Each of the files in M contains multiple articles separated by a pair of
To filter the guitarist articles from the extracted text, we used the following four Wikipedia list pages:
-
1. List of guitarists
-
2. List of lead guitarists
-
3. List of slide guitarists
-
4. List of rhythm guitarists
We extracted the names of the guitarists and the links to their Wikipedia articles from each list page, combined the four lists into one and removed the duplicates. The resulting list contained 2,380 guitarists. To filter the guitarist articles from the extracted text, we first used a script that matched the name of each guitarist from the list against the title of each article. This method resulted in both false positives (other articles identified as guitarist articles) and false negatives (missed guitarist articles), as the lists contained the first and the last name of the guitarist while the article may carry the full name. To rectify the issue, we used the document id assigned to each Wikipedia article. To get the id for each guitarist, we scraped the Wikipedia pages for the URLs from the list. We were only able to locate 2,337 articles on Wikipedia against the 2,380 guitarist URLs present in our list. The likely cause is that some list entries carried a URL generated from the guitarist’s name although the article itself had not been created on Wikipedia. From each of these scraped Wikipedia pages, the id was determined using a simple string match and stored as
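The id-based filtering step can be sketched in Python as follows. This is a minimal illustration and not the script used in the paper: it assumes that each extracted article is wrapped in a `<doc id="…" url="…" title="…">` … `</doc>` block, which is how WikiExtractor output is commonly laid out, and the function names `iter_articles` and `filter_by_id` are hypothetical.

```python
import re
from typing import Dict, Iterator, Set, Tuple

# Assumed article header, e.g. <doc id="12" url="..." title="Jimi Hendrix">
DOC_OPEN = re.compile(r'<doc id="(\d+)"[^>]*title="([^"]*)"[^>]*>')
DOC_CLOSE = "</doc>"


def iter_articles(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, title, body) for every <doc>...</doc> block in text."""
    pos = 0
    while True:
        match = DOC_OPEN.search(text, pos)
        if match is None:
            return
        end = text.find(DOC_CLOSE, match.end())
        if end == -1:  # unterminated block: stop rather than guess
            return
        yield match.group(1), match.group(2), text[match.end():end].strip()
        pos = end + len(DOC_CLOSE)


def filter_by_id(text: str, wanted_ids: Set[str]) -> Dict[str, str]:
    """Keep only the articles whose document id is in wanted_ids."""
    return {
        doc_id: body
        for doc_id, _title, body in iter_articles(text)
        if doc_id in wanted_ids
    }
```

Matching on the document id rather than the title avoids the false positives and false negatives described above, since the id is unique to an article while a name may appear in several titles or in a longer form.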
-
IV. Identification of Influencer-Influencee Pairs
As the influence of a guitarist is estimated as a function of the number of guitarists citing him/her as an influence and their own influence, the first task after corpus creation was to identify these influencer-influencee pairs in the documents. This is essentially a named entity recognition task coupled with a filter that keeps only those parts of the documents that discuss the influences of a guitarist. This was achieved by applying sentence segmentation to each article and keeping only those sentences in which the word influence (including all its variations) was found. We also experimented with words of similar meaning such as inspire (including all its variations), but they did not bring any additional entries. The distribution of these influence sentences in articles is displayed in Fig.1. The mean and the five number summary for the same can be found in Table 1. Although we were able to find these influence sentences for only 37% of the guitarists, most of the famous guitarists were covered. Notable guitarists who were missed include Chris DeGarmo, formerly of Queensryche, and Adam Jones of Tool.
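The sentence filter can be sketched as below. This is an illustration under stated assumptions: the paper used Stanford CoreNLP for sentence segmentation, whereas this sketch substitutes a naive regular-expression splitter that will break on abbreviations and initials, and the name `influence_sentences` is hypothetical.

```python
import re
from typing import List

# Matches "influence", "influenced", "influences", "influential", ...
INFLUENCE = re.compile(r"\binfluenc\w*", re.IGNORECASE)

# Naive stand-in for a real sentence segmenter: split after ., ! or ?
# when followed by whitespace and an upper-case letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def influence_sentences(article: str) -> List[str]:
    """Return the sentences of an article that mention a form of 'influence'."""
    sentences = SENTENCE_BOUNDARY.split(article.strip())
    return [s for s in sentences if INFLUENCE.search(s)]
```

For example, given the text "Hendrix was born in Seattle. He was influenced by Muddy Waters.", only the second sentence is kept.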
Table 1. Summary of the Number of Influence Sentences per Article
Measure | Value
Minimum | 0
Q1 | 0
Median | 0
Mean | 0.709
Q3 | 1
Maximum | 13

-
Fig.1. Distribution of the Number of Influence Sentences in Articles
To identify the guitarist names in a sentence, we used named entity recognition. Sentence segmentation and named entity recognition were done using Stanford CoreNLP [14]. We kept only the entities tagged as PERSON or ORGANIZATION. To identify the influencer and the influencee in a sentence, we defined a set of regular expressions that captured the following patterns, with Xi as influencee and Yj as influencer.
-
1. X cite(s|d) Y1, Y2, …, Yn as (an) influence(s)
-
2. X was influenced by Y1, Y2, …, Yn
-
3. Y has been cited as an influence by X1, X2, …, Xn
-
4. Y influence on X1, X2, …, Xn …
-
5. Y1, Y2, …, Yn influenced hi(m|s) …
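The first three patterns above can be sketched as regular expressions in Python. This is a simplified illustration and not the expressions used in the paper: names are matched as free text instead of NER-tagged entities, patterns 4 and 5 are omitted, and the helper names `split_names` and `extract_pairs` are hypothetical.

```python
import re
from typing import List, Tuple

# Splits "A, B and C" or "A, B, and C" into individual names.
NAME_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")

# Each entry: (pattern, True if the single name "one" is the influencee).
PATTERNS = [
    # 1. X cite(s|d) Y1, ..., Yn as (an) influence(s)
    (re.compile(r"^(?P<one>.+?) cite[sd]? (?P<many>.+?) as (?:an )?influences?\.?$"), True),
    # 2. X was influenced by Y1, ..., Yn
    (re.compile(r"^(?P<one>.+?) was influenced by (?P<many>.+?)\.?$"), True),
    # 3. Y has been cited as an influence by X1, ..., Xn
    (re.compile(r"^(?P<one>.+?) has been cited as an influence by (?P<many>.+?)\.?$"), False),
]


def split_names(text: str) -> List[str]:
    """Split a comma/'and' separated list of names."""
    return [name.strip() for name in NAME_SEPARATOR.split(text) if name.strip()]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (influencee, influencer) pairs found in one sentence."""
    for pattern, one_is_influencee in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            one = match.group("one").strip()
            others = split_names(match.group("many"))
            if one_is_influencee:
                return [(one, other) for other in others]
            return [(other, one) for other in others]
    return []
```

Each pair is ordered as (influencee, influencer), matching the edge direction used later when the graph is built.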
-
A. Entity Resolution
Entity resolution refers to identifying and combining multiple mentions of the same named entity into one. It is customary in English to refer to a person by his/her last name only, e.g. Hendrix for Jimi Hendrix, once the full name has been mentioned. To resolve such cases in our data, we prepared a list of full names and matched entities consisting of a single name against the last names in the list. If a match was found, the single name was replaced by the full name. In the case of multiple name matches, we used the most likely name. To identify the most likely name, we ran the PageRank algorithm once without any entity resolution and computed an initial PageRank value for each guitarist. In the case of multiple name matches, the guitarist with the highest initial PageRank value was identified as the most likely candidate. For cases where the difference between ranks was small, e.g. Albert King and B. B. King, the data was corrected manually. Some other erroneous cases were also removed from the data. These include a guitarist citing himself/herself as an influence and two guitarists citing each other. The latter is a real possibility, but most of the cases that we manually analyzed pointed to an error in identifying influencee and influencer using the regular expressions.
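The last-name lookup described above might look like the following sketch. The function name `resolve_name` and the rank values in the usage example are hypothetical and chosen only for illustration; the manual correction of close cases is not modeled.

```python
from typing import Dict, Iterable


def resolve_name(mention: str,
                 full_names: Iterable[str],
                 initial_rank: Dict[str, float]) -> str:
    """Map a single-word mention to a full name via last-name matching.

    If several full names share the last name, the one with the highest
    initial PageRank value is chosen. Multi-word mentions and unmatched
    mentions are returned unchanged.
    """
    mention = mention.strip()
    if " " in mention:
        return mention
    candidates = [name for name in full_names if name.split()[-1] == mention]
    if not candidates:
        return mention
    return max(candidates, key=lambda name: initial_rank.get(name, 0.0))


# Usage with made-up initial ranks:
names = ["Jimi Hendrix", "Albert King", "B. B. King"]
ranks = {"Jimi Hendrix": 0.016, "Albert King": 0.004, "B. B. King": 0.009}
resolved = resolve_name("King", names, ranks)  # picks the higher-ranked King
```

Because the choice depends on an initial ranking computed before resolution, close calls such as the two Kings remain candidates for the manual review the text mentions.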
-
B. Guitarist Band Resolution
An analysis of the influencee-influencer pair data revealed that a number of guitarists cited bands as influences, in addition to or instead of individual guitarists. This, in turn, affects the ranking of a guitarist, as some guitarists would cite the band that he/she is/was associated with instead of the guitarist himself/herself, thereby reducing that guitarist’s rank. To resolve this issue, we hypothesized that a guitarist citing a band as an influence is citing the sound of the band as shaped by its guitarist and hence is/was influenced primarily by that guitarist. Testing this hypothesis was beyond the scope of our work. We devised a guitarist band resolution algorithm to replace band names with guitarist names in the influencee-influencer pairs. This algorithm made use of the same Wikipedia list pages that we employed earlier as an index to the guitarist articles. The list pages contain the name(s) of the band(s) that each guitarist has been associated with. Table 2 displays the mean and the five number summary of the number of guitarists that have been associated with a band. The distribution of the same is given in Fig.2. The maximum number of guitarists associated with a band in its history is 11, and the band is Guns N’ Roses. In the case of a band employing multiple guitarists, e.g. James Hetfield and Kirk Hammett in Metallica or Jeff Hanneman and Kerry King in Slayer, the band name was replaced by the guitarist with the higher initial PageRank value.
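A minimal sketch of the band-to-guitarist substitution follows. The mapping `band_members` is assumed to have been built from the Wikipedia list pages, the function name `resolve_band` is hypothetical, and the rank values in the usage example are invented for illustration.

```python
from typing import Dict, List


def resolve_band(name: str,
                 band_members: Dict[str, List[str]],
                 initial_rank: Dict[str, float]) -> str:
    """Replace a band name with its guitarist of highest initial PageRank.

    Names that are not known bands are returned unchanged.
    """
    members = band_members.get(name)
    if not members:
        return name
    return max(members, key=lambda guitarist: initial_rank.get(guitarist, 0.0))


# Usage with made-up initial ranks:
bands = {"Metallica": ["James Hetfield", "Kirk Hammett"]}
ranks = {"James Hetfield": 0.003, "Kirk Hammett": 0.005}
replacement = resolve_band("Metallica", bands, ranks)
```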
Table 2. Mean and Five Number Summary of the Number of Guitarists in a Band
Measure | Value
Minimum | 1
Q1 | 1
Median | 1
Mean | 1.493
Q3 | 2
Maximum | 11

Fig.2. Distribution of the Number of Guitarists in a Band
-
C. Filtering Non Guitarists
After the guitarist band resolution, our data consisted of 2868 pairs. At this stage another filter was applied to remove all the entities that were not listed as guitarists on the Wikipedia list pages. This step was important so that a guitarist’s influence is estimated by considering only other guitarists citing him/her as an influence, and not other musicians. After the removal of musicians who were not guitarists, the list consisted of 660 guitarists only.
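This filter amounts to keeping a pair only when both of its members appear in the guitarist list. A minimal sketch, with the hypothetical name `filter_guitarist_pairs`:

```python
from typing import Iterable, List, Set, Tuple


def filter_guitarist_pairs(pairs: Iterable[Tuple[str, str]],
                           guitarists: Set[str]) -> List[Tuple[str, str]]:
    """Keep only (influencee, influencer) pairs where both are listed guitarists."""
    return [
        (influencee, influencer)
        for influencee, influencer in pairs
        if influencee in guitarists and influencer in guitarists
    ]
```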
-
V. Guitarist Ranking
As mentioned earlier in this paper, the data were represented in the form of a directed graph. Each node in the graph represented a guitarist and an edge between nodes indicated the influence of one guitarist over the other. The edge pointed from the influencee to the influencer. The final graph consisted of 2824 nodes and 3814 edges.
Next we will provide a brief description of the PageRank algorithm, and how it was used to rank the guitarists.
-
A. PageRank
The PageRank algorithm [15] models a random surfer who can click on any of the outgoing links from a given page with equal probability. Thus a page with more incoming links has a better chance of being visited by this random surfer. The PageRank of a page therefore represents the probability of this random surfer reaching the page. Next we describe how the PageRank value for a guitarist was computed.
Let $u$ be a guitarist, $F_u$ be the set of guitarists $u$ cited as an influence ($u$’s influencers) and $B_u$ be the set of guitarists that cited $u$ as an influence ($u$’s influencees). Let $N_u = |F_u|$ be the number of $u$’s influencers. Using the PageRank algorithm, $u$’s rank $R(u)$ can be computed as
$$R(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{R(v)}{N_v} \tag{1}$$
In the above equation, N is the total number of guitarists and d is the damping factor which serves as a normalization constant.
From the equation it is clear that each guitarist received part of the influence score of all the guitarists he/she influenced.
We used the PageRank implementation available in the iGraph [16] package in R [17]. The input was provided in the edge format, where each line carried an influencee-influencer pair indicating an edge in the graph.
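For readers who want to see the computation spelled out, the following is a pure-Python power-iteration sketch of PageRank over an influencee-influencer edge list. It is not the igraph implementation used in the paper and may differ from it in detail. In particular, spreading the rank of guitarists with no listed influencers uniformly over all nodes is an assumption made here so that the ranks sum to one, and the default damping factor of 0.6 simply mirrors the value reported in the results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pagerank(edges: Iterable[Tuple[str, str]],
             d: float = 0.6,
             tol: float = 1e-10,
             max_iter: int = 1000) -> Dict[str, float]:
    """Compute PageRank for a directed graph given as (influencee, influencer) edges."""
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    if n == 0:
        return {}

    # out_links[v] = set of guitarists v cites as influences
    out_links = defaultdict(set)
    for influencee, influencer in edges:
        out_links[influencee].add(influencer)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by nodes without outgoing edges is shared equally.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - d) / n + d * dangling / n for u in nodes}
        for v in nodes:
            if out_links[v]:
                share = d * rank[v] / len(out_links[v])
                for u in out_links[v]:
                    new_rank[u] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Usage on a toy graph: two guitarists cite Jimmy Page, who cites Muddy Waters.
toy_edges = [("Slash", "Jimmy Page"),
             ("Joe Perry", "Jimmy Page"),
             ("Jimmy Page", "Muddy Waters")]
toy_ranks = pagerank(toy_edges)
```

In the toy graph, Muddy Waters ends up ranked above Jimmy Page despite having a single citation, because that citation comes from a highly ranked guitarist who cites nobody else.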
-
VI. Results
Some of the names in the list were unexpected and interesting, while most of the other names are compatible with other, manually created lists available on the Internet. The mean and the five number summary of degree-in (number of influencees of a guitarist), degree-out (number of influencers of a guitarist) and PageRank are provided in Table 3. The degree-in, degree-out and PageRank followed a power law distribution, as is evident from Figure 3, Figure 4 and Figure 5, respectively.
Table 3. Mean and the Five Number Summary of the Degree-In, Degree-Out and the PageRank
Measure | Degree In | Degree Out | PageRank
Minimum | 0 | 0 | 0.000880
Q1 | 0 | 0 | 0.000880
Median | 1 | 1 | 0.000968
Mean | 2.259 | 2.259 | 0.001515
Q3 | 2 | 3 | 0.001425
Maximum | 72 | 18 | 0.016500

Fig.3. Plot of Degree-In Displaying a Power Law Distribution

Fig.4. Plot of Degree-Out Displaying a Power Law Distribution

Fig.5. Plot of PageRank Displaying a Power Law Distribution
Table 4 lists the comparison of the top 10 guitarists determined by our algorithm with those of Rolling Stone, Guitar World, Gibson and the Telegraph. Six of the top ten guitarists from our list can be found in the other lists at different positions. The most influential guitarist of all time, as determined by our list and others, is Jimi Hendrix, with more than seventy different guitarists citing him as an influence. Some of the names in our list are unexpected and are not found in other lists. The most prominent of those is Josh White at the third spot. Rather than an error, we would call it a discovery, as he is cited by 19 guitarists as an influence.
One important point to note is that PageRank takes the out-degree of the influencee, i.e. the number of influencers cited by the influencee, into account. The PageRank of the influencee is evenly divided among all of his/her influencers. One of the reasons for the higher rank of Josh White is that the average out-degree of his influencees is 3.11, which is lower than that for Hendrix, Clapton or Page. Each of the latter has an average influencee out-degree of more than five. What this translates to is a larger share of each influencee’s PageRank being transmitted to White, since it is distributed among fewer influencers. The inclusion of Charlie Christian at the number two spot can be explained by the fact that the guitarists citing him as an influence include two of the top ten inductees, namely Django Reinhardt and Wes Montgomery. The list of the 100 most influential guitarists as measured by our method can be found in Table 5.
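As an illustrative calculation only, suppose an influencee of White and an influencee of Hendrix have the same PageRank $r$, and that their out-degrees equal the averages quoted above (3.11, and 5 taken as a lower bound). Since each influencer receives an equal share of the damped rank of the influencee, the ratio of what the two influencees pass on is:

```latex
\frac{d \, r / 3.11}{d \, r / 5} = \frac{5}{3.11} \approx 1.61
```

Under these assumptions, a citation from one of White’s influencees carries at least about 1.6 times the weight of a citation from one of Hendrix’s. The equal rank $r$ is a simplification; the actual ranks of the influencees differ.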
Even though degree-in and PageRank are highly correlated, with a correlation coefficient of 0.85, the effect of PageRank can still be observed in the bottom right quadrant of Figure 6, where cases with low degree-in but high PageRank values are present.
We experimented with different values of the damping factor, which gave slightly varying results. The reported results use a damping factor of 0.6.

Fig.6. PageRank Vs Degree In
-
VII. Conclusion and Future Work
We presented a method to rank rock guitarists based upon the number of influence citations. The citation data were mined from the Wikipedia articles. The unstructured data from the articles went through different text processing and information extraction steps to be converted into a structured format of a directed graph. To rank the guitarists we used the PageRank algorithm that takes into account the number of guitarists citing someone as an influence and their own influences as well. The results were compared against other, manually generated, lists, revealing conformities as well as differences. The main contribution is the application of a quantitative method in the ranking process. The method can easily be extended to incorporate Wikipedia articles written in other languages and people in other categories.
In the future we would like to estimate the total error in predicting the rank of a guitarist. Different sources of error exist in the system, but the most important of them are the NER component used and the regular expressions to identify
Acknowledgment
The author wishes to thank David Gilmour, Tony Iommi and Jimmy Page for continued support and inspiration.
References
- T. Staff, "The greatest guitarists of all time, in pictures," [Online]. Available: http://www.telegraph.co.uk/culture/culturepicturegalleries/9618556/The-greatest-guitarists-of-all-time-in-pictures.html. [Accessed 18 02 2015].
- J. Tyrangiel, "The 10 Greatest Electric Guitar Players," 18 02 2015. [Online]. Available: http://content.time.com/time/photogallery/0,29307,1916544,00.html.
- S. Staff, "SPIN's 100 Greatest Guitarists of All Time," SPIN, 3 05 2012. [Online]. Available: http://www.spin.com/articles/spins-100-greatest-guitarists-all-time/. [Accessed 18 2 2015].
- G. W. Staff, "Readers Poll Results: The 100 Greatest Guitarists of All Time," Guitar World, 10 10 2012. [Online]. Available: http://www.guitarworld.com/readers-poll-results-100-greatest-guitarists-all-time. [Accessed 18 02 2015].
- "Gibson.com Top 50 Guitarists of All Time – 10 to 1," 28 05 2010. [Online]. Available: http://www2.gibson.com/news-lifestyle/features/en-us/top-50-guitarists-528.aspx. [Accessed 18 02 2015].
- D. Browne, P. Doyle, D. Fricke, W. Hermes, B. Hiatt, A. Light, R. Tannenbaum and D. Wolk, "100 Greatest Guitarists," Rolling Stone, 23 11 2011. [Online]. Available: http://www.rollingstone.com/music/lists/100-greatest-guitarists. [Accessed 18 2 2015].
- S. Skiena and C. Ward, Who's Bigger? Where Historical Figures Really Rank, Cambridge University Press, 2013.
- F. Bellomi and R. Bonato, "Network analysis for Wikipedia," in Proceedings of Wikimania 2005, The First International Wikimedia, 2005.
- Y.-H. Eom, P. Aragón, D. Laniado, A. Kaltenbrunner, S. Vigna and D. Shepelyansky, "Interactions of cultures and top people of Wikipedia from ranking of 24 languages," Submitted to PLOS ONE, 2014.
- A. Daud, F. Muhammad, H. Dawood and H. Dawood, "Ranking Cricket Teams," Information Processing & Management, vol. 51, no. 2, pp. 62-73, 03 2015.
- D. Barman, R. Singha and N. Chowdhury, "Prediction of Possible Business of a Newly Launched Film using Ordinal Values of Film-genres," International Journal of Intelligent Systems and Applications(IJISA), vol. 5, no. 6, pp. 53-60, 2013.
- R. Sharda and D. Delen, "Predicting box-office success of motion pictures with neural networks," Expert Systems with Applications, vol. 30, pp. 243-254, 2006.
- M. Lab, "WikiPedia Extractor".
- C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014.
- L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1999.
- G. Csardi and T. Nepusz, "The iGraph Software Package for Complex Network Research," InterJournal, vol. Complex Systems, p. 1695, 2006.
- R. D. C. Team, R: A Language and Environment for Statistical Computing, Vienna: R Foundation for Statistical Computing, 2008.
- J. Finkel, T. Grenager and C. Manning, "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 2005.