In Search of Lost Profiles:
The Reliability of VKontakte Data and its Importance in Educational Research
I. Smirnov, E. Sivak, Y. Kozmina
Received in March 2016
Ivan Smirnov
Research Assistant, Institute of Education, National Research University Higher School of Economics. Email: ibsmirnov@hse.ru
Elizaveta Sivak
Research Fellow, Institute of Education, National Research University Higher School of Economics. Email: esivak@hse.ru
Yana Kozmina
Junior Researcher, Institute of Education, National Research University Higher School of Economics. Email: ikozmina@hse.ru
Address: 20 Myasnitskaya St, 101000 Moscow, Russian Federation.
Abstract. The potential of VKontakte (VK), the Russian equivalent of Facebook, as a data source is now acknowledged in educational research, but little is known about the reliability of data obtained from this social network and about its sampling bias. Our article investigates the reliability of VK data, using the examples of a secondary school (766 students) and a university (15,757 students). We describe the procedure of matching VK profiles to real students.
Direct matching permitted us to identify the profiles of around 18% of students. A special technique offered in the article increased this number up to 88% for school students and up to 93% for university students. We compared age, gender and GPA of identified students and those whom we did not find on VK. We also compared the structure of social relationships, retrieved from VK data, to the expected structure of students' social ties. We found that the structure of "virtual" social relationships reproduced both the socio-demographic division of students into grades, years of study or majors, and the spatial division into different school buildings or university campuses. To our knowledge, it is the first study of this kind and scale based on VK data. It contributes to the understanding of how reliable data from this SNS is, how its accuracy can be improved, and how it can be used in educational research.
Keywords: social networks, VKontakte, social network analysis, data reliability, friendship ties, academic achievements, school, university.
DOI: 10.17323/1814-9545-2016-4-106-122
We would like to thank our anonymous reviewer from Voprosy obrazovaniya for her/his valuable comments.
Social networking services (SNS) have become an integral part of the lives of millions of people who use them to communicate with friends, exchange ideas, find jobs, organize events, and do many other things [Boyd, Ellison 2008]. Facebook, the largest social networking site in the world, was founded only 12 years ago and now has around 1.5 billion monthly active users¹. It is no surprise that researchers have become concerned about how social networking sites influence various spheres of life, including education [Hew 2011; Aydin 2012; Wilson, Gosling, Graham 2012; Tess 2013; Koroleva 2015].
The particular attention towards social networking services also has to do with the fact that they revolutionized the availability of demographic and social data [Boyd, Ellison 2008]. Even the most comprehensive educational studies rarely involve more than a few tens of thousands of people, and most are restricted to much smaller samples. The largest international student survey, the Programme for International Student Assessment (PISA), covered 510,000 school students from 62 countries in 2012 [OECD 2014]. Meanwhile, Kramer and his colleagues [Kramer, Guillory, Hancock 2014] published the results of a Facebook experiment involving 700,000 users. The most far-reaching Facebook-based experiment involved 61 million users [Bond et al. 2012].
Not only do social media allow for doing research on a previously unavailable scale, but they also provide answers to new questions. Thus, friendship networks and peer effects have traditionally been investigated using surveys [Lomi et al. 2011; Flashman 2012; Ivaniushina, Alexandrov 2013; Dokuka, Valeeva, Yudkevich 2015]. However, this method provides no opportunity to establish and analyze the relations among students from different educational institutions. Such relations were a blind spot for education researchers until recently, but today they can be identified and explored using social media data. Social networking services also open the door to longitudinal data on social relations, providing access not only to status information but also to the whole history of user interactions [Lazer et al. 2009].
International research has traditionally focused on Facebook as the most popular SNS in the world. Seventy-one percent of American Internet users have Facebook profiles [Duggan et al. 2015]. The percentage is even higher in specific categories: in some universities, 96% of students use Facebook [Martin 2009]. Researchers explore how Facebook usage influences social integration of students [Madge et al. 2009], their social capital [Ellison, Steinfield, Lampe 2007; Steinfield, Ellison, Lampe 2008] and psychological well-being [Steinfield, Ellison, Lampe 2008]. VKontakte (VK) is the Russian equivalent of Facebook, and its potential as a data source has also begun to attract the attention of researchers. In particular, studies examine how time spent on VK affects exam performance [Krasilnikov, Semenova 2014], analyze how friendship networks develop [Dokuka, Valeeva, Yudkevich 2015], and show how VK data can be used to explore academic mobility [Alexandrov, Karepin, Musabirov 2016].
¹ Facebook (2015) Statistics. Facebook, Palo Alto, CA. http://newsroom.fb.com/company-info
However, there have been few examples of using VK data for educational research purposes so far. One reason is that little is known about the reliability of data obtained from this social network and about its sampling bias. For instance, judging by VK data, School No. 1 of St. Petersburg would have 3,000 graduates in 2019, an implausible figure. Difficulties also arise when researchers try to match listed students directly to their social media profiles. School and university students do not always specify their educational institutions and often use alternative forms of their names.
Our article investigates the reliability of VK data, using the examples of a Moscow school and a Moscow university. At the first stage, we obtained lists of school students indicating their GPAs, gender, grade and the school building they studied in, and lists of university students indicating their year of study, major and performance. Next, we searched for the VK profiles of those students. A direct matching (exact matches between full names and educational institutions on VK and in real life) only permitted us to identify the profiles of around 18% of students. By using information on friendship ties and a dictionary of first names with alternative forms, we increased the number up to 88% for school students and up to 93% for university students. We compared groups of students identified by different means and those whom we did not find on VK. We also added information on friendship ties and compared the reconstructed virtual university network to the real one.
We managed to demonstrate that highly reliable data can be retrieved from VK and that the structures of social networks reconstructed from this data are consistent with those of educational institutions, including the division of schools into buildings and grades and the division of universities into campuses and majors. To our knowledge, it is the first study of this kind and scale based on VK data. Its findings will help education researchers use the potential of social media more effectively.
1. VK data search software and procedure
By signing up for VK, Internet users accept the VK Terms of Service, under which they "understand that the personal information posted by the User may become available to other Site Users and Internet users, be copied and disseminated by such users"². VK, in turn, provides an API (application programming interface) that allows automated search queries and returns information on users who have not chosen to hide it.
² VK (2016) VK.com Privacy Policy. https://vk.com/privacy
The software we developed makes requests to the VK API and obtains a list of all users who specified the given educational institution, within the predetermined age range. Then, it matches the obtained profiles to the list of students provided by the educational institution, by full name. However, only a small percentage of students can be found on VK by such direct matching. To extract more information from the SNS, we applied two additional techniques.
First, we developed a dictionary of first names with alternative forms. If the software found Latin symbols in the user's surname, the operator would be offered a translation. This helped us to identify the users who wrote their surnames with Latin symbols, e. g. "Nabokov" instead of "Набоков". If the software found the same surname in the list of users as in that of students, it would ask the operator whether the first names were matching. As a result, we identified those users who used short forms of their first names, e. g. "Vova Nabokov" instead of "Vladimir Nabokov". All translations and recorded name matches (or mismatches) were saved to a special dictionary, so the operator did not have to answer the same question again.
Second, the software searched not only among the users who specified the given educational institution but also among the users who had a lot of friends from there. This technique, traditional for social network analysis, is used, for example, in [Mislove et al. 2010].
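The name-matching step can be illustrated with a minimal Python sketch. It is not the software used in the study: the transliteration table, the `ALTERNATIVE_FORMS` dictionary, the function names and the sample data are all assumptions made for illustration, and the operator confirmation described above is replaced by a fixed dictionary.

```python
# Illustrative sketch only: names, tables and sample data are hypothetical.

# Naive per-letter Cyrillic-to-Latin table, so that "Набоков" and "Nabokov"
# are reduced to the same key.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

# Alternative (short) forms mapped to the full first name, both in Latin.
# In the study such pairs were confirmed by an operator and then stored.
ALTERNATIVE_FORMS = {"vova": "vladimir", "volodya": "vladimir"}


def to_latin(text):
    """Lowercase the text and transliterate Cyrillic letters to Latin."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.strip().lower())


def canonical_key(first_name, surname):
    """Build a comparison key that ignores alphabet and short name forms."""
    first = to_latin(first_name)
    first = ALTERNATIVE_FORMS.get(first, first)
    return (first, to_latin(surname))


def match_profiles(students, profiles):
    """Return {student index: profile id} for unambiguous name matches."""
    by_key = {}
    for profile_id, first, last in profiles:
        by_key.setdefault(canonical_key(first, last), []).append(profile_id)
    matches = {}
    for index, (first, last) in enumerate(students):
        candidates = by_key.get(canonical_key(first, last), [])
        if len(candidates) == 1:  # skip ambiguous or missing matches
            matches[index] = candidates[0]
    return matches


students = [("Владимир", "Набоков"), ("Анна", "Ахматова")]
profiles = [(101, "Vova", "Nabokov"), (102, "Анна", "Ахматова"), (103, "Ivan", "Petrov")]
matches = match_profiles(students, profiles)
```

Reducing both lists to a common key keeps direct matching and dictionary-based matching in one code path. A per-letter table is a simplification, since users transliterate their names in many different ways.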
In order to protect personal data of school students, we developed a customized version of the software that was launched locally on school computers and deleted all the names and VK logins after completing the procedure. Only completely anonymized data was used for further research. Information on university students (lists of students by majors and their academic achievements) was retrieved from publicly available sources (the university website). After the matching procedure, the student names and logins were deleted and only anonymized data was used.
The matching procedure revealed three groups of students: those not found on VK, those found by direct matching, and those found by our own method. We compared these groups based on their size, students' gender, age and academic performance. We used a chi-square test and a t-test to calculate the p-values.
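As an illustration of such a group comparison, the sketch below computes a Pearson chi-square test for a 2x2 table (for example, gender by found / not found on VK) using only the standard library. The function name and the counts are hypothetical and are not taken from the study. The t-test is omitted because its p-value requires the t distribution, which would normally come from a statistics library.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test without continuity correction for the
    table [[a, b], [c, d]]; returns (statistic, p_value), 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for illustration only: girls and boys among
# students found on VK and among those not found.
found_girls, found_boys = 310, 364
missing_girls, missing_boys = 44, 48
statistic, p_value = chi_square_2x2(found_girls, found_boys,
                                    missing_girls, missing_boys)
```

A large p-value here would mean that the gender composition of the two groups cannot be distinguished at the given sample size.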
We also constructed friendship networks and compared them to the structures of educational institutions. We expected that students of the same grade, year of study or major would be closely interconnected. To quantitatively express the effect of such group-based division, we calculated modularity Q. This value is the proportion of friendship ties connecting students within one group (grade, major, etc.) minus the proportion of such ties expected if they were distributed randomly. Q = 0 means the absence of any tendency to form ties within groups. The closer Q is to 1 (the maximum value), the denser the connections between the nodes within groups. In practice, Q typically takes on values from 0.3 to 0.7, while higher values are rare [Newman, Girvan 2004].
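For a given division into groups, modularity can be computed as the sum over groups of e_g / m - (d_g / 2m)^2, where m is the total number of ties, e_g is the number of ties inside group g and d_g is the total degree of its nodes. The Python sketch below implements this formula for an undirected network. The function name and the toy network are assumptions for illustration and do not reproduce the study's data.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Modularity Q of an undirected network for a fixed partition.

    edges    -- list of (u, v) pairs, each tie listed once
    group_of -- dict mapping every node to its group label
    """
    m = len(edges)
    internal = defaultdict(int)    # ties with both ends in the same group
    degree_sum = defaultdict(int)  # total degree of the nodes of each group
    for u, v in edges:
        degree_sum[group_of[u]] += 1
        degree_sum[group_of[v]] += 1
        if group_of[u] == group_of[v]:
            internal[group_of[u]] += 1
    return sum(internal[g] / m - (degree_sum[g] / (2 * m)) ** 2
               for g in degree_sum)


# Toy network: two "grades" of three friends each, joined by a single tie.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
group_of = {"a": 5, "b": 5, "c": 5, "x": 6, "y": 6, "z": 6}
q = modularity(edges, group_of)
```

In this toy network each group holds 3 of the 7 ties and half of the total degree, so Q = 2 * (3/7 - 1/4) = 5/14, or about 0.36.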
Table 1. Percentage of students whose VK profiles were found using the proposed methods

                 Dictionary of first names with alternative forms
Friend list      No       Yes
No               18       27
Yes              57       88
Table 2. Percentages of identified VK users who did not indicate their school and/or used alternative forms of their names, by age (grade)

Grade                                                                     5    6    7    8    9    10   11
Percentage of students identified (%)                                     85   89   88   90   88   91   85
Percentage of students who did not indicate their school (%)              64   72   69   74   70   58   72
Percentage of students who used alternative forms of their names (%)      39   36   29   33   33   31   38
2. Data reliability
2.1. School
Using the VK API, we found 908 users claiming to be under 18 and to be students of the given school. Meanwhile, the school has only 766 students in grades 5-11, according to the list. This means that some of these users provided false information about themselves.
The number of VK friends claiming to be students of the same school can be an effective tool in identifying the real profiles of school students. Out of 458 VK users with no friends from the same school, only four (i. e. less than 1%) actually study in that school. Among the top 100 users with the highest number of friends in the given school, at least 83% are students of that school (Table 1).
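The friend-count criterion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's software: the function names, the `min_friends` threshold and the toy data are hypothetical. The same count serves two purposes: it drops users who claim the school but have no friends there, and it surfaces users who do not mention the school but have several friends there.

```python
def count_school_friends(friends, claimed):
    """For every user, count the friends who claim the given school.

    friends -- dict mapping a user id to the list of that user's friend ids
    claimed -- set of user ids who specified the school in their profile
    """
    return {user: sum(1 for friend in friend_list if friend in claimed)
            for user, friend_list in friends.items()}


def likely_students(friends, claimed, min_friends=1):
    """Users with at least min_friends friends claiming the school."""
    counts = count_school_friends(friends, claimed)
    return {user for user, n in counts.items() if n >= min_friends}


# Toy data: user 9 claims the school but has no friends there,
# user 4 does not claim it but is friends with two users who do.
claimed = {1, 2, 3, 9}
friends = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3], 9: []}
candidates = likely_students(friends, claimed, min_friends=2)
```

Candidates selected this way would still have to be matched against the list of students by name before being treated as identified.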
The resulting coverage can be compared to a study analyzing Facebook profiles of American students, where the second-wave coverage was 84.6%, according to the publicly available data [Lewis et al. 2008].
Table 2 compares groups of students based on their age (grade). Approximately equal percentages of students were found in the social network for all age cohorts. The incidence of using alternative forms of names and of not indicating the school in the profile also does not change from grade to grade. None of the differences in the table approaches statistical significance: the p-values calculated by a chi-square test are above 0.5.
Table 3. Groups of school students differing in the way of presenting their personal data on VK, by gender and academic performance

                                                      Girls (%)   GPA
Found on VK                                           46          3.80
Not found on VK                                       48          3.79
Those who did not indicate their school               48          3.77
Those who used alternative forms of their names       50          3.79
Table 4. Percentages of identified VK users who used alternative forms of their names, by age (year of study)

Year                                                                      1st   2nd   3rd   4th
Percentage of students identified (%)                                     92    94    94    93
Percentage of students who used alternative forms of their names (%)      30    32    32    34
Table 5. Groups of university students differing in the way of presenting their personal data on VK, by gender and academic performance

                                                      Girls (%)   GPA
Found on VK                                           59          7.34
Not found on VK                                       58          7.13
Those who used alternative forms of their names       71          7.37
Similarly, no gender or GPA gap is observed between the groups of students found on VK, not found on VK, those who did not indicate their school, and those who used alternative forms of their names (Table 3), p-values being above 0.5.
2.2. University
Similar results were obtained for university students. Out of 15,757 students, 93% were found on VK. This value varies from 75% to 100%, depending on the major.
There is no age difference between students found and not found on VK or between those using alternative and given names, yet students not found on VK perform on average worse (p-value < 10^-8), and girls use alternative names more often than boys (p-value < 10^-11).
Alternative name usage is different between school and university students. Twenty-seven percent of all alternative forms of names used by university students are typed using the Latin alphabet, while the proportion is only 8% among school students.
Figure 1. The VK friendship network reproduces school division into grades. Students of the same grade are mostly connected with friendship ties. The wider the age gap, the less likely students are to be friends
Figure 2. The VK friendship network reproduces school division into buildings
3. Friendship network structure
3.1. School
We constructed a friendship network for all school students identified on VK (Fig. 1). We used the ForceAtlas2 graph layout algorithm and Gephi software for network visualization [Jacomy et al. 2014]. The more densely connected the nodes are, the closer the algorithm places them to one another. The resulting network structure corresponds to the division into grades (modularity Q = 0.47), with the distance between grades growing with the age gap and being the greatest between the youngest and the oldest grades. The friendship network additionally breaks into two major clusters corresponding to the different buildings of the recently merged schools, Q = 0.35 (Fig. 2).
3.2. University
The VK friendship network reproduces division into years of study, Q = 0.58 (Fig. 3), campuses, Q = 0.32, and majors, Q = 0.68 (Fig. 4).
Figure 3. The VK friendship network reproduces major division into years of study. The wider the age gap, the less likely students are to be friends
Figure 4. The VK friendship network reproduces university division into majors and campuses located in different cities. The figure shows the friendship network of fourth-year students. Campuses are highlighted in different colors. The visible clusters within the campuses correspond to majors
4. The prospects of using VK data in educational research
VK as a source of data offers a huge potential for educational research. However, using this data is associated with certain methodological difficulties. The results of our study allow us to provide specific recommendations on how to overcome those difficulties.
It makes sense to exclude users with no VK friends from the same educational institution from the list of users claiming to be students of that institution. Only 1% of such users actually attend the given institution.
When matching the list of students to the list of VK users, alternative forms of names should be considered, as they are used by 35% of students. An effective identification tool is the additional search in friends of identified users: 69% of students do not indicate their educational institution in their profiles.
When using social media data, special attention should be paid to the potential sampling bias. For instance, it can be expected that middle school students will be represented less on VK than high school students, or that lower-performing students will provide incorrect information more often. Yet we found no significant gender, age or performance differences between the students of grades 5-11 found and not found on VK. Among university students, the exceptions are a slightly lower GPA of those not identified on VK and a more frequent use of alternative forms of names by girls.
The total coverage of 88% for school students and 93% for university students indicates that the SNS is used by nearly all students. It would be interesting to reproduce this result using a larger sample and, particularly, to compare different regions and cities.
Our findings also confirm the value of information on VK friendship ties. We demonstrate that the structure of these ties correlates with the structure of the educational institution, reproducing not only the division of students into grades, years of study and majors, but also the spatial structure, such as the division of school into buildings.
Social networking services allow for a new perspective on traditional educational research topics. Since the late 1970s, researchers have been developing the tradition of studying social and cultural capital [Bourdieu 1986; Coleman 1988; Putnam 2001], and these constructs have also proved significant in educational research [DiMaggio 1982; Goddard 2003; Lareau, Weininger 2003]. Special emphasis is placed on the social reproduction of inequality [Bourdieu, Passeron 1990; Stanton-Salazar, Dornbusch 1995]. Today, we have a unique opportunity to test the sociological theories on newly available extensive empirical data.
Information on school students' cultural capital can be reconstructed from their specified interests, subscriptions and VK pages followed, all of which characterize their tastes and cultural preferences [Liu 2007; Lewis et al. 2012]. As for the social capital, SNS data allows for identifying weak ties (being friends on VK) as well as strong ones (comments on each other's posts, likes, etc.). Such data appears to be much more comprehensive and detailed than results of sociometric studies, which hardly ever go beyond contacts within one grade, ignoring cross-age and inter-school connections.
Social networking sites make it possible to investigate the relationship between social and cultural capital and academic achievements at both school and individual levels. Not only can geographical and social segregation be revealed and tracked online, but also the mechanisms of inequality reproduction can be studied: how students influence each other (peer effects, influence of friends on students' attitudes), how social and cultural capital affects the choice of educational trajectories (transfer to another school, transfer from school to university), etc.
The use of social media data not only opens new doors for education researchers but also raises new ethical questions. The availability of information on users no longer depends exclusively on what they decide to make available. For instance, it is possible to infer quite accurately the users' university, graduation year and major [Mislove et al. 2010], sexual orientation [Bhattasali, Maiti 2015], romantic partners [Backstrom, Kleinberg 2014], or ideological affiliations [Bakshy, Messing, Adamic 2015]. In our study, we show that even "naive" means are enough to determine the school of students who decided not to self-report it on VK. Advanced machine learning algorithms are likely to do it even better. SNS data often has to be merged with some additional information obtained from publicly available sources or educational institutions. With such matching, special attention should be paid to personal data anonymization in order to ensure privacy.
References
Alexandrov D., Karepin V., Musabirov I. (2016) Educational Migration from Russia to China: Social Network Data. Proceedings of the 8th ACM Conference on Web Science, May 22 to May 25, 2016, Hannover, Germany, pp. 309-311.
Aydin S. (2012) A Review of Research on Facebook as an Educational Environment. Educational Technology Research and Development, vol. 60, no 6, pp. 1093-1106.
Backstrom L., Kleinberg J. (2014) Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook. Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, February 15-19, 2014, Baltimore, Maryland, USA, pp. 831-841.
Bakshy E., Messing S., Adamic L. A. (2015) Exposure to Ideologically Diverse News and Opinion on Facebook. Science, vol. 348, no 6239, pp. 1130-1132.
Bhattasali N., Maiti E. (2015) Machine "Gaydar": Using Facebook Profiles to Predict Sexual Orientation. Available at: http://cs229.stanford.edu/proj2015/019 report.pdf (accessed 10 October 2016).
Bond R. M., Fariss C. J., Jones J. J., Kramer A. D., Marlow C., Settle J. E., Fowler J. H. (2012) A 61-Million-Person Experiment in Social Influence and Political Mobilization. Nature, vol. 489, no 7415, pp. 295-298.
Bourdieu P. (1986) The Forms of Capital. Cultural Theory: An Anthology, Malden, MA: Wiley-Blackwell, pp. 81-93.
Bourdieu P., Passeron J. C. (1990) Reproduction in Education, Society and Culture (Theory, Culture & Society). London: Sage Publications.
Boyd D. M., Ellison N. B. (2008) Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, vol. 13, no 1, pp. 210-230.
Christakis N. A., Fowler J. H. (2013) Social Contagion Theory: Examining Dynamic Social Networks and Human Behavior. Statistics in Medicine, vol. 32, no 4, pp. 556-577.
Coleman J. S. (1988) Social Capital in the Creation of Human Capital. American Journal of Sociology, vol. 94, no 1, pp. 95-120.
DiMaggio P. (1982) Cultural Capital and School Success: The Impact of Status Culture Participation on the Grades of US High School Students. American Sociological Review, vol. 47, no 2, pp. 189-201.
Dokuka S., Valeeva D., Yudkevich M. (2015) Koevolyutsiya sotsialnykh setey i akademicheskikh dostizheniy studentov [Co-Evolution of Social Networks and Student Performance]. Voprosy obrazovaniya/Educational Studies Moscow, no 3, pp. 44-65.
Dokuka S., Valeeva D., Yudkevich M. (2015) Formation and Evolution Mechanisms in Online Network of Students: The VKontakte Case. Analysis of Images, Social Networks and Texts (eds M. Y. Khachay, N. Konstantinova, A. Panchenko, D. I. Ignatov, V. G. Labunets), pp. 263-274.
Duggan M., Ellison N. B., Lampe C., Lenhart A., Madden M. (2015) Social Media Update 2014. Available at: http://www.pewinternet.org/2015/01/09/social-media-update-2014/ (accessed 10 October 2016).
Ellison N. B., Steinfield C., Lampe C. (2007) The Benefits of Facebook "Friends:" Social Capital and College Students' Use of Online Social Network Sites. Journal of Computer-Mediated Communication, vol. 12, no 4, pp. 1143-1168.
Flashman J. (2012) Academic Achievement and Its Impact on Friend Dynamics. Sociology of Education, vol. 85, no 1, pp. 61-80.
Goddard R. D. (2003) Relational Networks, Social Trust, and Norms: A Social Capital Perspective on Students' Chances of Academic Success. Educational Evaluation and Policy Analysis, vol. 25, no 1, pp. 59-74.
Hew K. F. (2011) Students' and Teachers' Use of Facebook. Computers in Human Behavior, vol. 27, no 2, pp. 662-676.
Ivaniushina V., Alexandrov D. (2013) Antishkolnaya kultura i sotsialnye seti shkolnikov [Anti-School Culture and Social Networks in Schools]. Voprosy obrazovaniya/Educational Studies Moscow, no 2, pp. 233-251.
Jacomy M., Venturini T., Heymann S., Bastian M. (2014) ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLoS ONE, vol. 9, no 6, e98679.
Koroleva D. (2015) Ispolzovanie sotsialnykh setey v obrazovanii i sotsializatsii podrostka: analiticheskiy obzor empiricheskikh issledovaniy (mezhdunarodny opyt) [Using Social Networks in Education and Socialization of Teenagers: Analytical Review of Empirical Studies (International Experience)]. Psikhologicheskaya nauka i obrazovanie, vol. 20, no 1, pp. 28-37.
Kramer A. D., Guillory J. E., Hancock J. T. (2014) Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks. Proceedings of the National Academy of Sciences, vol. 111, no 24, pp. 8788-8790.
Krasilnikov A., Semenova M. (2014) Do Social Networks Help to Improve Student Academic Performance? The Case of Vk.com and Russian Students. Economics Bulletin, vol. 34, no 2, pp. 718-733.
Lazer D., Pentland A. S., Adamic L., Aral S., Barabasi A. L., Brewer D., Jebara T. (2009) Life in the Network: The Coming Age of Computational Social Science. Science, vol. 323, no 5915, pp. 721-723.
Lareau A., Weininger E. B. (2003) Cultural Capital in Educational Research: A Critical Assessment. Theory and Society, vol. 32, no 5-6, pp. 567-606.
Lewis K., Gonzalez M., Kaufman J. (2012) Social Selection and Peer Influence in an Online Social Network. Proceedings of the National Academy of Sciences, vol. 109, no 1, pp. 68-72.
Lewis K., Kaufman J., Gonzalez M., Wimmer A., Christakis N. (2008) Tastes, Ties, and Time: A New Social Network Dataset Using Facebook.com. Social Networks, vol. 30, no 4, pp. 330-342.
Liu H. (2007) Social Network Profiles as Taste Performances. Journal of Computer-Mediated Communication, vol. 13, no 1, pp. 252-275.
Lomi A., Snijders T. A., Steglich C. E., Torlö V. J. (2011) Why Are Some More Peer than Others? Evidence from a Longitudinal Study of Social Networks and Individual Academic Performance. Social Science Research, vol. 40, no 6, pp. 1506-1520.
Madge C., Meek J., Wellens J., Hooley T. (2009) Facebook, Social Integration and Informal Learning at University: 'It Is More for Socialising and Talking to Friends about Work than for Actually Doing Work'. Learning, Media and Technology, vol. 34, no 2, pp. 141-155.
Marginson S. (2014) University Rankings and Social Science. European Journal of Education, vol. 49, no 1, pp. 45-59.
Martin C. (2009) Social Networking Usage and Grades among College Students. Available at: http://www.pdfpedia.com/download/15925/social-networking-usage-and-grades-among-college-students-pdf.html (accessed 10 October 2016).
Mislove A., Viswanath B., Gummadi K. P., Druschel P. (2010) You Are Who You Know: Inferring User Profiles in Online Social Networks. Proceedings of the Third ACM International Conference on Web Search and Data Mining, February 3-5, 2010, New York City, USA, pp. 251-260.
Newman M. E., Girvan M. (2004) Finding and Evaluating Community Structure in Networks. Physical Review E, vol. 69, no 2, 026113.
OECD (2014) PISA 2012 Technical Report, OECD: Paris.
Putnam R. (2001) Social Capital: Measurement and Consequences. Canadian Journal of Policy Research, vol. 2, no 1, pp. 41-51.
Stanton-Salazar R.D., Dornbusch S. M. (1995) Social Capital and the Reproduction of Inequality: Information Networks among Mexican-Origin High School Students. Sociology of Education, vol. 68, no 2, pp. 116-135.
Steinfield C., Ellison N. B., Lampe C. (2008) Social Capital, Self-Esteem, and Use of Online Social Network Sites: A Longitudinal Analysis. Journal of Applied Developmental Psychology, vol. 29, no 6, pp. 434-445.
Tess P. A. (2013) The Role of Social Media in Higher Education Classes (Real and Virtual)-A Literature Review. Computers in Human Behavior, vol. 29, no 5, pp. A60-A68.
Wilson R. E., Gosling S. D., Graham L. T. (2012) A Review of Facebook Research in the Social Sciences. Perspectives on Psychological Science, vol. 7, no 3, pp. 203-220.
Whitley B., Keith-Spiegel P. (2002) Academic Dishonesty: An Educator's Guide, New Jersey: Lawrence Erlbaum Associates.