Top languages in global information production

Sergey Lobachev
Casual Reference Librarian
London Public Library, ON


The amount of information produced around the world has grown rapidly in recent decades. Advances in digital technology contributed to this growth by making it possible to store large amounts of data efficiently. It was estimated that at the beginning of the new millennium the information available in print, film, optical and magnetic formats was equivalent to about "250 megabytes for each man, woman, and child on earth" (Varian and Lyman). This statistic may create the false impression that the availability of information is also growing. However, many barriers prevent access to information resources. The so-called "Digital Divide" is the best-known and most discussed of these barriers in library literature, but it is certainly not the only one (Aqili and Moghaddam; James; Berube, etc.).

This paper attempts to examine global information production from a cultural perspective. Its goal is to answer the question: "Which languages are most widely used in the production and dissemination of information?" In other words, if we were to gather all books, journals, films and web pages published and created on the planet, what part of this huge collection would be available in English, French, Spanish, Chinese and other languages? One might agree that English would be at the top of the list, but what language would follow it? What would the top ten languages be? What percentage of overall information resources might each language comprise?

Answering these questions will enable us to better understand the diversity of the information universe and to determine current trends in global information production.

Methods and Data

Information exists in numerous formats. The scope of this research was limited to those information sources which are commonly available to the public, i.e. through libraries and the Internet. These include books, academic journals, newspapers and popular magazines, films, and web pages. Government documents, archival materials, technical documentation, and computer files were excluded from this analysis, although they constitute the major part of global information resources (Varian and Lyman). The overall goal was not to provide precise and comprehensive data (an enormously difficult task), but rather to give a sense of the kind of information environment in which we are living. Obtaining comparative results was more important than calculating precise statistics.

The first step was to determine the distribution of different languages for each type of information resource. This data was summarized and compared to the percentage of the world's literate population using each corresponding language. Literacy was considered a minimum requirement for accessing and using information, which in most cases is composed of textual characters. Exceptions might include audio, video, and graphic materials. This comparison permitted an estimate of the discrepancy between the population which is potentially capable of using information and the amount of information available in different languages.

The following data were collected for the various information formats.

Books

The UNESCO Institute for Statistics (UIS) remains the only organization which provides relatively reliable numbers about global book publishing (UNESCO Institute for Statistics). The UIS collects publishing data from questionnaires distributed every second year to all member states of UNESCO. Information on non-member states and territories is collected from other sources. The disadvantages of this method are similar to those affecting all survey research. They include non-response, delays, misinterpretation of questions and unavailability of data due to political or economic circumstances (Altbach and Hoshino 165). In general, the response rate tends to be higher for countries with more planned or controlled economies.

The most recent statistics on global book publishing were released in the last edition of the UNESCO Statistical Yearbook (1999). The data was collected by country, using the general categories of the Universal Decimal Classification System (UDC). A book was defined as a non-periodical printed publication of at least 49 pages (UNESCO statistical yearbook). The accuracy and completeness of the reported data vary from year to year, due to inconsistent response rates. The most complete data on the number of book titles published worldwide is available for 1995 (total 918,964 titles) (Altbach and Hoshino 164). The language of publication was determined based on the official or widely used language of each country (Gordon and Grimes). Table 1 shows the results obtained for book production by number of titles for the top languages around the globe.

Table 1. Book publishing by language.

Language              Number of titles    Percentage of total
English               200,698             21.84 %
Chinese (Mandarin)    100,951             10.99 %
German                89,986              9.78 %
Spanish               81,649              8.88 %
Japanese              56,221              6.12 %
Russian               48,619              5.29 %
French                44,224              4.81 %
Korean                35,864              3.90 %
Italian               34,768              3.78 %
Dutch                 34,067              3.71 %
Portuguese            33,430              3.64 %
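
The percentages in Table 1 appear to follow from dividing each language's title count by the 1995 world total of 918,964 titles. The following is a minimal Python sketch of that arithmetic, not the author's own procedure; the function name share_of_total is an illustrative assumption, and results may differ from the table in the last digit depending on the rounding used.

```python
# World total of book titles for 1995, as cited in the text (Altbach and Hoshino).
WORLD_TOTAL_TITLES = 918_964

# Title counts copied from Table 1 (a subset, for illustration).
titles_by_language = {
    "English": 200_698,
    "Chinese (Mandarin)": 100_951,
    "Spanish": 81_649,
}


def share_of_total(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Hypothetical helper output: language -> percentage of all titles.
book_shares = {
    language: share_of_total(count, WORLD_TOTAL_TITLES)
    for language, count in titles_by_language.items()
}

for language, share in book_shares.items():
    print(f"{language}: {share:.2f} %")
```

The same calculation applies to Tables 2 to 5, provided the total for the corresponding format is known.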

Newspapers and magazines

The data about periodicals for a general audience, such as newspapers and magazines, was retrieved from Ulrich's Periodicals Directory. This is the world's largest database of bibliographic and publisher information, covering more than 300,000 serials of all types. The database allows serial titles to be searched by language. If a serial includes abstracts, summaries or sections in additional languages, this information is also provided (Ulrich's Periodicals Directory 1: xi). In other words, more than one language may be assigned to one periodical. Many international publications, for example, provide abstracts and summaries in English. Newspapers in minority languages often include articles written in majority languages. Counting all periodicals regardless of publication frequency could therefore introduce a significant statistical error when large amounts of data are retrieved. To reduce this potential error, the search strings were limited to daily serials, which in most cases are published in one language.

The type of document was limited to the following categories: "consumer", "trade", "newspaper" and "newsletter/bulletin". The following table represents the data for serials with "active" status in 2007. [1]

Table 2. Newspaper and magazine production by language.

Language              Number of titles    Percentage of total
English               2,499               62.55 %
Spanish               277                 6.93 %
German                235                 5.88 %
Chinese (Mandarin)    156                 3.90 %
Hindi                 117                 2.93 %
French                95                  2.38 %
Polish                44                  1.10 %
Russian               38                  0.95 %
Italian               36                  0.90 %
Portuguese            35                  0.88 %

Scholarly journals

The same method was used for scholarly journals. The document type selected in Ulrich's Periodicals Directory was "academic/scholarly". In this case, all types of serials were considered regardless of frequency of publication. Table 3 shows the number and percentage of scholarly journal titles with "active" status in 2007. [1]

Table 3. Scholarly journal production by language.

Language              Number of titles    Percentage of total
English               28,131              45.24 %
German                6,848               11.01 %
Chinese (Mandarin)    4,047               6.51 %
Spanish               3,522               5.66 %
French                3,074               4.94 %
Japanese              2,149               3.46 %
Italian               1,860               2.99 %
Polish                1,060               1.70 %
Portuguese            1,055               1.70 %
Dutch                 922                 1.48 %
Russian               808                 1.30 %

Films and video

The most comprehensive resource on film and video production is the Internet Movie Database (IMDb). It covers more than 1.2 million movies, TV episodes and series. The data is collected through voluntary submissions by people in the film industry and by visitors to the web site. The accuracy of entries is verified by professional staff, who rely on press kits, official biographies, interviews, and on-screen credits (Internet Movie Database).

The sampling was not limited to any specific genre. I assumed that any kind of video production might be considered as a source of information, regardless of intended audience. The database was searched by language of dialogue. Table 4 summarizes the number of film/video titles for the most widely used languages in the industry for the period from 1990 to 2007. [2]

Table 4. Film and video production by language.

Language              Number of titles    Percentage of total
English               158,611             34.89 %
Spanish               23,256              5.12 %
German                16,523              3.63 %
French                15,171              3.34 %
Japanese              7,811               1.72 %
Italian               4,927               1.08 %
Danish                3,967               0.87 %
Dutch                 3,445               0.76 %
Portuguese            3,213               0.71 %
Russian               2,715               0.60 %
Hindi                 2,357               0.52 %

Web pages

Few attempts have been made to estimate language disparities on the Internet. Several reports on Internet statistics by language were released between 1997 and 2004 by Global-reach, an international marketing company; Alis Technologies, a Canadian company; Vilaweb, a Catalan electronic newspaper; and the educational web site Netz-tipp (Murray 35-36; Gorski and Clark 30-34; Web Languages Hit Parade). [3] These reports suggest that the proportion of English content has decreased over time. In 1997, English web pages comprised 82.3 % of the World Wide Web, but in 2002 they comprised only 56.4 %.

Table 5 presents the latest statistics on language distribution on the Internet. The data was obtained by performing searches in Google and Alltheweb with the language filter set to each respective language (Ebbertz).

Table 5. Distribution of languages on the Internet.

Language              Web pages (millions)    Percentage of total
English               1,142.5                 56.43 %
German                156.2                   7.71 %
French                113.1                   5.59 %
Japanese              98.3                    4.86 %
Spanish               59.9                    2.96 %
Chinese (Mandarin)    48.2                    2.38 %
Italian               41.1                    2.03 %
Dutch                 38.8                    1.92 %
Russian               33.7                    1.66 %
Korean                30.8                    1.52 %
Portuguese            29.4                    1.45 %

Literate population

To take advantage of most information resources, one must be able to understand written text, in other words, to be literate. Traditionally, "literacy" is defined as the ability to read and write short, simple statements in any language. [4] From this perspective, it is more appropriate to compare the summarized results of information production with the literate population than with the total number of native speakers.

The literacy data was obtained from CIA statistics and from the "Ethnologue", a catalogue of world languages published by SIL International. [5] The literate population was estimated from the total population of each country where a given language is spoken and the literacy rate in that country (The World factbook; Gordon and Grimes). If more than one language is spoken in a country, the same literacy rate was applied to each language. When the two sources contradicted each other, preference was given to the data provided by the CIA.
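
The estimate described above can be sketched in Python as follows. This is an illustrative reading of the method, not the author's actual procedure: it assumes that a speaker count per language is available for each country and that the national literacy rate is applied to every language spoken there. The records, country names and language names are invented placeholders and do not come from the World Factbook or the Ethnologue.

```python
from collections import defaultdict

# Hypothetical records: (country, language, speakers, national literacy rate).
# All figures are invented for illustration only.
records = [
    ("Country A", "Language X", 10_000_000, 0.90),
    ("Country B", "Language X", 3_000_000, 0.80),
    ("Country B", "Language Y", 2_000_000, 0.80),  # same rate as the row above
]


def literate_population_by_language(rows):
    """Sum speakers * literacy rate over all countries, grouped by language."""
    totals = defaultdict(float)
    for _country, language, speakers, literacy_rate in rows:
        totals[language] += speakers * literacy_rate
    return dict(totals)


literate = literate_population_by_language(records)

for language, people in literate.items():
    print(f"{language}: about {round(people):,} literate speakers")
```

Dividing each total by the world's literate population would then give the percentage column of a table such as Table 6.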

Table 6. Literate population of the world.

Language              Literate population    Percentage of the world's literate population
Chinese (Mandarin)    794,947,565            14.68 %
English               572,977,034            10.58 %
Spanish               295,968,824            5.47 %
Hindi/Urdu            230,560,488            4.26 %
Arabic                229,444,922            4.24 %
French                220,326,329            4.07 %
Russian               194,503,049            3.59 %
Portuguese            191,739,619            3.54 %
Japanese              126,159,159            2.33 %
Bengali               107,897,009            1.99 %
German                93,969,555             1.74 %

Findings and conclusions

The results for information production in different languages are shown in Table 7. For each language, the average of its percentage of total information production across all formats was calculated and compared to the percentage of the literate population using that language. As described above, the calculation was based on data derived from different sources, the accuracy and quality of which may vary depending on the methods used for data collection. There are also chronological gaps: not all data is available for the same periods of time.

Do these limitations make the calculation unreliable? The answer depends on the purpose for which the statistics are to be used. I was primarily interested in finding comparative data rather than accurate numbers. My goal was to determine the most common languages in global information production and the proportion of information resources available in those languages. The possible errors do not significantly change the final conclusion.

Table 7. Information users and information production in most spoken languages.

Language              Literate population    Information production
English               10.58 %                44.29 %
German                1.74 %                 7.60 %
Spanish               5.47 %                 5.91 %
Chinese (Mandarin)    14.68 %                4.85 %
French                4.07 %                 4.21 %
Japanese              2.33 %                 3.34 %
Italian               1.09 %                 2.16 %
Russian               3.59 %                 1.96 %
Portuguese            3.54 %                 1.68 %
Dutch                 0.43 %                 1.67 %
Korean                1.36 %                 1.20 %
Hindi                 4.26 %                 0.96 %
Arabic                4.24 %                 0.43 %
Bengali               1.99 %                 0.12 %
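
The "Information production" column is described as the average of a language's share across the five formats. A minimal Python sketch of that averaging is shown below for two languages that appear in all of Tables 1 to 5. It assumes a simple unweighted mean; the helper name mean_share is hypothetical, and the sketch does not cover languages missing from one or more of the format tables, since the text does not say how those cases were handled.

```python
# Percentage shares copied from Tables 1 to 5, in the order:
# books, newspapers and magazines, scholarly journals, films, web pages.
shares_by_format = {
    "German": [9.78, 5.88, 11.01, 3.63, 7.71],
    "Spanish": [8.88, 6.93, 5.66, 5.12, 2.96],
}


def mean_share(shares):
    """Unweighted mean of the per-format percentages, rounded to two decimals."""
    return round(sum(shares) / len(shares), 2)


information_production = {
    language: mean_share(shares)
    for language, shares in shares_by_format.items()
}

for language, value in information_production.items():
    print(f"{language}: {value:.2f} %")
```

For these two languages the sketch gives 7.60 % and 5.91 %, the values listed in Table 7.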

As can be seen from Table 7, almost 78 % of all information in the world is produced in the following ten languages: English, German, Spanish, Chinese (Mandarin), French, Japanese, Italian, Russian, Portuguese, and Dutch. English dominates the global information space and accounts for more than 44 % of printed and electronic materials. German follows English and comprises 7.6 % of global information production.

Mandarin, the most spoken language in the world, accounts for 14.68 % of the world's literate population, but only 4.85 % of global information resources are produced in this language.

Other widely spoken languages include Hindi, Arabic, Bengali, and Korean, yet the number of information resources in these languages is relatively small. For example, about 229 million people are literate in Arabic, which constitutes 4.24 % of the world's entire literate population, but only 0.43 % of all information is available in the Arabic language.
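
The comparisons drawn from Table 7 can be checked with a short Python sketch. It sums the ten largest production shares and, as an illustrative measure that is not used in the paper itself, divides each language's production share by its literate-population share. The names top_n_production and representation_ratio are hypothetical.

```python
# (literate population %, information production %) copied from Table 7.
table7 = {
    "English": (10.58, 44.29),
    "German": (1.74, 7.60),
    "Spanish": (5.47, 5.91),
    "Chinese (Mandarin)": (14.68, 4.85),
    "French": (4.07, 4.21),
    "Japanese": (2.33, 3.34),
    "Italian": (1.09, 2.16),
    "Russian": (3.59, 1.96),
    "Portuguese": (3.54, 1.68),
    "Dutch": (0.43, 1.67),
    "Korean": (1.36, 1.20),
    "Hindi": (4.26, 0.96),
    "Arabic": (4.24, 0.43),
    "Bengali": (1.99, 0.12),
}


def top_n_production(data, n):
    """Sum the n largest information-production shares."""
    ranked = sorted((production for _literate, production in data.values()), reverse=True)
    return round(sum(ranked[:n]), 2)


def representation_ratio(literate_share, production_share):
    """Production share divided by literate-population share.

    Values above 1 mean a language produces more than its share of
    literate users would suggest; values below 1 mean it produces less.
    """
    return production_share / literate_share


top_ten_share = top_n_production(table7, 10)
ratios = {
    language: round(representation_ratio(literate, production), 2)
    for language, (literate, production) in table7.items()
}

print(f"Top ten languages: {top_ten_share:.2f} % of information production")
for language, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{language}: {ratio:.2f}")
```

The top-ten sum comes to 77.67 %, consistent with the "almost 78 %" figure, and the ratio is above 1 for English and below 1 for Mandarin and Arabic.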

What do these numbers mean?

First of all, they can measure the importance of a particular language. Its rank is not necessarily related to its share of the literate population, but rather depends upon the level of cultural and economic development of the countries where the language is used.

Secondly, they underline the gap between the users of information and available information resources. They clearly show how the "language divide" contributes to the exclusion of countries and peoples from universal knowledge. This primarily concerns countries with low literacy rates and poor education. At the same time, the educated community tends to view English as a universal language. Many countries have special programs which encourage citizens to achieve proficiency in English (Weber). [6]

Nevertheless, we need to realize that more than half of the world's information resources are produced in languages other than English. These resources will likely continue to grow in the near future. The "Global Trends 2025" report, recently released by the US National Intelligence Council, projected an increasing role for Brazil, Russia, China and India in the world economy (BBC News). If this forecast proves accurate, we may expect information production in Portuguese, Russian and Chinese to rise.

This trend must be taken seriously by publishers and vendors in English-speaking countries, where non-English resources are largely ignored. According to the Bowker publishing group, only 3 % of all books available for sale in the United States are new translations from other languages (English-Speaking Countries). The term "language divide" can be equally applied to the English-speaking world.

There are many opportunities for librarians to respond to the challenges of multilingualism in information production. One of them is by providing equal access to information resources regardless of the language of origin. Today, however, most widely-distributed indexing databases cover primarily English content. Non-English materials are not fully searchable, and access to full-text electronic articles from non-English periodicals is not always available.

Future work

This paper has outlined the language profile of global information production. It should be considered a first step toward further research, which may focus on the following aspects:

  • Finding more comprehensive and more reliable statistics for information production in the most widely used languages.
  • Widening the scope of the research by taking into account as many formats of information as possible (archival documents, government publications, digital resources, audio recordings, etc.).
  • Broadening the definition of literacy and examining the role of multilingualism in literacy and information production.
  • Finding historical data, which would help us understand the evolution of information production in different languages over recent decades or centuries. These statistics would also help to determine whether the prevalence of the English language is declining in various formats of information resources.
  • Qualitative analysis of the content of information produced in different languages.
  • Comparative analysis of languages of library collections around the world.


An earlier version of this research was presented as a poster session at the CLA Conference in Vancouver in May 2008. I would like to thank Edwin Perry, Liaison Librarian at the University of Regina, and two anonymous reviewers for their comments and suggestions.

[1] The data was retrieved in April 2008.

[2] The data was retrieved on 18 April 2008.

[3] The Vilaweb data is presented on the web site of the ClickZ Network, which specializes in providing electronic marketing news, information, commentary, research, and reference services. See <http://www.clickz.com/showPage.html?page=408521>. The original Vilaweb report is no longer available online; however, it is often cited in scholarly literature. The data collected by Vilaweb is very close to the results obtained by OCLC, which conducted the Web Characterization research project in 1998-2002. The goal of the project was to analyze the size and content of the Web based on samples of publicly available resources (O'Neill, Lavoie and Bennett; OCLC).

[4] This definition is used by UNESCO for collecting literacy data around the world. See <http://www.uis.unesco.org/ev.php?ID=5013_201&ID2=DO_TOPIC>.

[5] The data was retrieved in April 2007. SIL International (Summer Institute of Linguistics) is a faith-based organization that studies, documents, and assists in developing the world's lesser-known languages. Its premier publication covers more than 6,900 living languages (Gordon and Grimes).

[6] George Weber ranked the top ten most influential languages as follows: English, French, Spanish, Russian, Arabic, Chinese, German, Japanese, Portuguese, Hindi. The results were based on the analysis of six main factors: number of primary speakers, number of secondary speakers, economic power of countries using the language, number of major areas of human activity in which the language is important, population of countries using the language, and socio-literary prestige (Weber).

