How does Google Scholar indexing work?

What is Google Scholar and what content does it index?

Like Google, Google Scholar is a crawler-based search engine. Crawler-based search engines are able to index machine-readable metadata or full-text files automatically using “web crawlers,” also known as “spiders” or “bots,” which are automated internet programs that systematically “crawl” websites to identify and ingest new content.

Google Scholar includes content across academic disciplines, from all countries, and in all languages. It can access all of the crawlable scholarly content published on the web: it can index entire publisher and journal websites, and it can follow the citations in articles it has already indexed to find related content (you can read more about this on their website here).

Please note: Google Scholar only indexes published content. Your journal homepage, author guidelines page, and similar pages will not appear in Google Scholar's search results.

How do articles get indexed by Google Scholar?

In order for your journals to be considered for inclusion in Google Scholar, the content on your website must first meet two basic criteria:

  1. Consist primarily of journal articles (e.g. original research articles, technical reports)
  2. Make freely available either the full text or the complete author-written abstract for all articles (without requiring human or search engine robot readers to log into your site, install specific software, accept any disclaimers, etc.). This criterion is met by default when you publish on Scholastica's Open Access platform.

From there, your journal website and articles must meet certain technical specifications, because Google Scholar uses automated software, known as "parsers," to identify the bibliographic data of your papers as well as the references between them.

Incorrect identification of bibliographic data or references leads to poor indexing of your site. Some documents may not be included at all, some may be included with incorrect author names or titles, and some may rank lower in the search results because their (incorrect) bibliographic data does not match the (correct) references to them from other papers. To avoid such problems, Scholastica provides bibliographic data and references in a form that automated parser software can process.

How does Scholastica help ensure that my articles are indexed by Google Scholar?

Scholastica’s OA publishing platform is configured to export bibliographic data in HTML "<meta>" tags (e.g., citation_title), which is the format that Google Scholar prefers for indexing.

These are the most common tags provided in the HTML of each article published on your journal website:

  • citation_title
  • citation_article_type
  • citation_publisher
  • citation_journal_title
  • citation_doi
  • citation_journal_abbrev
  • citation_publication_date
  • citation_author
  • citation_firstpage
  • citation_abstract
  • citation_copyright
  • citation_pdf_url
  • citation_issue
  • citation_volume

Each tag is only present if you provided the corresponding citation information during publishing. A missing tag will not prevent your article from being indexed. For example, if your article does not belong to an issue or volume, those tags will not appear in the article's HTML, but the article will still be indexed.
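To illustrate how optional fields behave, here is a minimal Python sketch of rendering citation tags from article metadata. It is not Scholastica's actual implementation: the function name render_citation_meta, the sample article values, and the date format shown are all assumptions made for the example. Fields with no value simply produce no tag, and a field with several values (such as multiple authors) is assumed to produce one tag per value.

```python
from html import escape


def render_citation_meta(article: dict) -> str:
    """Build one <meta> tag per citation field that has a value (hypothetical helper)."""
    lines = []
    for name, value in article.items():
        if value is None or value == "":
            continue  # field was not provided, so no tag is emitted
        # A list value (e.g. several authors) yields one tag per entry.
        values = value if isinstance(value, list) else [value]
        for item in values:
            content = escape(str(item), quote=True)
            lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# Hypothetical article that does not belong to an issue or volume.
article = {
    "citation_title": "An Example Article",
    "citation_author": ["Doe, Jane", "Roe, Richard"],
    "citation_publication_date": "2024/01/15",  # format is an assumption
    "citation_volume": None,
    "citation_issue": None,
}

tags = render_citation_meta(article)
print(tags)
```

In this sketch the output contains a title tag, two author tags, and a date tag, and no citation_volume or citation_issue tags at all.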

How long will it take for my articles to be indexed?

Google Scholar indexing isn't guaranteed, nor is it immediate. Crawling your published article can take approximately 6-8 weeks from the date you publish.

Searching in Google Scholar works the same way as searching in Google. The simplest way to check whether an article has been indexed is to enter its title into the search bar. For more detailed instructions and tips, click here to visit Google Scholar’s help site.

My article doesn’t appear in Google Scholar - What should I do?

As mentioned above, Google Scholar generates its entries using an automated web crawler that compiles information about an article from sources across the internet (read more here). This process is proprietary, unpredictable, and tends to be slower than the Google Search crawler. Additionally, Google Scholar gives the public little ability to trigger or control when and how articles are indexed, and no way to manually correct an entry that appears incorrectly or does not appear at all.

At Scholastica, we ensure that every article published through our system has rich metadata that is readily available online (see the meta tags we use above). This way, when the web crawler arrives it encounters the right information to successfully index your article. 
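If you would like to see which citation tags an article page exposes, the following Python sketch reads them out of a page's HTML using only the standard library. It is an illustration, not Google Scholar's parser (which is proprietary); the class name CitationMetaParser, the helper extract_citation_tags, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*"> tags from an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}  # tag name -> list of content values, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("citation_"):
            self.tags.setdefault(name, []).append(attr.get("content") or "")


def extract_citation_tags(html: str) -> dict:
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.tags


# Made-up page source standing in for a published article page.
sample_html = """
<html><head>
<meta name="citation_title" content="An Example Article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

found = extract_citation_tags(sample_html)
for name, values in found.items():
    print(name, values)
```

To try this on a real article, you could paste the page source (from your browser's "View page source" option) in place of sample_html.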

Sometimes, for unknown reasons, articles fail to be indexed in the expected timeframe. If this happens, start by checking the timeframe: how long ago did you publish the article? Remember, Google estimates that it can take 6-8 weeks for a published article to be indexed. You may just need to wait a little longer.
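As a quick way to apply the timeframe check, this small Python sketch tests whether an article is past the upper end of the 6-8 week estimate. The function name past_indexing_window and the example dates are hypothetical, and the 8-week cutoff is only the estimate quoted above, not a guarantee.

```python
from datetime import date, timedelta

# Upper end of the 6-8 week estimate; an estimate, not a guarantee.
INDEXING_WINDOW = timedelta(weeks=8)


def past_indexing_window(published: date, today: date) -> bool:
    """Return True if more than 8 weeks have passed since publication."""
    return today - published > INDEXING_WINDOW


# Hypothetical publication date checked on two different days.
published = date(2024, 1, 1)
still_waiting = past_indexing_window(published, date(2024, 2, 1))
worth_investigating = past_indexing_window(published, date(2024, 3, 15))
print(still_waiting, worth_investigating)
```

If the check returns False, waiting longer is usually the right next step; if it returns True, it is reasonable to move on to the troubleshooting steps.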

You can learn more about Google Scholar's troubleshooting recommendations on their website here. If you are still experiencing difficulties, you can reach out to Google's Help Center for assistance. To help set your expectations, Google Scholar's team is small, so it may take weeks or months to receive a response.

My articles show up in Google Scholar, but not how I want (e.g. my website isn't the top result or the HTML version is missing) - what can I do?

According to Google Scholar, "the most common cause of indexing problems is incorrect extraction of bibliographic data by the automated parser software." Since Google Scholar's algorithm for selecting the version that is indexed and linked is proprietary, publishers and hosting platforms have little control over how results are shown.

From Google Scholar: "The best way to fix incorrect bibliographic data is to provide it in a computer-readable form in the meta tags, as described in the indexing guidelines. Keep in mind that, since these papers are already included in Google Scholar, updating their bibliographic data will usually take 6-9 months from the time you provide it on your website."

If you've changed article hosting in the last 1-2 years and are seeing stale results, keep in mind that, per Google Scholar's troubleshooting notes, "updates of papers that are already included usually take 6-9 months," and they could take longer.

If your Scholastica website does not show as the top result, remember that Open Access (OA) articles are often available from many different sources. For example, Google Scholar's crawler might find an OA article in multiple places:

  • On the publisher's website
  • In an institutional repository (or multiple)
  • On ResearchGate
  • On an author's personal website

Google Scholar might find all of these versions of the article and then decide how each version is indexed and which one is shown as the primary version in search results. This decision is ultimately controlled by Google Scholar, not by the publisher or hosting platform.