Phrase Compare Report

Analyze key words and phrases.




The Phrase Compare Report generates key words and phrases (n-grams). Use this report to find common (and uncommon) words and phrases, including overused phrases, important themes, and other textual features.

Calculate n-grams by analyzing one text, by comparing two texts, or by comparing a text with a frequency list.

How to Use:
  1. Open book(s).
  2. In the WordCruncher toolbar, go to Analyze > Book Reports > Phrase Compare Report (N-grams).
Phrase Compare (N-Grams)


N-Grams Within One Text

Generate the words and phrases (n-grams) that occur in a single text. View repeated and non-repeated n-grams.


How to Use:
  1. Book 1 should show the book name (Book 2 should be set to None).
    Optional: Add bounds to specify a section of the book.
  2. Select a Word list from the drop-down box.
  3. Select Options. (Setting a max phrase length of 9 will calculate words and phrases from 1–9 words long.)
  4. Click Compare.
View Results
Length Drop-down

View n-grams of different lengths.

Phrases Drop-down

Repeated: All n-grams that occur two or more times.

Not repeated: All n-grams that occur only once.
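The idea behind the Length and Phrases drop-downs can be sketched in a few lines of Python. This is only an illustration, not WordCruncher's implementation: the tokenizer (lowercase, split on whitespace) and the function names `count_ngrams` and `split_repeated` are assumptions made for the example.

```python
from collections import Counter

def count_ngrams(text, max_len=3):
    """Count every n-gram from 1 to max_len words long."""
    # Assumption: a naive tokenizer (lowercase, split on whitespace).
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def split_repeated(counts, length):
    """Split the n-grams of one length into repeated (2+) and not repeated (1)."""
    repeated = {g: c for g, c in counts.items() if len(g) == length and c >= 2}
    once = {g: c for g, c in counts.items() if len(g) == length and c == 1}
    return repeated, once

counts = count_ngrams("and it came to pass and it came again", max_len=3)
repeated, once = split_repeated(counts, 2)
print(repeated)  # the 2-grams "and it" and "it came" each occur twice
print(len(once))  # four 2-grams occur only once
```

A max phrase length of 3 here counts 1-grams, 2-grams, and 3-grams in one pass, which mirrors how a max phrase length of 9 yields phrases from 1–9 words long.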

N-grams within one text


Compare Two Books

Generate words and phrases (n-grams) that occur across two texts. Compare the texts, identifying phrases that occur more in one text than another, phrases that only occur in one text, and phrases that occur in both texts.


How to Use:
  1. Under Book 1, select the first book. This will be labeled as 1 in the report.
  2. Under Book 2, select the second book. This will be labeled as 2 in the report.
  3. Select Options. (Setting a max phrase length of 9 will calculate words and phrases from 1–9 words long.)
  4. Click Compare.
View Results
Length Drop-down

View n-grams of different lengths.

Phrases Drop-down

Choose the n-gram comparison (e.g., 1>2 will show phrases that occur more in Book 1 than in Book 2).
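As a rough illustration of the comparison groups described above (phrases only in one text, and phrases in both), here is a minimal sketch. The helper names and the whitespace tokenizer are hypothetical. Deciding whether a shared phrase belongs in 1>2 depends on the BIC statistic described later, which this sketch does not compute.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams of exactly n words in a text (naive tokenizer)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def presence_groups(counts1, counts2):
    """Group n-grams by where they occur: only Book 1, only Book 2, or both."""
    only1 = set(counts1) - set(counts2)
    only2 = set(counts2) - set(counts1)
    both = set(counts1) & set(counts2)
    return only1, only2, both

book1 = ngram_counts("the cat sat on the mat", 2)
book2 = ngram_counts("the dog sat on the rug", 2)
only1, only2, both = presence_groups(book1, book2)
print(sorted(both))  # the 2-grams shared by both texts
```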

Compare two books


Compare Two Sections of One Book

Generate words and phrases (n-grams) that occur across two sections of one book. Compare the sections, identifying phrases that occur more in one section than the other, phrases that only occur in one section, and phrases that occur in both sections.


How to Use:
  1. Book 1 should show the book name. Use bounds to select the first section. This will be labeled as 1 in the report. For example, use the table of contents bounds:
    1. Select the Bounds drop down > Table of Contents Bounds.
    2. Select checkbox(es).
    3. Click OK.
  2. Under Book 2, select the same book.
  3. Under Book 2, use bounds to select the second section. This will be labeled as 2 in the report.
  4. Select Options. (Setting a max phrase length of 9 will calculate words and phrases from 1–9 words long.)
  5. Click Compare.
Examples:

The five-word phrase (n-gram) "and it came to pass" occurs significantly more often in Genesis and Exodus than in Leviticus, Numbers, and Deuteronomy. This may suggest that the texts contain different types of writing (e.g., historical narrative).

In the TED Corpus (English), the n-grams "pandemic," "vaccine," and "public health" are especially frequent in 2020 when compared against other years (such as 2019).

Compare two sections of the Scriptures


Compare to an External Frequency List

Instead of comparing two books, compare a book to an external word/phrase frequency list.


How to Use:
  1. Book 1 should show the book name. This will be labeled as 1 in the report.
  2. Under Book 2, select the Name drop-down > Open a Phrase Frequency List. This will be labeled as 2 in the report.
  3. Select a frequency list.
  4. Click Compare.
COCA Frequency List

Try out the Corpus of Contemporary American English (COCA) word/phrase frequency list (n-grams up to 5 words long). While this file is just a sample of the entire corpus, the relative frequencies provide all the information necessary to generate n-gram data.


Create a Word/Phrase Frequency List

A word/phrase frequency list can be formatted as a tab-separated CSV or TXT file. Download sample lists from GitHub.

Note: If you are using a TXT, add an empty line at the end of the frequency list file.

Required columns

There are 3 required columns for your frequency list:

  • Column 1: Len – The n-gram length (number of words).
  • Column 2: Freq – The frequency of the word/phrase.
  • Column 3: Phrase – The word/phrase.
Example:
Len  Freq   Phrase
1    3,678  and
2    89     and it
3    62     and it came
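If you build a frequency list with a script, the three required columns can be written out as tab-separated text. The sketch below is one possible way to do it. The function name `format_frequency_list` is hypothetical, the thousands separators simply mirror the sample above (whether plain digits are also accepted is not stated here), and the trailing newline is one reading of the note about ending a TXT file with an empty line.

```python
def format_frequency_list(rows):
    """Build tab-separated text with the required Len, Freq, and Phrase columns.

    rows: iterable of (phrase, frequency) pairs.
    """
    lines = ["Len\tFreq\tPhrase"]
    for phrase, freq in rows:
        length = len(phrase.split())  # n-gram length = number of words
        lines.append(f"{length}\t{freq:,}\t{phrase}")
    # Assumption: ending with a newline leaves an empty final line in the file.
    return "\n".join(lines) + "\n"

text = format_frequency_list([("and", 3678), ("and it", 89), ("and it came", 62)])
print(text)

# To save the list, write the text to a file, for example:
# with open("my_list.txt", "w", encoding="utf-8") as f:
#     f.write(text)
```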
Additional column flags

To indicate any of the following settings, add column headers in order. Leave a blank column for each that you skip. If the column header exists, a flag will be set.

  • Column 4: Ignore case – Ignore case was selected when calculating n-grams.
  • Column 5: Ignore diacritics – Ignore diacritics was selected when calculating n-grams.
  • Column 6: (The title of your book) – The name of the book the frequency list was generated from.
  • Column 7: (The Total Frequency of Relative Corpus) – For frequency list samples.
    By default, the report assumes the total frequency is based on the 1-grams in your frequency list. If your frequency list is a sample of a larger list, use the total frequency of all 1-grams in the complete list as the header of this column.
Examples:
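With the Ignore case and Ignore diacritics flags set and a book title:

Len  Freq   Phrase       Ignore Case  Ignore Diacritics  Scriptures
1    3,678  and
2    89     and it
3    62     and it came

With columns 4 and 5 left blank, a book title in column 6, and the total frequency (38,262) in column 7:

Len  Freq   Phrase       Scriptures  38,262
1    3,678  and
2    89     and it
3    62     and it came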
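The header row carries the optional flags, so the position of each header matters. A minimal sketch of building that row follows. The function `build_header` and its parameters are hypothetical names, and the header labels are taken from the examples above.

```python
def build_header(ignore_case=False, ignore_diacritics=False, title="", total=None):
    """Build the tab-separated header row, leaving skipped columns blank."""
    cols = [
        "Len", "Freq", "Phrase",                         # columns 1-3 (required)
        "Ignore Case" if ignore_case else "",            # column 4
        "Ignore Diacritics" if ignore_diacritics else "",  # column 5
        title,                                           # column 6
        f"{total:,}" if total is not None else "",       # column 7
    ]
    # Drop unused columns at the end; blanks in the middle must stay.
    while cols and cols[-1] == "":
        cols.pop()
    return "\t".join(cols)

print(repr(build_header(True, True, "Scriptures")))
print(repr(build_header(title="Scriptures", total=38262)))
```

In the second call, columns 4 and 5 appear as two empty fields between Phrase and Scriptures, which keeps the title in column 6 and the total in column 7.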


Phrase Compare Report Columns

The Phrase Compare Report generates a table with data about each n-gram, its frequency, and more.

Num

The order of results.

Len

The length of the n-gram (number of words).

Phrase

The n-gram itself.

1.Freq and 2.Freq

The frequency of each n-gram in Book 1 (1.Freq) and Book 2 (2.Freq), if applicable.

1.RelF and 2.RelF

The relative frequency (RelF) per million words, a normalized measure, in Book 1 (1.RelF) and Book 2 (2.RelF).

Relative frequencies estimate how many times each phrase would occur if both books contained exactly one million words. For example, if a phrase occurred 10 times in 100 words (10%), and 10 times in 1000 words (1%), the raw frequencies are the same, but the relative frequencies are 100,000 and 10,000, respectively.
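The arithmetic in this example can be checked with a short sketch. The function name `relative_frequency` is chosen for illustration only.

```python
def relative_frequency(freq, total_words, per=1_000_000):
    """Scale a raw frequency to occurrences per million words."""
    return freq * per / total_words

print(relative_frequency(10, 100))   # 100000.0
print(relative_frequency(10, 1000))  # 10000.0
```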

1.Exp and 2.Exp

The expected frequency of each n-gram in Book 1 (1.Exp) and Book 2 (2.Exp).

Expected frequencies are based on the probability of a result multiplied by the number of tries. For example, if you flip a coin 10 times, the expected number of heads would be 5 since the probability of heads is 50%. For a more detailed explanation of expected frequency, visit Phrase Compare Statistics.
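As a rough sketch, the coin example is simply probability times tries. For two texts, a common convention in corpus comparison is to pool both texts to estimate a phrase's probability and then multiply by each text's size. That pooled formula is an assumption here, not something this page confirms; see Phrase Compare Statistics for the report's actual calculation.

```python
def expected_count(probability, tries):
    """Expected number of occurrences: probability multiplied by tries."""
    return probability * tries

def expected_frequencies(freq1, total1, freq2, total2):
    """Expected frequency of a phrase in each text if both used it at the same rate.

    Assumption: the shared rate is estimated by pooling both texts.
    """
    pooled_rate = (freq1 + freq2) / (total1 + total2)
    return expected_count(pooled_rate, total1), expected_count(pooled_rate, total2)

print(expected_count(0.5, 10))  # 5.0 heads in 10 coin flips
exp1, exp2 = expected_frequencies(10, 100, 10, 1000)
print(round(exp1, 2), round(exp2, 2))
```

Under this assumption the two expected values always add up to the phrase's combined raw frequency.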

BIC

Bayesian Information Criterion (BIC): the main statistic used for comparing phrases.

BIC identifies phrases that occur significantly more often in Book 1 or in Book 2, as well as phrases with no significant difference between the books. Significance is calculated from each book’s relative frequencies rather than its raw frequencies.

List     BIC
1 > 2    BIC ≥ 2
2 > 1    BIC ≤ -2
1 ≈ 2    -2 < BIC < 2

Each list is sorted by the SMP100 column.
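The thresholds in the table translate directly into a small classifier. This sketch takes a BIC value as given (for example, copied from the report) and does not attempt to compute BIC itself. The function name `classify_bic` is hypothetical.

```python
def classify_bic(bic):
    """Assign a phrase to a list using the BIC thresholds from the table."""
    if bic >= 2:
        return "1 > 2"
    if bic <= -2:
        return "2 > 1"
    return "1 ≈ 2"

for value in (7.4, -3.1, 0.8):
    print(value, classify_bic(value))
```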

SMP100

The Simple Maths Parameter (SMP) compares the relative frequency of each word or phrase in two texts.
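For orientation, the simple maths approach as commonly described in corpus linguistics adds a constant to both relative frequencies before dividing them. The sketch below assumes that form and assumes the "100" in SMP100 is that constant. Neither assumption is confirmed on this page, so treat Phrase Compare Statistics as the authority.

```python
def simple_maths(rel_freq1, rel_freq2, constant=100):
    """Ratio of two per-million relative frequencies, smoothed by a constant.

    Assumption: constant=100 corresponds to the "100" in SMP100.
    """
    return (rel_freq1 + constant) / (rel_freq2 + constant)

print(simple_maths(200, 50))  # 2.0
print(simple_maths(0, 0))     # 1.0
```

Adding the constant avoids dividing by zero when a phrase is missing from Book 2 and reduces the weight of very rare phrases.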

For more information about the columns and statistics in the Phrase Compare Report, visit Phrase Compare Statistics.



Type-to-Token Ratio (TTR)

Type-to-token ratio (TTR) is a measure of lexical diversity. It assigns a number to the richness of a text's vocabulary.

  • Types: The number of unique words.
    There are four types in the phrase "The Cat in the Hat" (the, cat, in, hat), counting "The" and "the" as one word.
  • Tokens: The number of total words.
    There are five tokens in the phrase "The Cat in the Hat."

The higher the ratio, the wider the vocabulary.

Examples:
  • J.K. Rowling’s Harry Potter series has a TTR of about 0.023 because Rowling uses about 25,500 unique words over a series of roughly 1.1 million words.
  • Brandon Sanderson’s The Way of Kings has a TTR of 0.04 because Sanderson uses about 15,000 unique words over a 380,000-word book.
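A minimal sketch of the ratio itself, assuming a simple tokenizer that lowercases the text and splits on whitespace (the report's own tokenization may differ):

```python
def type_token_ratio(text):
    """Return (types, tokens, ratio) for a text, ignoring case."""
    tokens = text.lower().split()
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(types), len(tokens), ratio

print(type_token_ratio("The Cat in the Hat"))  # (4, 5, 0.8)
print(round(15_000 / 380_000, 2))              # 0.04
```

Short texts produce high ratios (0.8 here) because few words have had a chance to repeat, so TTR values are best compared between texts of similar length.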
Calculating TTR

Type-to-token ratio data is calculated when you run the Phrase Compare Report. Click on the button to view summarized TTR statistics.

To access detailed TTR data, export the data to a CSV or TXT file. This format can be processed by other programs including our Type-to-Token Visualizer web tool, which displays a MATTR (moving average type-to-token ratio) visualization.


How to Use:
  1. Run the Phrase Compare Report.
  2. Click Save results > Export Type-to-Token Ratio files.
  3. Choose one of the following:
    1. Segments report: A detailed TTR report from word to word. Points of significance within a text can be identified at the word level.
      By default, a segment is 1000 words long. To change this, adjust the Words in TTR segment setting before you run the Phrase Compare Report.
    2. Levels report: A summarized TTR report for each reference level of the book (the table of contents). Rather than looking at changes in the MATTR from word to word, this provides the MATTR for each reference level.
      1. Choose whether to export the summary report or the complete report.
        The summary report excludes the lowest level references (e.g., paragraphs or verses).
  4. Name the file and click Save.
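To show what a moving average type-to-token ratio measures, here is a minimal sketch: it computes the TTR of every window of consecutive words, sliding one word at a time, and then averages the results. Treating the 1000-word TTR segment as the window size is an assumption, and the function names are hypothetical. The exported files may be organized differently.

```python
from collections import Counter

def window_ttrs(tokens, window=1000):
    """TTR of each window of `window` consecutive tokens, moving one token at a time."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Text shorter than one window: fall back to a single plain TTR.
        return [len(set(tokens)) / len(tokens)] if tokens else []
    counts = Counter(tokens[:window])
    ttrs = [len(counts) / window]
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]  # the type no longer occurs in the window
        counts[tokens[i]] += 1
        ttrs.append(len(counts) / window)
    return ttrs

def mattr(tokens, window=1000):
    """Moving average type-to-token ratio: the mean of all window TTRs."""
    ttrs = window_ttrs(tokens, window)
    return sum(ttrs) / len(ttrs) if ttrs else 0.0

tokens = "a a b b".split()
print(window_ttrs(tokens, window=2))  # [0.5, 1.0, 0.5]
```

Because every window has the same length, MATTR values are less sensitive to overall text length than a single TTR for the whole text.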
Type-to-Token Visualizer Web Tool
MATTR Visualizer Web Tool

For more information about TTR, see Phrase Compare TTR Statistics.



Export

Export data from the Phrase Compare report, including n-grams, frequency lists, and type-to-token ratio.

Export a Table

Export the current table. If you want multiple tables of n-gram data, export them individually or copy each table using Ctrl + C.

How to Use:
  1. Run the Phrase Compare Report.
  2. Click the Save results drop-down > Export All (Phrase Compare).
  3. Name the file and click Save.
Export a Frequency List

Generate a complete frequency list of all n-grams. If you calculate the n-grams with a max phrase length of 5, you’ll get n-grams with a length of 1, 2, 3, 4, and 5.

How to Use:
  1. Run the Phrase Compare Report.
  2. Click the Save results drop-down > Export Phrase Frequency List.
  3. Choose Book 1 or Book 2.
  4. Name the file and click Save.