Hermetic Word Frequency Counter 8.21
Hermetic Word Frequency Counter 8.21 is an application that scans a text file, or text on the clipboard, and counts the number of occurrences of each distinct word (optionally ignoring common words such as 'this'). The words found can be displayed in alphabetical or frequency order, with rank and frequency shown for each word.
The term 'word' usually means a word in a natural language such as English or French, but for this software it has an extended meaning: Any sequence of characters consisting of letters from a European language plus (optionally) hyphens, numerals, underscores, colons, periods, apostrophes, @-signs, and forward and backward slashes. Thus not only can the text being scanned be in a language other than English, it can even be in a computer language such as C. The software even allows you to count words which include @-signs (if you are interested in email addresses).
There is also an Advanced Version which does everything the standard version does and can also scan not just one file but all files in a folder (optionally including all subfolders), returning a single report on the frequencies of words in all files scanned.
- Scannable Files
- The file upon which the program acts can have any filename extension, but it must consist almost entirely of text characters (either 8-bit text or 16-bit Unicode text). More exactly, every byte must either have a value in the range 32 through 255 or be one of the characters treated as whitespace: linefeed (byte value 10), carriage return (13), tab (9), backspace (8) and page break (12). There are two exceptions: (i) Unicode text contains zero bytes, which are accepted, and (ii) up to 0.1% of the bytes (other than zero bytes in Unicode text files) are allowed to be "anomalous bytes", that is, bytes with values less than 32 that are not among the whitespace characters listed. The second exception exists because a large text file will occasionally, for some reason or another, contain a few anomalous bytes, and these should not prevent the program from treating the file as a text file.
- This program does not work correctly with UTF-8 text files which have non-English letters. These should be read using WordPad and resaved as Unicode text files.
- The input file would typically consist of natural language text (English, German, Spanish, etc.), but need not; it can consist of program code (e.g., a C++ source file) or can be an HTML or an XML document.
- Files containing non-displayable characters, such as documents written with MS-Word and Adobe Acrobat, cannot be processed by reading the file directly. For files such as these either (a) save the file as a standard ASCII text file and apply this software to that file or (b) open the document in Word (or whatever is the appropriate application), select all the text and copy it to the clipboard, then Count word frequencies with clipboard selected as the source. (The text on the clipboard is limited to 100,000 characters, so for large files (a) must be used, if possible.) The text on the clipboard can be pasted into the textbox before (or after) the words are counted, but need not be. When clipboard is selected as the source the program counts the words in the text on the clipboard, not the words in the text in the textbox.
- To repeat (so as to make this clear): The program does not count words in the textbox, only words either in a specified input file or in text on the clipboard. You may compose text in the textbox, but to do a word count on this you must first copy it to the clipboard. That's one reason there is a Copy to clipboard button (which is available only after the software has been activated).
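- The byte-value rule for scannable files described above can be sketched in Python as follows. This is only an illustration of the rule as stated, not the program's actual code: the function name, the `is_unicode16` flag, and the choice to exclude zero bytes of Unicode text from the 0.1% denominator are all assumptions.

```python
# Hypothetical helper illustrating the "almost entirely text" rule.
# Byte values the description treats as whitespace:
# backspace (8), tab (9), linefeed (10), page break (12), carriage return (13).
WHITESPACE = {8, 9, 10, 12, 13}


def looks_like_text(data: bytes, is_unicode16: bool = False,
                    tolerance: float = 0.001) -> bool:
    """Return True if at most `tolerance` of the bytes are anomalous."""
    anomalous = 0
    counted = 0
    for b in data:
        if b == 0 and is_unicode16:
            continue  # zero bytes are expected in 16-bit Unicode text
        counted += 1
        if b < 32 and b not in WHITESPACE:
            anomalous += 1  # control byte that is not whitespace
    if counted == 0:
        return True  # assumption: an empty file counts as text
    return anomalous <= tolerance * counted
```

- Under these assumptions a 1,001-byte file with a single anomalous byte passes, while a 999-byte file with one anomalous byte does not, because one byte is then more than 0.1% of the total.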
- Setting the Operation Parameters
- The concept of counting words may seem simple, but is not. What is a word? Is double-click one word or two? Is don't a word? Is cat the same word as Cat? Do you want to count all words, including common words such as this, with and him? This program allows you to customize its operation so that only the words in which you are interested are counted, and, as noted above, words may (if you wish) include hyphens, apostrophes, etc.
- The Reinitialize button sets all parameters back to the way they were when the program was first run. If any problem develops in the functioning of the software then reinitializing the program might fix it.
- If you wish to treat an email address as a word then check the boxes for at-signs, periods, hyphens and underscores. If you wish to treat a URL as a word then check the boxes for colons, forward slashes, periods, hyphens and numerals. (Note that if a word may contain a forward slash then a double forward slash cannot be used as a start-comment marker. The software checks for conflicts such as this.)
- Parameters set using this screen may be saved at any time (using the Save state button on the main screen) so as to be restored on the next run.
- You can also save a set of parameters to a parameter file (which must have extension .wfc), and reload it later. This allows you to keep several different parameter sets at hand for working with different kinds of files (e.g., text in different languages).
- A word cannot begin with a numeral, a hyphen, an apostrophe or a colon, but may begin with an underscore (_).
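- To make the extended meaning of 'word' concrete, here is a minimal Python sketch of a tokenizer with optional word characters. It is an illustration only: the option names are hypothetical, `[^\W\d_]` matches any Unicode letter (broader than the European letters the program accepts), and stripping forbidden leading characters (rather than discarding the whole token) is an assumption about behavior the description does not specify.

```python
import re

# Optional characters that may be allowed inside a word (hypothetical names).
OPTIONAL = {
    "hyphens": "-", "numerals": "0123456789", "underscores": "_",
    "colons": ":", "periods": ".", "apostrophes": "'", "at_signs": "@",
    "forward_slashes": "/", "backslashes": "\\",
}
LETTER = r"[^\W\d_]"          # any Unicode letter
BAD_START = "0123456789-':"   # a word cannot begin with these characters


def find_words(text, *enabled):
    """Return the words in `text`, allowing the enabled optional characters."""
    extra = "".join(re.escape(ch) for name in enabled for ch in OPTIONAL[name])
    unit = rf"(?:{LETTER}|[{extra}])" if extra else LETTER
    words = []
    for match in re.finditer(rf"{unit}+", text):
        word = match.group().lstrip(BAD_START)  # assumption: trim bad starts
        if word:
            words.append(word)
    return words
```

- With no options enabled, `find_words("Is double-click one word or two?")` treats double and click as separate words; with `"hyphens"` enabled, double-click is a single word. Enabling `"at_signs"`, `"periods"`, `"hyphens"` and `"underscores"` keeps an email address together as one word.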
- Non-English Text
- Hermetic Word Frequency Counter may be used with text in languages other than English, including German, French, Italian, Spanish and Portuguese — in fact, any language with characters that can be encoded in WinLatin1 a.k.a. Windows 1252. The program also works with text encoded using Unicode. (As noted above, this program does not work correctly with UTF-8 text files which have non-English letters. These should be read using WordPad and resaved as Unicode text files.)
- The option for dropping a final 's' unless it is preceded by an 's' or a vowel is intended to allow the conflation of single and plural nouns in English (e.g., 'dog' and 'dogs'). This option also helps to conflate German nouns with their genitives, e.g., 'Bewußtsein' and 'Bewußtseins'. But this option may have unintended consequences, so it is better to leave it unchecked unless results of a scan suggest that it should be used.
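- The final-'s' rule can be sketched in a few lines of Python. This is a hedged illustration of the rule as worded above; the function name is hypothetical, and treating only a, e, i, o and u as vowels is an assumption.

```python
VOWELS = set("aeiou")  # assumption: 'y' and accented letters are not vowels


def drop_final_s(word):
    """Drop a final 's' unless it is preceded by an 's' or a vowel."""
    if len(word) > 1 and word[-1] in "sS":
        previous = word[-2].lower()
        if previous != "s" and previous not in VOWELS:
            return word[:-1]
    return word
```

- Here 'dogs' becomes 'dog' and 'Bewußtseins' becomes 'Bewußtsein', while 'glass' and 'bus' are left alone. The unintended consequences mentioned above are easy to reproduce: 'perhaps' becomes 'perhap'.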
- Rank and Frequency Display
- The 'rank' and 'frequency' values may each be included in, or excluded from, the displayed results.
- If the output file consists only of words, with no rank or frequency values, then you can get these either as a list (one word per line) or as comma-separated. This is done by making the appropriate selection in the Display format drop-down menu.
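- As a rough illustration of the display options, the following Python sketch assigns ranks by descending frequency and renders the result with or without the rank and frequency columns. The function names, the tab-separated layout, and the alphabetical tie-breaking between words of equal frequency are assumptions, not the program's documented behavior.

```python
from collections import Counter


def ranked(counts):
    """Return (rank, word, frequency) tuples, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def render(counts, order="frequency", show_rank=True, show_freq=True,
           comma_separated=False):
    """Render the word list in the chosen order and format."""
    rows = ranked(counts)
    if order == "alphabetical":
        rows = sorted(rows, key=lambda row: row[1])
    if not show_rank and not show_freq:
        # Words only: one per line, or comma-separated.
        words = [word for _, word, _ in rows]
        return ", ".join(words) if comma_separated else "\n".join(words)
    lines = []
    for rank, word, freq in rows:
        fields = [word]
        if show_rank:
            fields.append(str(rank))
        if show_freq:
            fields.append(str(freq))
        lines.append("\t".join(fields))
    return "\n".join(lines)
```

- For example, `render(Counter("the cat and the dog and the bird".split()), show_rank=False, show_freq=False, comma_separated=True)` gives the words in frequency order on a single comma-separated line.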
- Ignoring Common Words
- You can tell the program to ignore common words, such as 'the', 'and', etc. These words are contained in a file of your choice. When this file has been specified and Ignore common words in file is checked the program will ignore any words in the text which it finds in the specified file.
- If just a few special words are to be ignored then they can be specified in the Ignore these words textbox, as shown above.
- Six files are provided containing common words in English (cwds_en.txt), German (cwds_de.txt), French (cwds_fr.txt), Italian (cwds_it.txt), Spanish (cwds_es.txt) and Portuguese (cwds_pt.txt). These files are in the folder containing the program files (created during program installation), and there is a download link in the Windows Explorer program menu after installation. You can add or remove words as you wish, and words do not have to be in alphabetical order or on separate lines (but the file must consist only of text).
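- The effect of a common-words file and the Ignore these words textbox can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, words in the file are assumed to be separated by whitespace or commas, and the comparison is assumed to be case-insensitive.

```python
import re
from collections import Counter


def load_common_words(text):
    """Parse common-words file contents; order and line breaks do not matter."""
    return {word.lower() for word in re.split(r"[\s,]+", text) if word}


def count_words(words, common=frozenset(), also_ignore=()):
    """Count words, skipping those in `common` or in `also_ignore`."""
    skip = set(common) | {word.lower() for word in also_ignore}
    return Counter(word for word in words if word.lower() not in skip)
```

- For instance, with `common = load_common_words("the and\nof, a")`, counting the words of "The cat and the hat" leaves only cat and hat; passing `also_ignore=["hat"]` removes hat as well.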
- Embedded Comments
- An input file may contain "comments", which are any parts which are to be skipped over when counting words. The beginning of a comment is marked by start-comment characters specified in the Set parameters screen, and the end is marked by end-comment characters. If the end-comment marker is empty (blank) then the comment ends at the end of the line.
- The use of start-comment and end-comment markers also makes it possible to exclude sections of the input file from the word-counting process.
- It is possible to specify two pairs of start-comment and end-comment markers. This allows both single-line comments and multi-line comments in the same input file.
- If the input file has one of the extensions htm, html, shtml, xml or php, then the start-comment and the end-comment markers are automatically set to < and > respectively. This means that HTML, XML and PHP tags are ignored. It also means that /* and */, or other markers, cannot be used as start-comment and end-comment markers in files with these extensions. If the start-comment and end-comment markers were set to something else then the original settings are restored after a file of one of these file types has been processed.
- C-style comment markers (/* ... */) can be used in the files of common words to temporarily disable sections of those files (so that the words within those sections are not treated as common words, but are counted when they occur within the input file).
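- The comment-skipping behavior described in this section can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions: an unterminated comment is skipped to the end of the text, and filename extensions are compared case-insensitively.

```python
import os

MARKUP_EXTENSIONS = {".htm", ".html", ".shtml", ".xml", ".php"}


def markers_for(filename, user_markers):
    """Use < and > for markup files, otherwise the user's marker pairs."""
    extension = os.path.splitext(filename)[1].lower()
    return [("<", ">")] if extension in MARKUP_EXTENSIONS else list(user_markers)


def strip_comments(text, markers):
    """Remove comments; `markers` is a list of (start, end) pairs.

    An empty end marker means the comment runs to the end of the line.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        for start, end in markers:
            if start and text.startswith(start, i):
                if end:
                    j = text.find(end, i + len(start))
                    i = n if j == -1 else j + len(end)
                else:
                    j = text.find("\n", i)
                    i = n if j == -1 else j  # keep the newline itself
                out.append(" ")  # keep neighbouring words apart
                break
        else:
            out.append(text[i])  # ordinary character, not inside a comment
            i += 1
    return "".join(out)
```

- With the two pairs `[("/*", "*/"), ("//", "")]`, the text `a /* b */ c // d` followed by a second line `e` yields the words a, c and e. With the pair `("<", ">")`, the text `<p>Hello <b>world</b></p>` yields Hello and world.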
- Input File Size & Output to a File
- There is no limit on the size of an input file. The program has been tested with text files up to 1 MB in size, and with files containing over 11,000 different words. In such cases processing of the text may take some time, and for these cases a progress bar is provided.
- There is, however, a limit on the amount of text which can be held in the output textbox, either by pasting from the clipboard or as a result of listing words found. This does not prevent Hermetic Word Frequency Counter from being able to handle large files. For example, there may be a file on your PC named Win32api.txt. This is about 652 KB in size and has over 80,000 instances of about 11,000 different words. When the program is run on this file, with the Don't display words as found option unchecked, words found as the program goes through the file will be displayed until 2000 words have been displayed, at which point further words are not displayed so as to avoid a buffer overflow. After the entire file has been processed, the words found will be listed until the capacity of the output textbox buffer is reached. If the words are listed in alphabetical order then (in the case of Win32api.txt) only words beginning with a, b, c or d are listed.
- In order to obtain a complete listing of the words in this file you have to specify an output file before starting the word count. In this case the complete listing is written to the output file before the listing is given in the output textbox. The displayed listing will still stop with words beginning with d, but the entire listing can be viewed by opening the output file in some text editor such as WordPad.
- Hermetic Word Frequency Counter has been used successfully with large files with many different words. In one case the file was 4.12 MB in size with 46,398 different words, and in another it was 12.1 MB with 61,979 different words (and a total of 1,847,893 instances of these words).
- Transfer of Results to an Excel Spreadsheet
- The output can easily be transferred to an Excel spreadsheet as follows: If the output has not already been written to an output file then copy the output to the clipboard, paste it into some text editor such as Notepad, and save it as a .txt file. Load this into Excel, which will automatically detect the columns.
- If you specify an output file then the results will be written to that file. In the Set parameters panel you can specify that the output should be written as comma-delimited, so that the file can be read by some statistical programs that (unlike Excel) cannot detect fixed-width fields.
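- For readers who post-process results outside the program, a comma-delimited listing of this general kind can be produced or checked with Python's standard csv module. This is a sketch only: the function name and the header row (Word, Rank, Frequency) are assumptions and do not describe the program's exact output layout.

```python
import csv
import io


def to_csv(rows):
    """Return comma-delimited text for (word, rank, frequency) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Word", "Rank", "Frequency"])  # hypothetical header row
    writer.writerows(rows)
    return buffer.getvalue()
```

- Calling `to_csv([("the", 1, 3), ("and", 2, 2)])` produces one header line followed by one line per word, which spreadsheet and statistical programs can read directly.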
- Counting Words in One File Which are Not in Another
- It is unlikely that many people will wish to make use of this possibility, but it is available.
- To get a list of all the non-common words (subject to the parameter settings) which occur in one file (say, File B) but do not occur in another file (say, File A), proceed as follows:
- Specify File A as the input file.
- In the Parameters panel select (if not already selected) the usual common words file (for English, cwds_en.txt).
- Set word order to 'Alphabetical' and display format to 'Word'.
- Specify an output file (say File C) with a .txt extension.
- Count word frequencies (this creates File C).
- Open File C in a text editor and delete the first line, "Word (longest has n characters)".
- Open the common words file, delete the four comment lines at the top, select and copy the remaining text, paste it at the end of the words in File C, and save File C under the same name.
- Back at the program specify File B as the input file.
- In the Parameters panel select File C as the common words file.
- Specify an output file (say File D) with a .txt extension.
- Count word frequencies.
- File D will then contain all words (except for the usual common words) which occur in File B but which do not occur in File A.
- File C can then be used with further files B2, B3, etc., to find all non-common words which occur in File B2 but not in File A, in File B3 but not in File A, and so on.
- Note that the results must be seen in the context of the parameter settings. For example, if hyphens are permitted in words then "comma" may appear in File D (indicating that it occurs in File B but not in File A) even though "comma-separated" occurs in File A. This is correct provided that "comma" occurs by itself (i.e., other than in a hyphenated word) in File B but does not occur by itself in File A.
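- The logic of the procedure above amounts to a set difference: File C is the union of the words in File A and the usual common words, and File D counts the words of File B that are not in File C. A minimal Python sketch, with a hypothetical function name and with the words assumed to have been extracted already under the same parameter settings:

```python
from collections import Counter


def words_only_in_b(words_a, words_b, common=frozenset()):
    """Count words of File B that are neither in File A nor common words."""
    exclude = set(words_a) | set(common)  # plays the role of File C
    return Counter(word for word in words_b if word not in exclude)  # File D
```

- For example, if File A contains the words the, comma-separated and list (with hyphens permitted in words), File B contains the, comma, list, of and values, and the common words are the and of, the result contains comma and values. This matches the note above: comma is reported because it never occurs by itself in File A.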
- Improvements to demo version.
- The minimum screen resolution required is 800x600, but 1024x768 is recommended.