Hi, for a school project (an exam, actually), we've written a ternary trie for text completion.
We fill it with about 400,000 lines of text (13 MB in total, as UTF-8).
Now, Java represents characters in UTF-16, I believe.
So that's 26 MB, plus another 26 MB for saving the complete strings in the "end" nodes.
So 52 MB, and then there will be some overhead, but I measure the consumption to be at least 500 MB.
The measurement is made using the free command (on GNU/Linux) before and while running the main method.
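
The doubling estimate above can be checked on small strings. This is a minimal sketch, not part of the trie code: the class name EncodingSize and both helpers are made up for illustration, and it assumes strings backed by a char array at two bytes per UTF-16 code unit (newer JVMs with compact strings may store Latin-1 text more tightly). It counts character payload only and ignores per-object overhead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the TernaryTrie code.
final class EncodingSize {

    /** Bytes the text occupies when stored as UTF-8, as in the input file. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes of char payload at two bytes per UTF-16 code unit (no object overhead). */
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "trie";
        String nonAscii = "\u00e6\u00f8\u00e5"; // three letters outside ASCII
        System.out.println("ASCII:     utf8=" + utf8Bytes(ascii) + " utf16=" + utf16Bytes(ascii));
        System.out.println("non-ASCII: utf8=" + utf8Bytes(nonAscii) + " utf16=" + utf16Bytes(nonAscii));
    }
}
```

For pure ASCII text the UTF-16 payload is exactly twice the UTF-8 size; text with many non-ASCII letters grows by less, so 26 MB is an upper bound for the characters alone.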

What am I missing?
I doubt they will ask us at the exam (in about two hours), but I'm curious and really annoyed about it.

Here is the code:

//The code in this file is based on the code from http://algs4.cs.princeton.edu/52trie/TST.java.html
package View;

import java.util.Queue;
import java.util.LinkedList;
import java.util.Scanner;
import java.io.File;
import java.io.IOException;
import java.io.FileNotFoundException;

public class TernaryTrie {

    Node root;

    /**
     * Build a trie from the standard path
     */
    public TernaryTrie() {
        this(new File("TrieData.txt"));
    }

    /**
     * Build the trie from the file denoted by the given string path
     * @param String path to file
     */
    public TernaryTrie(String fp) {
        this(new File(fp));
    }

    /**
     * Build the trie from a file of "key;value" lines
     * @param f the file to build the trie from
     */
    public TernaryTrie(File f) {
        Scanner fileScan = null;
        try {
            fileScan = new Scanner(f, "UTF-8");
            while (fileScan.hasNextLine()) {
                String[] name = fileScan.nextLine().split(";");
                if (name.length < 2 || name[0].isEmpty()) continue; // skip malformed lines
                put(name[0], name[1]);
            }
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        } finally {
            // fileScan is still null if the file was not found
            if (fileScan != null) fileScan.close();
        }
    }

    /**
     * Retrieve the value corresponding to a given string.
     * If the given string cannot be found, null is returned.
     * @param s string to search for (exact match)
     * @return the stored value, or null
     */
    public String get(String s) {
        if (s.isEmpty()) return null; // charAt(0) below would throw on an empty key
        Node n = get(root, s, 0);
        if (n == null) return null;
        return n.val;
    }

    private Node get(Node node, String s, int d) {
        if (node == null) return null;
        char c = s.charAt(d);
        if  (c < node.c) return get(node.l, s, d);
        else if (c > node.c) return get(node.r, s, d);
        else if (d < s.length() -1) return get(node.m, s, d+1);
        else return node;
    }

    /**
     * insert a key-value pair into the trie
     * @param String key
     * @param String value
     */
    public void put(String s, String v) {
        root = put(root, s.toLowerCase(), v, 0 );
    }

    private Node put(Node node, String s, String v, int d) {
        char c = s.charAt(d);
        if (node == null) node = new Node(c);
        if  (c < node.c) node.l = put(node.l, s, v, d);
        else if (c > node.c) node.r = put(node.r, s, v, d);
        else if (d < s.length() -1) node.m = put(node.m, s, v, d+1);
        else node.val = v;

        return node;
    }

    /**
     * Get the keys that are prefixed with a given String
     * @param pre prefix
     * @return the matching keys; empty if the prefix is empty or unknown
     */
    public Iterable<String> startsWith(String pre) {
        Queue<String> q = new LinkedList<String>();
        if (pre.length()==0) return q;  // empty result instead of null, so callers can iterate safely
        Node n = get(root, pre, 0);
        if (n == null) return q;
        if (n.val != null) q.offer(pre);
        collect(n.m , pre, q);
        return q;
    }

    private void collect(Node node, String pre, Queue<String> q){
        if (node == null) return;
        collect(node.l, pre, q);    //swapped with next line to arrange output lexicographically (well, sort of)
        if (node.val != null) q.offer(pre + node.c);
        collect(node.m, pre + node.c, q);
        collect(node.r, pre, q);
    }

    /**
     * Prints a 60 character wide chart of the characters with code points 0 through 600
     */
    public void testChars() {
        for (int i=0; i<=600; i++){
            if(i%60==0) System.out.println();
            System.out.print((char)i + " ");
        }       
    }


    //helper
    private String cut(String s) {
        return s.substring(0, s.length()-1);
    }

    //private helper class
    private class Node{
        char c;
        String val;
        Node l, m, r;

        public Node(char c) {
            this.c = c;
        }

        public Node() {}
    }   

    /**
     * Mainly for testing purposes
     */
    public static void main (String[] args) {
        if (args.length < 1) {
            System.out.println("yeah, giving me no parameters is the way to go!\n" +
                    "What am I, a freaking guessing machine?");
            return;
        }

        System.out.println("Loading " + args[0] + ", please wait.");
        TernaryTrie tt = new TernaryTrie(args[0]);
        Scanner inputScan = new Scanner(System.in);
        while (true) {
            System.out.println("Waiting for a search-prefix...");
            if (!inputScan.hasNextLine()) break;    // stop cleanly at end of input
            String q = inputScan.nextLine().toLowerCase();
            if (q.isEmpty()) continue;              // nothing to search for

            if (q.startsWith("g:")) {
                q = q.substring(2).trim();          // accept both "g:key" and "g: key"
                System.out.println("Search result for " + q + ":");
                System.out.println(q.isEmpty() ? null : tt.get(q));
            } else {
                System.out.println("Entries beginning with " + q + ":");
                for (String r : tt.startsWith(q)) {
                    System.out.println(r);
                }
            }
            System.out.println();
        }
    }
}

Edited 4 Years Ago by Bladtman242: Additional info

IIRC, the Scanner class is notorious for consuming an insane amount of memory due to its regex-based implementation and other implementation details. Try replacing it with a simple BufferedReader when reading the file contents.

While that appears to have helped, it still uses about 300 MB.
Maybe I'm confused about the measurements, though.
I use Linux free and vmstat to measure used memory.

I changed the reading to the following:

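        public TernaryTrie(File f) {
                  // Needs java.io.BufferedReader, FileInputStream and InputStreamReader imported.
                  // try-with-resources (Java 7+) closes the reader even if put() throws;
                  // the explicit charset keeps the UTF-8 decoding of the Scanner version.
                  try (BufferedReader br = new BufferedReader(
                                  new InputStreamReader(new FileInputStream(f), "UTF-8"))) {
                          String r = br.readLine();
                          while (r != null) {
                                  String[] name = r.split(";");
                                  put(name[0], name[1]);
                                  r = br.readLine();
                          }
                  } catch (IOException e) {
                          throw new RuntimeException(e);
                  }
        }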

The free command is, IMO, not a good way to measure total memory in use, since its figure also includes the "garbage" currently held by the VM before GC. Plus, if no max or min heap sizes are provided, the JVM will reserve a certain amount of memory in advance, which depends on the architecture/OS. Also notice that it's not just 400K strings but also the supporting data structures, like the Node instances, which are created during the lifetime of the application. If you want to be really sure, put a limit on the maximum heap used by your application with the -Xmx JVM switch; initially set it to 100 MiB (-Xmx100m).

For proper profiling you need to use either the management extensions of Java (the JMX memory bean) or an external tool like VisualVM, which exists inside the JDK bin directory starting with Java 5+.
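
As a rough illustration of the JMX route, here is a minimal sketch that reads heap usage from inside the running JVM through the standard MemoryMXBean. The class name HeapProbe and the stand-in workload are made up for this example; in the thread's program the workload would be the TernaryTrie constructor call. The gc() call is only a request to the JVM, so the numbers are approximate.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of the original TernaryTrie code.
final class HeapProbe {

    /** Heap bytes currently in use, after asking (not forcing) the JVM to collect garbage. */
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // a request, effectively the same as System.gc()
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        // Stand-in workload; in the thread this would be: new TernaryTrie(args[0])
        char[][] data = new char[100_000][16];
        long after = usedHeapBytes();
        System.out.println("Approx. heap retained: " + (after - before) / 1024 + " KiB");
        System.out.println("Rows kept alive: " + data.length);
    }
}
```

Unlike free, this counts only the Java heap and runs after a collection request, so garbage that has not been collected yet and memory the JVM has merely reserved distort the figure much less.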

Ooh, sorry I haven't answered; I was waiting for a notification email, but I guess I forgot to subscribe.
The lowest heap space it will run with is 270 MB. That's a lot better than the 500 I assumed, but it seems like a lot of overhead for ca. 50 MB of data?
Are those profilers part of the JDK?

Yes, as already mentioned, they can be found in the bin directory of your JDK. Plus, let's not forget the overhead for each object, the native libraries linked to the JVM executable, the thousands of .class files loaded etc.

I couldn't find it. Turns out it wasn't included in my jdk.
Anywho, according to VisualVM the instances of the inner class Node take up about 160 MB, so I guess 200 MB is reasonable.
I just didn't imagine the objects and object references would be that significant.
Does it load a .class file per instance or per class?
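
The Node figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below is only arithmetic under stated assumptions: a 64-bit HotSpot-style layout with either a 12-byte header and 4-byte references (compressed references) or a 16-byte header and 8-byte references, and 8-byte object alignment. Real layouts vary by JVM and settings. The class name NodeSizeEstimate and its methods are made up for illustration. Note that Node is a non-static inner class, so each instance also carries a hidden reference to the enclosing TernaryTrie.

```java
// Hypothetical estimator; the header and reference sizes are assumptions, not measurements.
final class NodeSizeEstimate {

    /** Round bytes up to the next multiple of alignment. */
    static long alignUp(long bytes, long alignment) {
        return (bytes + alignment - 1) / alignment * alignment;
    }

    /** Estimated size of one Node: header + char c + references, padded to 8 bytes. */
    static long nodeBytes(int headerBytes, int refBytes, boolean innerClass) {
        int refs = 4 + (innerClass ? 1 : 0); // val, l, m, r (+ hidden outer reference)
        long raw = headerBytes + 2 /* char c */ + (long) refs * refBytes;
        return alignUp(raw, 8);
    }

    public static void main(String[] args) {
        long reported = 160L * 1024 * 1024; // the figure reported for Node instances
        long compressed = nodeBytes(12, 4, true);
        long uncompressed = nodeBytes(16, 8, true);
        System.out.println("Bytes per node (compressed refs):   " + compressed);
        System.out.println("Bytes per node (uncompressed refs): " + uncompressed);
        System.out.println("Nodes implied by 160 MiB: " + reported / compressed
                + " to " + reported / uncompressed);
    }
}
```

Under these assumptions every node costs tens of bytes to hold a single two-byte character, which is where most of the overhead comes from. Declaring Node as a static nested class would drop the hidden outer reference, which can shrink each node by one alignment step.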

I couldn't find it. Turns out it wasn't included in my jdk.

If you are using JDK 1.5+, the visualvm binary should be present in the bin directory; not sure why it isn't in your case.

Does it load a .class file per instance or per class?

.class files are loaded once per classloader. Unless you have a web application, in normal cases all .class files will be loaded only once. But once you start loading around a thousand classes, each with dependencies of its own, it doesn't take time for memory consumed to build up.
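
To see how many classes a run actually loads, the standard ClassLoadingMXBean can be queried from inside the JVM. This is a minimal sketch; the class name LoadedClassCount and the loop are made up for illustration, and the exact counts depend on the JVM and on what else happens to be loaded in between.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the TernaryTrie code.
final class LoadedClassCount {

    /** Number of classes currently loaded in this JVM. */
    static int loadedNow() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        int before = loadedNow();
        // Many instances of one class: the class itself is loaded at most once.
        List<StringBuilder> instances = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            instances.add(new StringBuilder());
        }
        int after = loadedNow();
        System.out.println("Instances created: " + instances.size());
        System.out.println("Classes loaded before: " + before + ", after: " + after);
    }
}
```

The loaded-class count is expected to stay roughly flat while the instance count grows, since classes are loaded per classloader and not per instance.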

Edited 4 Years Ago by ~s.o.s~

If you are using JDK 1.5+, the visualvm binary should be present in the bin directory; not sure why it isn't in your case.

I'm running openjdk, guess I could have mentioned that :)

it doesn't take time for memory consumed to build up.

Evidently not. Well, many thanks for your help.
It would have taken me ages to find out on my own.

Edited 4 Years Ago by Bladtman242
