Distinct Substrings

Here is an approach to find the number of distinct substrings of a given string:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def distinctSubstrings(s):
  substrings = set()

  # Generate all substrings
  for i in range(len(s)):
    for j in range(i+1, len(s)+1):
      sub = s[i:j]
      substrings.add(sub)

  return len(substrings)

The key steps are:

  1. Iterate through every start index i
  2. Generate substrings from i to all end indices j
  3. Add the substring to a set to avoid duplicates

This generates all substrings by sliding a window from size 1 up to the full string length. Using a set eliminates repeats.

The time complexity is O(N^2) since we loop from i = 0 to N and j = i+1 to N for each i.

Space complexity is O(N^2) as well since in worst case all substrings are distinct so we store N^2 strings.

To optimize, we could stop early if set size reaches N^2 since we know all substrings must be unique.

Here is an implementation in Java to find the number of distinct substrings:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
class Solution {
  public int distinctSubstrings(String s) {
    Set<String> substrings = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
      for (int j = i+1; j <= s.length(); j++) {
        substrings.add(s.substring(i, j)); 
      }
    }

    return substrings.size();
  }
}

And here is the implementation in C++:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
#include <string>
#include <unordered_set>

using namespace std;

int distinctSubstrings(string s) {

  unordered_set<string> substrings;

  for (int i = 0; i < s.length(); i++) {
    for (int j = i+1; j <= s.length(); j++) {
      substrings.insert(s.substr(i, j-i)); 
    }
  }

  return substrings.size();
}

The key aspects remain the same - generate all substrings by sliding a window, use a hash set/unordered_set to remove duplicates, and return the final set size.

Overcoming TLE

There are a few ways we can optimize the brute force approach to avoid exceeding time limits:

  1. Memoization

Store results of substrings in a hashmap. Before generating a substring, check if we’ve seen it before and skip re-adding to the set if found. This avoids recalculating duplicate substrings.

  1. Output just count, not full set

Rather than building the full set of substrings, just increment a counter whenever we generate a new substring. Return the final count, without keeping the full set in memory.

  1. Prefix/Suffix hashing

Use a hashmap to store prefixes and suffixes of the string. This allows quick lookup to rule out substrings that couldn’t possibly be unique or new based on matching existing prefixes/suffixes.

  1. Loop optimizations

Avoid nesting the entire inner loop over all substring lengths at each index. For example, increment inner end point j in chunks of size k to reduce number of iterations.

  1. Parallel processing

Process different parts of the string in parallel as threads/processes to divide-and-conquer the workload.

Combining techniques like memoization, better output representation, loop optimizations, and parallelism can help overcome the exponential time complexity in many substring problems.

We can optimize the substring problem further using Ukkonen’s algorithm:

The key ideas in Ukkonen’s algorithm are:

  • Build suffix tree for the string

  • Each suffix tree node represents a substring

  • Leaf nodes are suffixes, internal nodes are repeated subsequences

  • Traversing the suffix tree visits each substring exactly once

This allows generating all substrings in linear time O(n) by constructing and traversing the suffix tree, rather than the naive O(n^2) approach.

Here is an implementation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def distinctSubstrings(s):

  tree = SuffixTree(s) 
  substrings = []

  # Traverse tree depth-first
  def traverse(node):
    if node.is_leaf:
      substrings.append(node.substring)  
    for child in node.children:
      traverse(child)

  traverse(tree.root)
  
  return len(set(substrings))

By leveraging the suffix tree structure, we can greatly improve efficiency by avoiding generating duplicate and overlapping substrings in the naive approach.