Implementing Code-Completion for VS Code with ANTLR

Fernando
Fernando

How do you implement code-completion (aka "intellisense") for Visual Studio Code? and if you have an ANTLR grammar for your language: Can you take advantage it for code-completion?

Below I share what I've learnt implementing code-completion (on VS Code) for SOQL (Salesforce's Object Query Language), a SQL-like language.

Main Challenges

I'll break the problem down into three main issues to solve:

  • How to implement the "language intelligence" to detect which kind of data the user might need at the current cursor position, to offer useful completion items
  • How to write a VS Code "plug-in" to add code-completion for a new language?
  • How to obtain the external data items (say, "table" names) from the users's environment?

The language "intelligence" for code-completion

The crux of the problem: How to understand the current code and, based on the position of the cursor, propose useful completion text to the user?

The basic idea is to parse the code as the user types, and figure out which are the possible language keywords or syntactic elements allowed at the current cursor position. For example, imagine the user is typing the following SQL snippet (imagine the | character representing the cursor position):

SELECT Id |

a that point, we want to propose the user with the FROM keyword, and maybe , (a comma) to add more columns to the SELECT clause. After completing with FROM, we want to propose the user with the list of table names to select from:

SELECT Id FROM |

In our case, we had an ANTLR grammar for the SOQL language. Can we leverage it to identify which kind of elements to propose for code-completion?

The generated ANTLR parser can give us information about the next expected tokens. This is useful to propose keywords like FROM on the first example above. However, in those examples, a typical grammar could expect a token of type IDENTIFIER for both column names and table names. Knowing the possible token types is not enough.

To implement useful code-completion, we want to know which grammar rules could be valid at the current cursor position. If we had access to the grammar rules at runtime, we could start at the top-level rule and "walk down" the tree to identify possible active rules at the cursor position.

So, what we need is a model of the grammar rules in memory, as data. The problem is that parser generators spit out the code of a parser that recognizes the language of the given grammar, but the parsing rules are "lost" in the process.

I found this great article: Universal Code-Completion using ANTLR3 in which the author explains the limitations of using a generated ANTLR3 parser, and how they've solved the problem by parsing the ANTLR grammar itself into memory. The articles goes into great details of their implemention.

Luckily, with version 4, ANTLR now exposes an "ATN" (Augmented Transition Network) generated from the grammar, and makes is available on the generated parsers. From another article on Building autocompletion for an editor based on ANTLR :

... ATN is an internal structure used by the ANTLR parser. It represents an automata that can change state through epsilon transitions or when a certain token is received. We will start from the initial state and process the tokens we have in front of us until we reach the special token representing the caret.

So, ATN is a fancy name for a set of inter-related state machines to recognize a language. A "network" of finite state machines (FSM). We start the parsing process at the initial state (of the top-level grammar rule), and as we consume each token from the input, the ATN tells us which transition(s) can be followed to the next state(s) (or sub-state machines). "Epsilon" be followed without consuming any token.

Here's a (simplified) representation of a piece of the ATN from our SOQL ANTLR grammar. The red dotted line showing the path through the FSMs states when parsing "SELECT Id FROM ". ANTLR generates a FSM for each rule in your grammar, here we see how from the higher-level rule (first row) we jump to the FSM of FromClause (second row) and then FromExpr (third row):

ANT Example

(the FSM for rule SelectClause is not shown).

NOTE: ANTLR generates these FSM diagrams for you. It generates a separate one for each grammar rule. The picture above is the result of manually stitching together (and simplifying) three of the generated diagrams.

Enter ANTLR C3

Good news! Mike Lischke, author of the great article I mentioned above about completion with ANTLR v3, later wrote antlr4-c3 (ANTLR Code Completion Core): a TypeScript library based on the ideas presented on his article, but this time taking advantage of ANTLR v4's ATNs.

I've played with it and got a working prototype pretty quickly. We later decided to build upon that initial prototype for the "real" implementation. ANTLR C3 simplified our solution and allowed us to provide accurate completion candidates driven by the grammar.

The library is well documented. Once initialized, you invoke collectCandidates(tokenIndex) to obtain the list of tokens and grammar rules that apply at the current token index position.

The biggest challenge for me was how to properly implement a function to convert the current cursor position (charanter position in the text) to the position of the next expected token (on the input token stream) to pass to collectCandidates(tokenIndex). The documentation explains why this is not a trivial problem. One important aspect to consider is whether the grammar retains whitespace on its output tokenstream: if it doesn't, you'd better change it or else you won't have enough information to resolve some cases. Here are some examples of what I mean:

  • SELECT name| FROM xyz: Cursor touching the previous identifier. We want to continue completing that prior token position, in this example, column names. It should propose all column names that start with name (actually, on VS Code, you can return all column names and it will only show the user those starting with name)
  • SELECT name |FROM xyz: Cursor NOT touching the previous identifier token. We want to complete with what can come next after the previous token
  • SELECT name | FROM xyz : Cursor surrounded by spaces: Similar to the previous case. However, C3 expects the index of a token that is NOT whitespace, so here we must return the index of the next token after the whitespace.
  • SELECT name | FROM xyz : Cursor within a block of multiple spaces. Not exactly the same as the previous case: You must take into account how your grammar treats multiple contiguous spaces: does it return a single WHITESPACE token to the stream, or multiple ones?

You will likely need to adapt your grammar to make good use of it for completion. And you will surely need to write special code to handle quirks and limitations in your grammar, for example, to account for semantic rules not captured by the grammar. For instance, the SOQL grammar accepts arbitrary levels of nesting of SELECT statements, but the language actually supports only 2 levels.

You can see the implementation and usage of our findCursorTokenIndex() function here.

Custom Parser Error Recovery Strategy

So far we've been assuming valid input from the user, and ignoring the code that is at the right of the cursor. For many real situations, we have to parse invalid input, even after the cursor position. This is particularly tricky on SQL, which is arguably written in reverse: to know which are the possible columns names on the SELECT clause, we must first know what table to select FROM.

For example, on the snippet below we want to propose the user with columns of table Account:

SELECT Id, | FROM Account

The challenge here is that to extract the table name off the FROM clause, we must parse the whole text, which is currently invalid for our language: the SELECT clause is incomplete; it doesn't parse.

ANTLR has sophisticated error-recovery logic, which is great for reporting meaningful errors while parsing. It tries to insert fake tokens or delete unrecognized tokens from the input until it finds one that fits the expected token set.

In some cases, however, ANTLR's default error recovery might work against our goal. On the example above, at the cursor position, our grammar expects an identifier (for a column name) so ANTLR drops the FROM token (which is unexpected at that point) and then consumes Account as a suitable column identifier. Here's roughly how the parse-tree looks like for that last example:

Parse Tree

As you can see, ANTLR doesn't identify FROM Account as being part of the From Clause: it consumes those tokens to fulfill the Select Clause. Thus, we cannot extract the table name.

Can we tweak ANTLR to be less smart here? Can we stop it from "swallowing" tokens? (at least the FROM token?). Turns out the error strategy of ANTLR can be customized. After some experimentation, I was able to improve the outcome for this particular use-case: extending DefaultErrorStrategy to disable token deletion:

export class SoqlCompletionErrorStrategy extends DefaultErrorStrategy {
  ...
  protected singleTokenDeletion(recognizer: Parser): Token | undefined {
    return undefined;
  }
  ...
}

With this new strategy, the parse-tree looks like the following:

Parse Tree

Which allows us to parse the table name and thus provide useful completion of column names on those cases.

There are trickier cases however. Here's one:

SELECT AVG(|) FROM Account

Here AVG() fails to parse (because it's missing an argument in between the ()), but the default error strategy accepts AVG because it matches the IDENTIFIER token (for a column name). The parser leaves the () out of the SELECT clause and attempts to parse them as part of the rule for the FROM clause, which obviously fails... and thus, again, we cannot extract the table name.

So, an extra adjustment to the error strategy helped complete the SELECT clause: when recovering from an error in a SELECT column, consume tokens until we find a FROM or a ,:

  protected getErrorRecoverySet(recognizer: Parser): IntervalSet {
    const defaultRecoverySet = super.getErrorRecoverySet(recognizer);

    if (recognizer.ruleContext.ruleIndex === SoqlParser.RULE_soqlField) {
      const soqlFieldFollowSet = new IntervalSet();
      soqlFieldFollowSet.add(SoqlLexer.COMMA);
      soqlFieldFollowSet.add(SoqlLexer.FROM);

      const intersection = defaultRecoverySet.and(soqlFieldFollowSet);
      if (intersection.size > 0) return intersection;
    }

    return defaultRecoverySet;
  }

The current full implementation is here.

I still don't fully understand ANTLR's error handling, and would like to spend more time on it. I feel there are a few other special cases which could be fixed with extra tweaks to this custom error-recovery strategy.

Unit tests

Even if you are not a devotee of the Test-Driven-Development church, it's very useful to write automated test first when implementing code-completion. The first thing I did for this challenge was to code some helper functions to write expressive unit tests quickly, making it easy to verify different scenarios and catching regressions as I added more functionality.

The simplest unit tests of the core completion logic look like the following:

validateCompletions('SELECT id |', keywordItems('FROM'));
validateCompletions('SELECT COUNT() |', keywordItems('FROM'));
validateCompletions('SELECT COUNT(), |', []);
validateCompletions('SELECT id,| FROM Foo', fieldsFor('Foo'));
// ...

They exercise the parsing and completion candidates of the completion logic, without any dependency on VS Code APIs.

Most unit test are in soql-language-server/src/completion.test.ts.

How to add code-completion to VS Code?

In general, you add new functionality to VS Code by writing "Extensions" (AKA "plug-ins").

Writing Extensions is relatively easy. Kudos to the technical writers on the VS Code team: the documentation is really good. It's clear, well-organized and covers a lot of ground. There are also plenty of examples and a lot of open-source extensions to steal learn from.

There's not much value I can add here. I'll just point to the main sources I've used:

To LSP or not?

One important decision to make when adding capabilities for a new language to VS Code is whether to use the LSP (Language Server Protocol) or not. LSP is a nice concept, but you are not forced to you use it: VS Code also provides its own API for implementing language features directly.

In short, these are the main advantages of using LSP:

  • You may implement your LSP server on any programming language
  • Your LSP-server implementation could potentially be re-used from other editors/tools

and disadvantages:

  • More moving parts
  • Less performant

For our particular case (a SQL-like language, for querying databases), using LSP adds some challenges because we have to fetch remote metadata for completion (like table and column names, more on this below). And since we are already using TypeScript (like VS Code), we don't need the independence on implementation language that LSP gives you.

That said, we still went the LSP route because we really want to provide an LSP server for the Salesforce community. We hope it'll be leveraged from other text editors and tools.

How to fetch completion data from the users's environment

When the user triggers code-completion in VS Code, the basic sequence looks like this:

General Case

Since we are talking about a database query language here, the data to use for code-completion is not locally accessible: it must be pulled from the remote database's metadata.

The handling of the connection to the users' Salesforce environment and DB is not trivial, and it's already solved as part of the core Salesforce VS Code Extensions. It includes handling of credentials, data caching and switching environments. The new SOQL language Extension can leverage all that, but it is not trivial to do so from the server-side of the LSP implementation... it is the LSP client that runs inside the Extension.

An easy approach could be for the LSP server to ask the LSP client for data when it needs it, but this would involve a lot of back-and-forth chatter and serialization of data:

Server asks Client for data

So, the initial approach I came up with is to split responsibilities:

  • SOQL LSP server: Language Parsing with the intelligence to identify what kind of data items to propose for completions at the current caret position. If the completion items require data from the DB, then return a special completion item indicating the LSP client to replace it with items from real DB metadata. Example: __SOBJECTS, to list of all SObjects (aka Tables), or: __SOBJECT_FIELDS:Account to list of all fields for "Account".
  • SOQL LSP client: Receive from the server the "special" completion items and replace them with real ones, fetching the data from the connection shared with the other Salesforce VS Code extensions.

This idea sounded crazy at first but worked fairly well so far:

Server asks Client for data

There's a drawback to this however: it imposes non-trivial responsibilities on the LSP client, making it harder to re-use from other editors. So, while this approach works fine and let me punt any connection-handling logic, we might want to change it in the future to handle the connections on the LSP server-side.

NOTE: The first time I read about LSP I thought (naively) that any editor or IDE with LSP support could simply connect to any LSP language server. This is not the case: most non-trivial LSP servers require each editor to also have a language-specific (even if small) LSP client. The main responsibility of each LSP client is to launch the server, and then pass on the connection information to the editor/IDE.

VS Code Integration tests

I also wrote integration test for the VS Code extension. These ones trigger the actual code-completion command, and they exercise the complete end-to-end integration for the VS Code editor UI, the initialization of the LSP client and its interaction with the LSP server.

The test cases look similar to the unit tests above:

 testCompletion('SELECT id FROM |', [
    { label: 'Account', kind: CompletionItemKind.Class },
    { label: 'User', kind: CompletionItemKind.Class }
  ]);

The key insight here is that you can programmatically trigger code-completion at any cursor position, and get the results back for validation, like so:

  const actualCompletionItems = ((await vscode.commands.executeCommand(
    'vscode.executeCompletionItemProvider',
    fileUri,
    position
  )) as vscode.CompletionList).items;

Pro tip: Configure the test runner to disable other VS Code extensions while running your integration tests: otherwise, they might provide other code-completion candidates intermixed with yours. The trick is to pass option --disable-extensions as described in testing extensions.

The implementation of the integration tests are at salesforcedx-vscode-soql/test/vscode-integration/client/completion.test.ts

Conclusion

I hope this write-up helps somebody in a similar situation. When I started I wasn't sure how to leverage the ANTLR grammar for code-completion (or if it was even possible). It turned out to be not only feasible, but practical with the help of antlr4-c3.

Please reach out if you have any suggestions or better ideas! I'm no expert on the subject.

Here's a quick demo (notice my favorite feature near the end, completing with ★ items on GROUP BY 😉):

There are a few other interesting problems I run into, like handling nested SELECT statements, and the embedding of SOQL LSP inside of Salesforce's Apex language LSP.... but enough for this post.

Thanks to my teammates from Salesforce who gave me the opportunity to take on this challenge. Specially to jgellin-sf, who was more closely involved in most decisions and who I kept bugging to QA my changes before merging.

Thanks!

Other References