Speed Through Greed - Optimizing Roslyn Reference Lookup

Tags: roslyn, Symbol, FindReferences, ArchiMetrics

In the ArchiMetrics project I wanted to include metrics based on coupling (type coupling as well as afferent and efferent coupling). In order to calculate afferent coupling, you have to know where things are used. Roslyn comes with a FindReferencesAsync helper method. The problem with the helper method is that it is slow. Not slow to the extent that a single lookup from a VS extension would kill your system, but slow if you are doing thousands of lookups.

I tried to have a look at the code for the helper method, and the first couple of times I got lost. It's beautiful code, but it's an orgy of partial classes, overloaded methods and confusing (for me at least) instantiations. Once you spend some time with the code, you find out that it tries to optimize it's performance by filtering the number of projects to search by looking at the dependency graph. It then goes through the project documents and compares symbols against the one you are searching for and builds up a result set using concurrent dictionaries and immutable collections.

This is probably the optimal way if your scenario is the occasional lookup in a changing code base, i.e. if you are writing code in Visual Studio and you need to find references based on your latest changes. My use case is different though, I have a code that does not change during the analysis, so catering for changes is not relevant. This means that I can change the lookup sequences to a greedy single pass lookup. So instead of doing a comparison against a particular symbol when searching through the source files, I can harvest all the symbols and the location they are used. I can aggregate these results and group them by symbol, which gives a nice symbol to usage lookup table.

In practice the symbol resolver goes through every node and attempts to resolve a symbol for the node. If this is successful, then the location is recorded as well for future reference. The benefit of this approach is that source files can be processed in a single pass, which greatly reduces the I/O overhead, but also that symbols do not have to be compared until a later stage.

public static IEnumerable<IGrouping<ISymbol, Location>> Resolve(this Compilation compilation, SyntaxNode root)
{
	var model = compilation.GetSemanticModel(root.SyntaxTree);

	var fields = root.DescendantNodes()
		.Select(
			x =>
			{
				var symbol = model.GetSymbolInfo(x);
				return new { symbol = symbol.Symbol, node = x };
			})
		.Where(x => x.symbol != null)
		.Select(x => new { type = x.symbol, location = x.node.GetLocation() })
		.GroupBy(x => x.type, x => x.location)
		.ToArray();

	return fields;
}

On a solution wide basis the resolved locations can be added to a central repository for that solution which aggregates all the locations into a dictionary with the symbol as the key. This way all reference locations can easily (and almost immediately) be looked up from the symbol.

private async Task Scan(Solution solution)
{
	var roots = await GetDocData(solution).ConfigureAwait(false);

	var groups = from root in roots
				 let compilation = root.Compilation
				 from syntaxNode in root.DocRoots
				 from @group in compilation.Resolve(syntaxNode)
				 where @group.Any()
				 select @group;

	foreach (var @group in groups)
	{
		_resolvedReferences.AddOrUpdate(@group.Key, @group.ToArray(), (s, r) => r.Concat(@group).ToArray());
	}
}

Here you can see the code for the ReferenceResolver.

As mentioned above, this approach works if the code base does not change between lookups. With the Roslyn helper method, reference lookups is slow to the point of being useless for anything but the smallest solutions if you are doing a general resolution of symbol references. With this approach reference lookups for very large projects can be done in a matter of handful of minutes - approximately 1 minute per 100.000 lines of code. A further optimization could be to only look up symbols which can be referenced, which means filtering out nodes which are not referenceable before doing the symbol lookup.

Latest Tweets