Developing the code insights backend
MVP state of the backend
The current code insights backend is an MVP version contributed by @slimsag - it:
- Supports search-based insights today, but was designed to support any type of insight in the future.
- Is backed by a TimescaleDB instance.
- Follows Sourcegraph architecture pretty well, is as clean as I could make it, has pretty good test coverage, and is generally an OK starting point.
- Definitely has some work to be done / is very much an MVP.
- There is a project board with detailed info on what work remains in my eyes.
Ultimately, if you’re reading this you should view this code as a “take it or leave it” offering. My hope is you’ll benefit from it (even if just from a “what problems will I face?” explanation) and can use it as a starting point - but you should do what makes you most comfortable and successful. If throwing it all out helps, do that! Happy to help in any way I can, message me @slimsag on GitHub / @stephen on Slack.
Architecture
The following architecture diagram shows how the backend fits into the two Sourcegraph services “frontend” (the Sourcegraph monolithic service) and “repo-updater” (the Sourcegraph “background-worker” service):
Deployment Status
Code Insights backend is currently disabled on sourcegraph.com until solutions can be built to address the large indexed repo count.
Feature Flags
Code Insights is currently an experimental feature, and ships with an “escape hatch” feature flag that will completely disable the dependency on TimescaleDB (named `codeinsights-db`). This feature flag is implemented as an environment variable, `DISABLE_CODE_INSIGHTS=true`, which disables the dependency and does not start the Code Insights background workers or GraphQL resolvers. This variable must be set on both the `worker` and `frontend` services to remove the dependency. If the flag is not set on both services, the `codeinsights-db` dependency will be required.
Implementation of this environment variable can be found in the `frontend` and `worker` services.
This flag should be used judiciously and should generally be considered a last resort for Sourcegraph installations that need to disable Code Insights or remove the database dependency.
As of version 3.31, this flag has moved from the `repo-updater` service to the `worker` service.
Life of an insight
(1) User defines insight in settings
A user defines an insight in their user settings (today, only global user settings are supported #18397), e.g. at https://sourcegraph.com/site-admin/global-settings by adding:

```json
"insights": [
  {
    "title": "fmt usage",
    "description": "fmt.Errorf/fmt.Printf usage",
    "series": [
      {
        "label": "fmt.Errorf",
        "search": "errorf"
      },
      {
        "label": "printf",
        "search": "fmt.Printf"
      }
    ]
  }
],
This defines a single insight called `fmt usage`, with two series of data (the number of search results the queries `errorf` and `fmt.Printf` return, respectively).
Once defined in user settings, the insight will immediately show up for users at e.g. https://sourcegraph.com/insights - as long as they have the feature flag turned on in their user/org/global settings:

```json
"experimentalFeatures": {
  "codeInsights": true
},
```
(2) The insight enqueuer detects the new insight
The insight enqueuer (code) is a background goroutine running in the `repo-updater` service of Sourcegraph (code), which runs all background goroutines for Sourcegraph - so long as `DISABLE_CODE_INSIGHTS=true` is not set on the `repo-updater` container/process.
Every 12 hours on and after process startup (code) it does the following:
- Discovers insights defined in global/org/user settings (code) by enumerating all settings on the instance and looking for the `insights` key, compiling a list of them (today, just global settings #18397).
- Determines which series are unique. For example, if Jane defines a search insight with `"search": "fmt.Printf"` and Bob does too, there is no reason for us to collect data on those separately since they represent the same exact series of data. Thus, we hash the insight definition (code) in order to deduplicate them and produce a series ID string that will uniquely identify that series of data. We also use this ID to identify the series of data in the `series_points` TimescaleDB database table later.
- For every unique series, enqueues a job for the queryrunner worker to later run the search query and collect information on it (like the # of search results.) (code)
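The deduplication step can be sketched as follows. `seriesID` is a hypothetical stand-in for the real hashing code linked above (which hashes the whole series definition, not just the query string); the point is that identical definitions hash to the same ID:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// seriesID derives a stable identifier from a series definition so that
// identical series defined by different users collapse into one. Sketch
// only: the real code hashes the full insight series definition.
func seriesID(definition string) string {
	sum := sha256.Sum256([]byte(definition))
	return fmt.Sprintf("s:%x", sum[:8])
}

func main() {
	jane := seriesID("fmt.Printf")
	bob := seriesID("fmt.Printf")
	// Identical definitions share one series ID, so we only collect
	// the data once and both insights read from the same rows.
	fmt.Println(jane == bob)
}
```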
(3) The queryrunner worker gets work and runs the search query
The queryrunner (code) is a background goroutine running in the `repo-updater` service of Sourcegraph (code). It is responsible for:
- Taking work that has been enqueued by the insight enqueuer (specifically, just search-based insights) and dequeueing it. (code)
- Handling each job (code) by running a search query using Sourcegraph’s internal/unauthenticated GraphQL API (code) (i.e. getting all results, even if the user doesn’t have access to some repos).
- Actually recording the number of results and other information we care about into the insights store (i.e. into the `series_points` TimescaleDB table) (code).
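The dequeue-handle-record flow above can be sketched like this. `searchTotal` and `recordPoint` are hypothetical stand-ins for the real internal GraphQL client and insights store, respectively:

```go
package main

import "fmt"

// job is one unit of work enqueued by the insight enqueuer.
type job struct {
	seriesID string
	query    string
}

// handle runs the search internally (all repos, unauthenticated) and
// records the result count into the insights store. Dependencies are
// injected so the sketch stays self-contained.
func handle(j job, searchTotal func(query string) int, recordPoint func(seriesID string, value int)) {
	count := searchTotal(j.query)  // run the search query
	recordPoint(j.seriesID, count) // write into series_points
}

func main() {
	recorded := map[string]int{}
	handle(job{seriesID: "s:abc", query: "fmt.Printf"},
		func(q string) int { return 42 }, // fake search result count
		func(id string, v int) { recorded[id] = v })
	fmt.Println(recorded["s:abc"])
}
```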
(4) The historical data enqueuer gets to work
If we record one data point every 12h above, it would take months or longer for users to get any value out of backend insights. This introduces the need for us to backfill data by running search queries that answer “how many results existed in the past?” so we can populate historical data.
Similar to the insight enqueuer, the historical insight enqueuer is a background goroutine (code) which locates and enqueues work to populate historical data points.
It is a moderately complex algorithm; to get an understanding of how it operates, see these two explanations:
Naively implemented, the historical backfiller would take a long time on any reasonably sized Sourcegraph installation. As an optimization, the backfiller will only query for data frames that have recorded changes in each repository. This is accomplished by looking at an index of commits and determining if that frame is eligible for removal. (code)
There is a rate limit associated with analyzing historical data frames. This limit can be configured using the site setting `insights.historical.worker.rateLimit`. As a rule of thumb, this limit should be set as high as possible without performance impact to `gitserver`. A likely safe starting point on most Sourcegraph installations is `insights.historical.worker.rateLimit=20`.
(5) Query-time and rendering!
The webapp frontend invokes a GraphQL API which is served by the Sourcegraph `frontend` monolith backend service in order to query information about backend insights. (code)
- A GraphQL series resolver returns all of the distinct data series in a single insight (UI panel) (code)
- A GraphQL resolver ultimately provides data points for a single series of data (code)
- The series points resolver merely queries the insights store for the data points it needs, and the store itself merely runs SQL queries against the TimescaleDB database to get the datapoints (code)
Note: other developer docs explain the general reasoning for why we have a “store” abstraction. Insights’ usage of it is pretty minimal; we mostly follow it to separate SQL operations from GraphQL resolver code and to remain consistent with the rest of Sourcegraph’s architecture.
Once the web client gets data points back, it renders them! Contact @felixfbecker for details on where/how that happens.
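The resolver → store → SQL layering described above can be sketched like this. All names here are illustrative, not the real resolver or store API; the point is that the resolver never touches SQL directly:

```go
package main

import "fmt"

// point is one data point in a series.
type point struct {
	seriesID string
	value    float64
}

// store hides the SQL layer; the real store runs queries against TimescaleDB.
type store interface {
	SeriesPoints(seriesID string) []point
}

// fakeStore is an in-memory stand-in for testing the layering.
type fakeStore map[string][]point

func (s fakeStore) SeriesPoints(id string) []point { return s[id] }

// seriesResolver is what the GraphQL layer would expose for one data series.
// It merely delegates to the store.
type seriesResolver struct {
	store    store
	seriesID string
}

func (r seriesResolver) Points() []point { return r.store.SeriesPoints(r.seriesID) }

func main() {
	s := fakeStore{"s:abc": {{"s:abc", 42}}}
	r := seriesResolver{store: s, seriesID: "s:abc"}
	fmt.Println(len(r.Points()))
}
```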
These queries can be executed concurrently by using the site setting `insights.query.worker.concurrency` and providing the desired concurrency factor. With `insights.query.worker.concurrency=1`, queries will be executed serially.
There is a rate limit associated with the query worker. This limit is shared across all concurrent handlers and can be configured using the site setting `insights.query.worker.rateLimit`. The value to set will depend on the size and scale of the Sourcegraph installation’s `searcher` service.
Debugging
This being a pretty complex and slow-moving system, debugging can be tricky. This is definitely one area we need to improve especially from a user experience point of view (#18964) and general customer debugging point of view (#18399).
In this section, I’ll cover useful tips I have for debugging the system when developing it or otherwise using it.
Accessing the TimescaleDB instance
Dev and docker compose deployments
```sh
docker exec -it codeinsights-db psql -U postgres
```
Kubernetes deployments
```sh
kubectl exec -it deployment/codeinsights-db -- psql -U postgres
```
- If trying to access Sourcegraph.com’s DB:
```sh
kubectl -n prod exec -it deployment/codeinsights-db -- psql -U postgres
```
- If trying to access k8s.sgdev.org’s DB:
```sh
kubectl -n dogfood-k8s exec -it deployment/codeinsights-db -- psql -U postgres
```
Finding logs
Since insights runs inside of the `frontend` and `repo-updater` containers/processes, it can be difficult to locate the relevant logs. The best way to do it is to grep for `insights`.

The `frontend` will contain logs about e.g. the GraphQL resolvers and TimescaleDB migrations being run, while `repo-updater` will have the vast majority of logs coming from the insights background workers.
Docker compose deployments
```sh
docker logs sourcegraph-frontend-0 | grep insights
docker logs repo-updater | grep insights
```
Inspecting the Timescale database
Read the initial schema migration which contains all of the tables we create in TimescaleDB and describes them in detail. This will explain the general layout of the database schema, etc.
The most important table in TimescaleDB is `series_points`; that’s where the actual data is stored. It’s a hypertable.
Querying data
```sql
SELECT * FROM series_points ORDER BY time DESC LIMIT 100;
```
Query data, filtering by repo and returning metadata
```sql
SELECT *
FROM series_points
JOIN metadata ON metadata.id = metadata_id
WHERE repo_name_id IN (
    SELECT id FROM repo_names WHERE name ~ '.*-renamed'
)
ORDER BY time DESC
LIMIT 100;
```
(note: we don’t actually use metadata currently, so it’s always empty.)
Query data, filter by metadata containing `{"hello": "world"}`
```sql
SELECT *
FROM series_points
JOIN metadata ON metadata.id = metadata_id
WHERE metadata @> '{"hello": "world"}'
ORDER BY time DESC
LIMIT 100;
```
(note: we don’t actually use metadata currently, so it’s always empty.)
Query data, filter by metadata containing Go languages
```sql
SELECT *
FROM series_points
JOIN metadata ON metadata.id = metadata_id
WHERE metadata @> '{"languages": ["Go"]}'
ORDER BY time DESC
LIMIT 100;
```
(note: we don’t actually use metadata currently, so it’s always empty. The above gives you some ideas for how we intended to use it.)
See https://www.postgresql.org/docs/9.6/functions-json.html for more metadata `jsonb` operator possibilities. Only the `?`, `?&`, `?|`, and `@>` operators are indexed (GIN index).
Query data the way we do for the frontend, but for every series in the last 6mo
```sql
SELECT sub.series_id, sub.interval_time, SUM(value) AS value, NULL AS metadata
FROM (
    WITH target_times AS (
        SELECT * FROM GENERATE_SERIES(CURRENT_TIMESTAMP::DATE - INTERVAL '6 months', CURRENT_TIMESTAMP::DATE, '2 weeks') AS interval_time
    )
    SELECT sub.series_id, sub.repo_id, sub.value, interval_time
    FROM (SELECT DISTINCT repo_id, series_id FROM series_points) AS r
    CROSS JOIN target_times tt
    JOIN LATERAL (
        SELECT sp.*
        FROM series_points AS sp
        WHERE sp.repo_id = r.repo_id
            AND sp.time <= tt.interval_time
            AND sp.series_id = r.series_id
        ORDER BY time DESC
        LIMIT 1
    ) sub ON sub.repo_id = r.repo_id AND r.series_id = sub.series_id
    ORDER BY interval_time, repo_id
) AS sub
GROUP BY sub.series_id, sub.interval_time
ORDER BY interval_time DESC;
```
Inserting data
Upserting repository names
The `repo_names` table contains a mapping of repository names to small numeric identifiers. You can upsert one into the database using e.g.:
```sql
WITH e AS (
    INSERT INTO repo_names(name)
    VALUES ('github.com/gorilla/mux-original')
    ON CONFLICT DO NOTHING
    RETURNING id
)
SELECT * FROM e
UNION
    SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-original';
```
Upserting event metadata
Similar to `repo_names`, there is a separate `metadata` table which stores unique metadata `jsonb` payloads and maps them to small numeric identifiers. You can upsert metadata using e.g.:
```sql
WITH e AS (
    INSERT INTO metadata(metadata)
    VALUES ('{"hello": "world", "languages": ["Go", "Python", "Java"]}')
    ON CONFLICT DO NOTHING
    RETURNING id
)
SELECT * FROM e
UNION
    SELECT id FROM metadata WHERE metadata = '{"hello": "world", "languages": ["Go", "Python", "Java"]}';
```
Inserting a data point
You can insert a data point using e.g. (note single quotes: in Postgres, double quotes denote identifiers, not string literals):
```sql
INSERT INTO series_points(
    series_id,
    time,
    value,
    metadata_id,
    repo_id,
    repo_name_id,
    original_repo_name_id
) VALUES(
    'my unique test series ID',
    now(),
    0.5,
    (SELECT id FROM metadata WHERE metadata = '{"hello": "world", "languages": ["Go", "Python", "Java"]}'),
    2,
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-renamed'),
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-original')
);
```
You can omit all of the `*repo*` fields (nullable) if you want to store a data point describing a global (associated with no repository) series of data.
Inserting fake generated data points
TimescaleDB has a `generate_series` function you can use like this to insert one data point every 15 days for the last year (note the SELECT columns are ordered to match the column list):
```sql
INSERT INTO series_points(
    series_id,
    time,
    value,
    metadata_id,
    repo_id,
    repo_name_id,
    original_repo_name_id
)
SELECT
    'my unique test series ID',
    time,
    random()*80 - 40,
    (SELECT id FROM metadata WHERE metadata = '{"hello": "world", "languages": ["Go", "Python", "Java"]}'),
    2,
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-renamed'),
    (SELECT id FROM repo_names WHERE name = 'github.com/gorilla/mux-original')
FROM GENERATE_SERIES(TIMESTAMP '2020-01-01 00:00:00', TIMESTAMP '2021-01-01 00:00:00', INTERVAL '15 day') AS time;
```
Creating DB migrations
Since TimescaleDB is just Postgres (with an extension), we use the same SQL migration framework we use for our other Postgres databases. `migrations/codeinsights` in the root of this repository contains the migrations for the Code Insights Timescale database; they are executed when the frontend starts up (as is the case with e.g. codeintel DB migrations).

Currently, the migration process blocks `frontend` and `repo-updater` startup - which is one issue we will need to solve.