r/RedditEng

*Written by Chris Schenk (*u/The-Real-Zucchini*)*

TL;DR

The GraphQL team at Reddit is expanding our observability to include build-time static code analysis that detects availability and latency regressions during the dev cycle, before code reaches production. This post covers how we detect changes to backend service calls in GraphQL code changes that may fundamentally shift internal traffic to services within Reddit - before those changes go live in production. We wrote two versions; the second runs in ~5 minutes in CI, a 36x speedup over the first.

Background

Reddit uses many standard tools to serve all the best Star Wars memes and arguments about when to celebrate May the Fourth (it’s May 25th, if you’re wondering). We’ve standardized on some critical technologies:

  • Go, as the successor to our Python services
  • gRPC, as the successor to Thrift
  • Prometheus and Grafana to monitor our services in real-time
  • OpenTelemetry tracing across our services to debug slow requests

  • GraphQL, as our client API for the rest of Reddit to build features

GraphQL Primer

We’ve written recently about GraphQL in the posts The Five Unsolved Problems in GraphQL and Protecting your GraphQL. This is a continuation of our ongoing efforts to make running our GraphQL service less of a burden on our team and to improve its stability and availability.

GraphQL promises an “implement once, use anywhere you need” system. GraphQL is a type system with named types, which can be scalars (Int, String, etc.) or object types with fields. Anyone can extend the schema of your data model’s types, such as Post or Comment in Reddit’s case. Once a new field is added to one of these types and the backend is implemented to fetch and return that data, you can reuse it any time you like.

The Setup

In reality, not all data is created equal, and neither are the backend systems that hydrate it. As described in our Protecting your GraphQL post, the data backing your home feed, posts, and comments has a different criticality than loading your settings. To frame our problems, we have the following:

  1. Our Go-based GraphQL service has millions of lines of code
    1. With hundreds of contributors to the codebase.
  2. Our schema defines thousands of types
    1. With roughly 3.5x more fields than types.
    2. Supports very diverse feature sets that require different handling, scaling, and criticality classification.
    3. This number grows daily.
  3. Our service talks to hundreds of backend services to fetch and mutate data for all of our users.

The Problems

GraphQL Makes it Easy To Break Scaling

GraphQL allows you to add any data to any query at any time. Take this example of an outage that has occurred periodically:

  1. An engineer adds a field to a high requests-per-second (RPS) operation.
  2. A low-scale backend service sees a significant increase in RPS.
  3. The service runs out of memory, fails to scale, and causes the operation to return an error.

Our team is able to identify and implement workarounds, yet we remain vulnerable to these kinds of changes.

https://preview.redd.it/33nkwunf86zg1.jpg?width=586&format=pjpg&auto=webp&s=811edeab32d1da0c17a75f4acc126db2b1b59527

Running Multi-Region Requires More Care

Reddit is a global service, and in order to serve our users with a good experience, we also deploy GraphQL to multiple regions. But not every backend service is deployed to every region. While we can fall back to dependencies across regions, this can add latency.

In a multi-region world, we must take great care to not cause users to have a worse experience, so we must know in advance if modified GraphQL dependencies might cause problems before we deploy.

GraphQL Makes Debugging More Difficult

As GraphQL has grown into a monolithic API for our clients, the team receives many questions, including asks to perform root-cause analysis. While these problems aren’t incident-worthy, they create a non-trivial burden on the team:

  • I added field `Post.foo` to my query, and it took down production. Why?
  • Why does query `GetFoo` have 300ms increased latency since date X?
  • If I remove field X from my query, will it reduce latency?
  • Is the field `MyType.foo` used in any operation?
  • What backend services are called for field `MyType.foo`?
  • Why did RPS to upstream Q increase 30x?

Many of these questions arise because we already have good runtime observability and can see when we have latency or availability regressions after we deploy code. The follow-up challenges lie in:

  1. Manually reading through our extensive codebases for the cause.
  2. Analyzing pull request (PR) diffs to identify why a change was made.
  3. Cross-referencing with changes and deployments from other repositories and teams if it wasn’t our code.

The above is common for any central platform team, but our time is limited and we need more self-service solutions for our contributors.

Our Motivating Questions

After documenting many of the above issues and incidents, we began to wonder how we could improve our observability before these kinds of changes go out. We asked the following questions:

  • Can we detect runtime regressions at build time or development time?
  • What problems specifically can we detect at build time?

Limits of Current Observability

To motivate the decisions for our dev-time solution, we will go into the limitations of our current observability. We can always add more telemetry, but solving for the above use-cases is not straightforward.

High Cardinality Breaks Prometheus

https://preview.redd.it/h57v53kg86zg1.jpg?width=549&format=pjpg&auto=webp&s=7fe356ee1730c02d725dfa8391e2c0385cc35802

Today our telemetry includes two key metrics:

  1. Unique GraphQL operation identifiers requested by clients.
  2. RPS and latency for each backend service endpoint call made.

One might think that we could simply put these values as labels in a single metric, and in fact we already tried! Each unique label value has a multiplicative factor on the total number of values saved in Prometheus. When we attempted to track both of these on a single metric, we went beyond the capability of our Prometheus instances due to:

  1. Tens of thousands of operation identifiers live at any given time.
  2. Hundreds of backend service endpoints.
  3. Unique Kubernetes identifiers for each running pod.

These constraints are well-known and normal in any production metrics system, and we have to work within them.
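The multiplicative blow-up is easy to see with back-of-the-envelope arithmetic. The figures below are illustrative round numbers in the ranges described above (the pod count is a hypothetical), not our exact counts:

```go
package main

import "fmt"

// Worst-case Prometheus series count for a single metric is the product of
// the cardinalities of its labels. The inputs here are illustrative.
func seriesCount(operations, endpoints, pods int) int {
	return operations * endpoints * pods
}

func main() {
	// ~20k live operation IDs x ~300 endpoints x ~500 pods
	fmt.Println(seriesCount(20000, 300, 500)) // 3000000000 -- billions of series
}
```

Three billion potential series on one metric is why the combined metric was never viable, independent of how much hardware backs the Prometheus instances.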

Additional Abstractions Missing in the Metrics

Additionally, there are layers of abstraction between these two metrics that would remain unknown even if we had a single combined metric. Specifically, we need to know the links between our GraphQL operations (teal) and fields (purple), the resolver functions (blue), and the backend service calls (green+orange).

AI tools do help with this problem and we’ve started to use them for this purpose, yet they can still run out of context and miss details given the size of our codebase.

https://preview.redd.it/eteu6efh86zg1.png?width=1967&format=png&auto=webp&s=b95dd392aabdf389c447bc2e367dff935f4779e8

What We Get if We Have The Data

Let’s revisit the example mentioned above of an engineer adding a GraphQL field to an existing, high-RPS, critical operation. Let’s call this operation `Colors`. During development time, we would be able to do the following with this data:

  1. Engineer modifies the `Colors` operation by adding field `transparency`.
  2. Our CI systems detect critical GraphQL operation `Colors` has a field added.
  3. Look up resolver function for the `transparency` field.
  4. Find endpoint `GetTransparency` on `rgb` service is called to resolve the field.
  5. Check if `rgb` service is fully deployed to all regions.
  6. See that the service is not yet deployed everywhere.
  7. Warn engineer and others that they will cause a latency regression if the change is made live.

This is only one example, and there are a lot of areas of prevention to explore once we are able to fill these data gaps.
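As a sketch, the check in the steps above boils down to set lookups over two datasets: which backend services a field depends on, and which regions each service is deployed in. All names below (`fieldServices`, `serviceRegions`, the region identifiers) are hypothetical shapes for illustration, following the `Colors` example:

```go
package main

import "fmt"

// Hypothetical dataset shapes: field -> backend services it calls, and
// service -> regions it is deployed in.
var fieldServices = map[string][]string{
	"Colors.transparency": {"rgb"},
}

var serviceRegions = map[string]map[string]bool{
	"rgb": {"us-east": true}, // not yet deployed to us-west
}

var allRegions = []string{"us-east", "us-west"}

// missingRegions reports, per backend service a field depends on, the
// regions where that service is not yet deployed.
func missingRegions(field string) map[string][]string {
	out := map[string][]string{}
	for _, svc := range fieldServices[field] {
		for _, region := range allRegions {
			if !serviceRegions[svc][region] {
				out[svc] = append(out[svc], region)
			}
		}
	}
	return out
}

func main() {
	for svc, regions := range missingRegions("Colors.transparency") {
		fmt.Printf("warning: %s not deployed in %v; this field may add cross-region latency\n", svc, regions)
	}
}
```

The hard part is not this lookup; it is producing the `fieldServices` mapping in the first place, which is what the rest of this post is about.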

A Static Code Analysis Prototype Is Born

As part of a Snoosweek (Reddit's semiannual hackathon) in the Fall of 2025, I explored static analysis to fill in the missing data. Could we effectively find endpoint calls within the resolver functions in order to tie fields to dependencies when GraphQL resolves an operation?

Our first prototype showed us that it’s possible to do, so we made time on our roadmap to create what we call Dependency Mapper, specifically for our Go subgraph.

Goals

We wrote down the following goals for our design:

  1. Summarize code changes involving endpoint calls.
    1. Link resolver functions to endpoint calls.
    2. Link GraphQL fields to resolver functions.
  2. Detect negative impacts to production availability, performance and efficiency.
  3. Do it at development time.

Building this nuanced analysis as requests flow from frontend to backend is what we mean by adding more “end-to-end” (e2e) observability.

Go Dependency Mapper V1 with gopls

The gopls tool implements the language server protocol (LSP). This is what your IDE uses to show you type information, allow you to click-through function calls, or find implementations of an interface in Go.

Since gopls already has logic to traverse codebases, we opted to use it in our first implementation of mapping our endpoint service dependencies for our GraphQL field resolvers. While gopls was a good first choice to prove the concept, this approach had multiple problems and missed endpoint calls.

Go Interfaces are a Hard Stop

This is speaking specifically to interfaces in the Go programming language, not GraphQL. If the mapper ran into a Go interface type that had more than one struct implementation, the program would not be able to traverse into the implementation without knowing which one to use. Normally we humans select which implementation to visit in our IDE, but a robot can’t do that without context.

This causes the mapper to miss entire sections of the code base, as interfaces are used by contributor teams for their own internal patterns, particularly for complex entity types like feeds and posts.

gopls is Memory Inefficient

gopls is fine for IDE use, where the number of requests issued during code authoring is quite low. But at a higher scale, mapping the entirety of the GraphQL resolver codebase, it uses all the memory it can get. This caused pods to be OOMKilled in Kubernetes, as gopls would use more memory than was available to the pod. Analyzing our codebase took ~3 hours, and we had to periodically stop and restart gopls to prevent this from happening.

Generally, gopls is also not meant to receive tens of thousands of requests per second; this isn’t the right use-case for it. Even running a single unit test would take minutes to complete, so we could not effectively iterate on the algorithm, which drove us to a second implementation.

https://preview.redd.it/fqzmzw8l86zg1.jpg?width=1056&format=pjpg&auto=webp&s=7465c4cff0d31305e413fa69c280ae297a460b91

Version 1 Is Still A Win

Even with these problems, our first run across the entire codebase proved illuminating: we could see how many endpoint calls each resolver function was making, even with an incomplete dataset. We also knew that in order to rely on these GraphQL field-to-endpoint mappings to make effective decisions to protect our infrastructure, we had to make the dataset as complete as possible.

Go Dependency Mapper V2 using Go AST Traversal

In order to replace gopls, we had to reimplement what it does internally: parsing the codebase and traversing the resulting Go Abstract Syntax Tree (AST).

Walking the AST is a Recursive Problem

An AST node represents a logical piece of syntax in the language. These nodes can reference each other in a recursive manner. Some examples from Go include:

  • `ast.IfStmt` - represents an `if` statement and all of its constituent parts.
  • `ast.CallExpr` - represents a function call expression and its arguments.
  • `ast.Ident` - a single-name identifier, such as a variable name “foo” used in `foo := 12`.
  • `ast.SelectorExpr` - a selector from a variable or package, such as accessing a struct member like `myStruct.SomeVal = …` or `mypackage.OtherFunc(...)`.

Once we have the code parsed into the abstract syntax tree, we can then walk each of the nodes and inspect information about them in order to detect the endpoint calls we’re interested in. We end up having switch statements that make for a regular recursive algorithm. You can read up on all the statement and expression types defined for the Go language AST at the Go language specification:

// walkStmt walks a single statement
func (dm *DepMapper) walkStmt(stmt ast.Stmt, pkg *packages.Package, ctx *WalkContext) {
    pos := pkg.Fset.Position(stmt.Pos())
    dm.logTrace("walking %T statement at %s %v\n", stmt, pos, stmt)
    switch s := stmt.(type) {
    case *ast.AssignStmt:
        dm.walkAssignStmt(s, pkg, ctx)
    case *ast.ReturnStmt:
        dm.walkReturnStmt(s, pkg, ctx)
    ...
    }
}

// walkExpr walks an expression
func (dm *DepMapper) walkExpr(expr ast.Expr, pkg *packages.Package, ctx *WalkContext) ExprType {
    pos := pkg.Fset.Position(expr.Pos())
    dm.logTrace("walking %T expression at %s %v\n", expr, pos, expr)
    switch e := expr.(type) {
    case *ast.CallExpr:
        return dm.walkCallExpr(e, pkg, ctx)
    case *ast.FuncLit:
        return dm.walkFuncLit(e, pkg, ctx)
    ...
    }
}
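For readers who want to experiment, the same recursive visit can be reproduced in miniature with the standard library’s `ast.Inspect`, which walks every node depth-first. The `demo` source and `callNames` helper below are illustrative stand-ins for our walker, not the real Dependency Mapper:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package demo
func Resolve() {
	c := newClient()
	c.GetPost("t3_abc")
	helper()
}`

// callNames parses source and walks every AST node, collecting the name of
// each function-call expression it encounters -- the same recursive visit
// the walkStmt/walkExpr pair performs, minus the bookkeeping.
func callNames(source string) []string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", source, 0)
	if err != nil {
		panic(err)
	}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		if call, ok := n.(*ast.CallExpr); ok {
			switch fn := call.Fun.(type) {
			case *ast.SelectorExpr: // e.g. c.GetPost(...)
				names = append(names, fn.Sel.Name)
			case *ast.Ident: // e.g. helper()
				names = append(names, fn.Name)
			}
		}
		return true // keep recursing into children
	})
	return names
}

func main() {
	fmt.Println(callNames(src)) // [newClient GetPost helper]
}
```

The production mapper needs the explicit `walkStmt`/`walkExpr` switches instead of `ast.Inspect` because it carries context (variable bindings, call stacks) down through the recursion.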

https://preview.redd.it/i4t8ag7m86zg1.jpg?width=500&format=pjpg&auto=webp&s=0af55854cafeaed0546afaab47eb63642224a611

Parsing the Code

We use a combination of the following libraries to replace gopls:

The `packages` package from golang.org/x/tools (yep, that’s right) gives you options when loading and parsing packages via `packages.Load`. By far the most important thing is to load all your packages together, as the library will reach out to nearby packages and parse them anyway. Parsing our codebase with `packages.Load` takes ~2 ½ minutes, and we expected a significantly faster runtime compared to V1 and gopls when traversing the AST directly.

patterns := []string{
    "github.com/reddit/graphql/packageone/...",
    "github.com/reddit/graphql/packagetwo/...",
}
cfg := &packages.Config{
    Mode: packages.NeedName |
        packages.NeedFiles |
        packages.NeedSyntax |
        packages.NeedImports |
        packages.NeedTypes |
        packages.NeedTypesInfo,
    Dir: "path/to/go/mod",
}
pkgs, err := packages.Load(cfg, patterns...)

Fast Iterate-And-Test Loop

The slow speed of V1 made it impossible to find and fix bugs, as traversing a single resolver function with gopls could take 3 minutes. In V2, `packages.Load` also caches its parsed files, so subsequent loads across executions take less than 10 seconds, which is impressive for a codebase of our size. This enabled us to write unit tests for various edge cases in the algorithm and reach a complete dataset much faster.

https://preview.redd.it/qh4rcx3n86zg1.jpg?width=865&format=pjpg&auto=webp&s=b7c84986f7845023a05a748bbaed48ad59e6bbba

The Go Interface Traversal Problem

The most important problem to solve replacing gopls is traversing through Go interface types encountered in the codebase. Even though `packages.Load` above gives us type information, it doesn’t give us runtime type information. Let’s illustrate with an example.

In this code, we have two service clients that both have the `GetPost` endpoint, `ProfileHydrator` and `SubredditHydrator`:

package services

import (
    postpb "reddit.com/subreddit/api"
    profilepb "reddit.com/profile/api"
)

type SubredditHydrator struct {
    // Our gRPC client for posts
    postClient postpb.SubredditClient
}

func (s *SubredditHydrator) GetPost(id string) (*model.Post, error) {
    return s.postClient.GetPost(id)
}

type ProfileHydrator struct {
    // Our Thrift client for profiles
    profileClient profilepb.ProfileClient
}

func (p *ProfileHydrator) GetPost(id string) (*model.Post, error) {
    return p.profileClient.GetPost(id)
}

We have a Clients struct that is initialized with each service client at startup:

package clients

type Clients struct {
    Subreddit *services.SubredditHydrator
    Profile   *services.ProfileHydrator
}

func New(cfg Config) *Clients {
    c := &Clients{}
    c.Subreddit = services.NewSubredditHydrator(cfg.Subreddit)
    c.Profile = services.NewProfileHydrator(cfg.Profile)
    return c
}

Now we have a helper function that loads a post for anything that has `GetPost`, specifically accepting the `PostHydrator` interface type as a parameter:

// PostHydrator allows for loading anything that looks like a post
type PostHydrator interface {
    GetPost(id string) (*model.Post, error)
}

// DoPostHydration takes any type of post hydrator
func DoPostHydration(id string, hydrator PostHydrator) (*model.Post, error) {
    return hydrator.GetPost(id)
}

In our GraphQL Go service, we use gqlgen as our execution engine. Our field resolver functions all have a receiver struct that is auto-generated, such as `queryResolver` or `mutationResolver`. These receivers have access to the Clients struct initialized with the service so the resolvers can make service calls to hydrate data:

package resolver

import (
    "reddit.com/graphql/clients"
    "reddit.com/graphql/services"
)

type Resolver struct {
    Clients *clients.Clients
    ...
}

type queryResolver struct{ *Resolver }

// SubredditPost is the resolver for the subredditPost field.
func (r *queryResolver) SubredditPost(
    ctx context.Context,
    postID model.ID,
) (*model.Post, error) {
    // Load the post using the helper function
    return services.DoPostHydration(postID, r.Clients.Subreddit)
}

Notice above that we pass in `r.Clients.Subreddit`. When we traverse into the `services.DoPostHydration` call without any additional context, we can’t tell what concrete type was passed in by looking only at the type signature of the `DoPostHydration` function. This is the same limitation as gopls: while gopls can find all implementations of an interface, it leaves the choice of which one to follow to the user, and since this is a program, we won’t have a human available to make that choice.

We conclude three things about solving this problem:

  1. As an invariant, the runtime execution context has all the concrete implementations available,
  2. We need a way to find concrete implementations during static analysis, and
  3. We must track variables and their types as they’re used throughout the code in order to collect the implementations we need for continued traversal in the code.

Tracking variables and types as a pattern enables us to do the following:

  1. Detect and track the concrete types for our services returned in our `clients.New` startup function.
  2. Bind those concrete types to the GraphQL resolver function variables `r.Clients` above.
  3. Pass variables as parameters to function calls throughout the call tree.

With this logic implemented, we can traverse our codebase and detect each endpoint call and their locations, linked with our GraphQL fields.
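The variable-and-type tracking above can be sketched with the standard library alone: `go/parser` builds the AST and `go/types` resolves each expression’s concrete type, which is exactly the information needed to pick an implementation when an interface method call is encountered later. The `demoSrc` program and `concreteBindings` helper are hypothetical stand-ins, not Reddit’s actual code:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// demoSrc is a self-contained stand-in for the hydrator example: an
// interface with two implementations, and a variable bound to one of them.
const demoSrc = `package demo
type PostHydrator interface{ GetPost(id string) string }
type SubredditHydrator struct{}
func (SubredditHydrator) GetPost(id string) string { return "sub:" + id }
type ProfileHydrator struct{}
func (ProfileHydrator) GetPost(id string) string { return "prof:" + id }
var h PostHydrator = SubredditHydrator{}
`

// concreteBindings type-checks src and records, for each package-level
// variable, the concrete type of its initializer -- the information a
// static walker needs to pick the right implementation when it later
// sees an interface method call through that variable.
func concreteBindings(src string) (map[string]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	info := &types.Info{Types: map[ast.Expr]types.TypeAndValue{}}
	conf := types.Config{}
	if _, err := conf.Check("demo", fset, []*ast.File{f}, info); err != nil {
		return nil, err
	}
	bindings := map[string]string{}
	ast.Inspect(f, func(n ast.Node) bool {
		vs, ok := n.(*ast.ValueSpec)
		if !ok || len(vs.Values) != len(vs.Names) {
			return true
		}
		for i, name := range vs.Names {
			// The declared type of h is the interface, but the
			// initializer expression has the concrete type.
			if tv, ok := info.Types[vs.Values[i]]; ok {
				bindings[name.Name] = tv.Type.String()
			}
		}
		return true
	})
	return bindings, nil
}

func main() {
	b, err := concreteBindings(demoSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println(b["h"]) // demo.SubredditHydrator
}
```

The real mapper additionally threads these bindings through assignments and function-call parameters, so a binding recorded at startup survives all the way down to the resolver’s call tree.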

Detecting Thrift and gRPC Calls

A simple way to implement this detection would be to hard-code the client APIs in a list and look for those in the resolver functions. But since the systems architecture at Reddit continues to evolve and we keep adding new services, we needed a generalized approach that handles any new clients added to our codebase.

With the recursive nature of the algorithm, we are able to traverse into all code and make decisions based on where we are. Since Thrift and gRPC generate code, we can rely on patterns in the generated code to detect if a function call resides within that code. After analysis, we found the following statements in the generated code to use as our detection heuristic.

For Thrift:

var _ = thrift.ZERO

For gRPC:

const _ = grpc.SupportPackageIsVersion9

For HTTP, we pass in a specific package and struct name combination for the struct used for all HTTP service clients.
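A minimal version of the gRPC heuristic needs only the parser: look for a `const _ = grpc.SupportPackageIsVersionN` declaration in the file. The sketch below is an assumption-laden miniature of our detector, not the production code:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

const generatedSrc = `package postpb
import "google.golang.org/grpc"
const _ = grpc.SupportPackageIsVersion9
`

// looksGRPCGenerated reports whether a file contains the
// "const _ = grpc.SupportPackageIsVersionN" marker found in generated gRPC
// code. Parsing only; no type information is required.
func looksGRPCGenerated(src string) bool {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "x.go", src, 0)
	if err != nil {
		return false
	}
	for _, d := range f.Decls {
		gd, ok := d.(*ast.GenDecl)
		if !ok || gd.Tok != token.CONST {
			continue
		}
		for _, spec := range gd.Specs {
			vs, ok := spec.(*ast.ValueSpec)
			if !ok {
				continue
			}
			for i, name := range vs.Names {
				if name.Name != "_" || i >= len(vs.Values) {
					continue
				}
				sel, ok := vs.Values[i].(*ast.SelectorExpr)
				if !ok {
					continue
				}
				pkg, ok := sel.X.(*ast.Ident)
				if ok && pkg.Name == "grpc" &&
					strings.HasPrefix(sel.Sel.Name, "SupportPackageIsVersion") {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	fmt.Println(looksGRPCGenerated(generatedSrc)) // true
}
```

The Thrift check follows the same shape against `var _ = thrift.ZERO`.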

The Performance of V2

The full runtime of the V2 Dependency Mapper after the initial AST parsing depends on how many GraphQL resolver functions we have to traverse, including how complex the call trees are underneath them.

GraphQL Resolver Functions

There are a number of interesting implications when using gqlgen for GraphQL execution:

  • Each top-level GraphQL field automatically gets a new resolver function.
  • Additional per-field resolver functions can be added to gqlgen.conf.

Of the many thousands of fields in our schema, only about 10% have resolver functions defined for them. This means each resolver function is responsible for resolving a lot of data that may be requested underneath it. We can correlate resolver complexity with how much time it takes to traverse the function call chain.

The Numbers

Traversing our GraphQL resolvers takes ~2 ½ minutes, for a total runtime of ~5 minutes when combined with the `packages.Load` parsing of the code. This is a 36x speedup over V1.

With a runtime of ~5 minutes, this is fast enough for us to move away from a Kubernetes cron job to a standard validation step during our build process in CI. Every time we push code to our mainline branch and release it to production, we are guaranteed to have a static analysis dataset of what’s latest in production just minutes after landing.

This is also single-threaded, and we could get further runtime gains if needed by adding goroutines to process a queue of resolver functions.
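Since each resolver’s traversal is independent of the others, that parallel version could look like a standard worker pool over a channel of resolver names. `analyzeResolver` below is a placeholder for the real traversal, not our actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// analyzeResolver stands in for the per-resolver AST traversal; here it
// just returns a fake endpoint-call count.
func analyzeResolver(name string) int {
	return len(name) // placeholder work
}

// analyzeAll fans resolver names out to `workers` goroutines and collects
// the results under a mutex, since map writes are not concurrency-safe.
func analyzeAll(resolvers []string, workers int) map[string]int {
	jobs := make(chan string)
	results := make(map[string]int, len(resolvers))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				n := analyzeResolver(name)
				mu.Lock()
				results[name] = n
				mu.Unlock()
			}
		}()
	}
	for _, r := range resolvers {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	got := analyzeAll([]string{"SubredditPost", "Profile", "Comments"}, 2)
	fmt.Println(len(got)) // 3
}
```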

The Result

We finally arrive at our destination: a mapping from a GraphQL operation’s fields (purple) to resolver functions (blue) to service endpoint calls (green+orange). We store the dataset keyed by the Git commit SHA of the analyzed code, which enables us to link it to a release in production.

https://preview.redd.it/v0bkkrog96zg1.png?width=1961&format=png&auto=webp&s=567b0c3c37bdd14b37afe204836d271177160572

The example JSON below demonstrates the data:

  1. The associated GraphQL field in `graphQLField` object: `Query.postById`.
  2. Information about the resolver function, in this case `func (r *mutationResolver) GetPostById(...)`.
  3. Service endpoint calls in the `endpointCalls` object.

{
  "git": {
    "sha": "d3f43d80f68583cfff85aca3869d011498134107"
  },
  "createdAt": "2026-04-08T21:22:42Z",
  "durationNanos": 194648368750,
  "service": {
    "serviceName": "graphql",
    "language": "go"
  },
  "targets": [
    {
      "serviceType": "graphql",
      "durationNanos": 194648367792,
      "graphQLData": {
        "configFilePath": "gqlgen.yml",
        "resolverFunctions": [
          {
            "package": "reddit.com/graphql/internal/resolvers",
            "filename": "resolver.go",
            "line": 283,
            "column": 1,
            "functionName": "GetPostById",
            "functionReceiver": "mutationResolver",
            "graphqlField": {
              "parentType": "Query",
              "fieldName": "postById",
              "isDeprecated": false
            },
            "endpointCalls": {
              "total": 1,
              "countsByUpstream": {
                "reddit.com/graphql/internal/backend.Clients.Post": {
                  "name": "reddit.com/graphql/internal/backend.Clients.Post",
                  "total": 1,
                  "countsByEndpoint": {
                    "GetPostsByIds": 1
                  }
                }
              },
              "calls": [
                {
                  "package": "reddit.com/graphql/internal/backend/posts",
                  "filename": "posts.go",
                  "line": 408,
                  "column": 42,
                  "clientID": "reddit.com/graphql/internal/backend.Clients.Post",
                  "clientLocation": [
                    "reddit.com/graphql/internal/backend.Clients",
                    "Post"
                  ],
                  "endpointName": "GetPostsByIds",
                  "protocol": "grpc",
                  "callStack": [
                    ...
                  ]
                }
              ]
            },
            "durationNanos": 424958
          }
        ]
      }
    }
  ]
}

Limitations of Static Analysis

Static Analysis Tells You What Might Happen

It does not tell you what actually happens. This is an important distinction when making sense of the detected endpoint call counts. The static analysis essentially gives you the worst-case call counts, as if every conditional branch in every portion of traversed code were executed, which is never the case in reality.

For Loops are a Problem

Since we’re acting as an interpreter, “for” loops become a problem:

orderedIdx := make([]int, 0, limit)
for i := 0; i < limit; i++ {
    orderedIdx = append(orderedIdx, i)
}

We don’t actually know the value of `limit` during our analysis, so we currently are unable to properly assign the loop variables and process the block with the correct values. We have cases in our codebase where function literals that contain backend service calls are added to a slice, then iterated and handed off to goroutines, and we have yet to come up with a solution for this.

Range statements are similar, but are more approachable.

ptrEvents := make([]*model.Event, 0)
for _, event := range events {
    ptrEvents = append(ptrEvents, event)
}

Here, `events` is a slice that we may have been able to track internally through built-in `append` calls. If so, we would be able to iterate the values we could interpret and run the block with the correct variable assignment. However, if the `events` slice was assigned through indexing (e.g. `events[i] = myValue`), we would not have the data.

Ultimately, we may be able to solve this problem by detecting index references inside loops and implementing a back-tracking algorithm that iterates when we see a slice indexed by an integer. This is future work for us to explore, as it would require a decent amount of roll-back functionality, especially if the slice reference happens further down the stack through another function call (which is possible).

How We’ll Use This Data

Reducing Data Over-Fetching at the Operation Level

We are already underway with client projects to reduce data over-fetching and make the app more efficient and performant. With this data set, we can now parse a full GraphQL operation and look up the field mappings while we’re traversing the operation and summarize all the possible work that an operation might perform during execution.

Our client teams have also generated data sets through static and runtime analysis of what data is fetched but not referenced within client code. The next step is to analyze the unused fields and group them by resolver function, so client teams can prioritize removing groups that result in entire backend endpoint calls being removed from the runtime execution, resulting in faster page loads for everyone.

Regional Service Readiness

As Reddit continues to expand its global infrastructure footprint, we want to know which GraphQL operations are fully servable within a region. We aren’t yet able to roll out all our services at once when serving a new region, so we want to use this dataset alongside our Achilles SDK, which we use to manage our Kubernetes workloads, to detect whether an operation can be fully served out of a region. This way, we can perform intelligent routing to keep your Reddit experience quick and responsive, no matter where in the world you’re coming from.

Analysis for Backend Go Services

Since the Dependency Mapper fundamentally operates on analyzing a function and all of its dependencies, we can adapt it to also work on our backend services and continue to build out a static analysis graph across service calls at the company.

Detection of Database and Experiments Calls

The logic for detecting “edge” calls that exit the system could be easily extended beyond endpoints to support systems such as:

  • Redis, Memcache
  • Postgres, MySQL or No-SQL databases
  • Sqlc queries and extraction
  • Experiments systems calls
  • And more!

We can add these as a configuration parameter to enable/disable at analysis time. We can detect uses of any of the associated libraries and track those to be reported in the final data set as well.

Tracing Data Sources for Fields

Today, the Dependency Mapper tracks what backend calls are made during execution. The algorithm and data structures could be extended to tell you exactly where a piece of data comes from when it is returned in the GraphQL API, even if that data is derived from multiple sources. This is helpful as we continue to migrate data to dedicated services and need to know where data is used so we can update references in our code.

And Finally

We reached our goal to connect our two runtime datasets together with a static analysis dataset, and have a strong roadmap for adding more functionality for detecting more regressions before they go to production.

Special thanks go to our teammate Brendon Kofink for his V1 implementation of the Dependency Mapper.

We’re always looking to improve our infra here at Reddit, and this is an observability gap we are excited to fill. Let us know how you’re continuing to improve your observability, too.

https://i.redd.it/lpx4wzzp96zg1.gif


Localization at Reddit: Developing for a Global Audience

Written by Cláudio Ribeiro u/EmeraldMacaw

TL;DR: Considering localization (L10n) at the inception of an online product isn't just a “nice to have,” it helps beyond translations by keeping the code cleaner, improving the UI's flexibility, and making sure the text content is top-notch.

Oftentimes, when a new online product is released, translation is treated like a future problem. It seems logical to say “I'll come back and fix it once we've scaled.” This happens often with software created by companies focused on a local market. But, including localization in the beginning is helpful beyond reaching more users: it makes the code more readable and guarantees text will display as intended everywhere.

Localizing after a product is out can be compared to making a fuel car electric, or trying to restyle a subreddit after millions of users have already gotten used to it. The effort required to retroactively localize is the most compelling reason to not leave it as an afterthought. Take Reddit, for example: our first attempt at localizing Old Reddit was crowdsourced and loosely supervised, which created an inconsistent experience and incomplete translations. Contributors also lacked the necessary context and visual aids to get the work done. In the end, few people used the localized versions and Reddit remained an English-first platform. (Though I must recognize the Pirate English version was pure gold.)

Once it became more noticeable that more and more people from different backgrounds and origins were browsing, contributing and creating, Reddit began working on a localized, globalized heart of the internet. Our first attempts were timid (volunteer translators commenting their suggestions in threads that contained the source strings), but we’ve matured our approach. We’ve implemented a translation management system (TMS) and are developing code in ways that keep localization in mind. Reddit now offers translations into 35 languages from 33 countries and supports 7 different alphabets that are used by millions of users.

Not surprisingly, we faced some setbacks before we got to where we are today: alphabets that wouldn't render, translations that weren't 100% adequate (as reviewers couldn't edit them), truncated text where the UI lacked room, untranslatable content (try translating the Tragedy of Darth Plagueis the Wise…), a mess with genders, plurals, and syntax, etc. These were difficult challenges to overcome, and we learned lessons along the way.

On that note, I’d like to share some of them with you. Below we’ll focus on some key aspects that illustrate the pros of pre-planning and how to get the house in order: accessibility, design, content review, time-to-market, data analysis, quality assurance, and code maintenance.

Accessibility

One of the first places where localization proves its worth is adding descriptions to images, buttons, and options (they even have their own writing style to be most useful to the end-user) to make a platform more accessible. Localizing the website is still relevant even when there are no ambitions to expand the brand abroad, as it's essentially “localizing” for users with impaired vision.

By making sure accessibility is implemented, a company can reach a market within its own domain, and it becomes easier to localize into other languages later. Accessibility is a way of “localizing” for users who need it; it extends to other communication systems, such as braille and sign language.

At Reddit, focusing on accessibility was a game changer. We improved our apps to include those with impaired vision, which allowed us to better serve our existing users–and to remain inclusive when we entered new markets with new languages.

Content descriptions can provide translations to screen readers, too

Design

When it comes to a product's design, localization can also help in less obvious ways. English is a “short” language, meaning it doesn't take much space and can express a lot of information without a lot of characters. This makes it easy to fit into tiny spaces, but other languages can take up far more room (up to 40% more than English in some cases), and that can break the UI for users of longer languages.

This is where pseudo-localization comes in: it can run in design tools (Figma, Sketch, Penpot) and in the code, artificially expanding each English word by 20-40% of its size in a random distribution, allowing designers to account for the most expansive languages without compromising the original content. It's like using “banana for scale” for buttons. Using pseudo-localization to design products improves the overall experience by preemptively ensuring the UI is comfortable to use in any resolution and language.
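As a rough sketch of the idea (not Reddit's actual tooling), pseudo-localization usually does two things: it swaps letters for accented look-alikes so any hardcoded string jumps out visually, and it pads each string by 20-40% to simulate longer languages. A minimal Python version might look like this:

```python
import random

# Map ASCII letters to visually similar accented characters so any
# untranslated (hardcoded) string stands out immediately in the UI.
ACCENTED = str.maketrans(
    "aeiouAEIOUcnyCNY",
    "àéîöûÀÉÎÖÛçñÿÇÑŸ",
)

def pseudo_localize(text: str, min_expand: float = 0.2, max_expand: float = 0.4) -> str:
    """Accent every mapped letter and pad the string by 20-40% to
    simulate expansion in longer languages such as German or Finnish."""
    accented = text.translate(ACCENTED)
    extra = round(len(text) * random.uniform(min_expand, max_expand))
    # Brackets make truncation obvious: if a bracket is cut off,
    # the UI element is too small for longer languages.
    return f"[{accented}{'~' * extra}]"

print(pseudo_localize("Create a community"))
```

Running the UI with pseudo-localized strings makes both problems visible at once: missing bracket means truncation, and un-accented text means a string that bypassed the localization pipeline.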

Pseudo-localized text in English

Cross-Functional Collaboration

When localization is introduced into a product's development lifecycle, it means dozens, potentially hundreds, of extra specialized eyes will carefully inspect every word written by different teams, for different projects, at different times. To do their best work translating, they interpret the context and the intent so they can preserve the tone in their language. Their reading is far more focused than the general audience's, especially on a platform like Reddit, which has a truly unique personality and tone.
Translators are professional linguists who need to read, interpret, and fully understand each string we publish, be it in our product itself, in the Reddit Help Center, or even in marketing material. This greatly amplifies Reddit's capacity to fix typos, outdated content, inconsistent experiences, and so on, as translators need to pay more attention to the source and often find errors a casual reader might miss.

A set of corrections spotted by linguists in Reddit's Help Center

Linguist eyes bring an extra level of polish to what we write and make our content even slicker and more “together,” which translates into trust with our users. A translated product name can sometimes even serve as inspiration for a company's naming conventions, and since each culture has its unique way of expressing itself, different perspectives can make what we say more universal and human, which is kind of our thing.

Linguists will navigate the entire UI/UX to see the localized product in practice. This allows them to help engineering teams by finding issues that might have been overlooked, or that the regular user wouldn't bother to report. Any new feature release gets extra pairs of hands playtesting the content. This, combined with the content review component, adds an extra layer of polish and results in a better overall user experience.

Sometimes an L10n bug will also help to improve existing English content

Localization Infrastructure

Caring for localization infrastructure helps keep content homogeneous, and it also makes us more nimble when it comes to market expansion. Even if expanding into new markets seems like a distant dream, getting the structure ready from the get-go gives a company much more speed when it decides to launch in a new region.

Implementation of plurals in a string

Properly implementing plurals is important because many languages have more plural categories than English's “one” and “other.”
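To make this concrete, CLDR defines per-language plural categories (one, few, many, other, and so on). The sketch below hand-rolls a tiny subset for English and Russian purely for illustration; real code should use a library such as Babel or ICU rather than encoding these rules by hand:

```python
def plural_category(n: int, locale: str) -> str:
    """Tiny illustrative subset of CLDR cardinal plural rules for integers."""
    if locale == "en":
        return "one" if n == 1 else "other"
    if locale == "ru":
        # Russian distinguishes one/few/many based on the last digits.
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    return "other"

# Hypothetical message catalog; keys follow the CLDR category names.
MESSAGES = {
    "en": {"one": "{n} upvote", "other": "{n} upvotes"},
    "ru": {"one": "{n} голос", "few": "{n} голоса", "many": "{n} голосов"},
}

def format_upvotes(n: int, locale: str) -> str:
    return MESSAGES[locale][plural_category(n, locale)].format(n=n)

print(format_upvotes(1, "en"))   # 1 upvote
print(format_upvotes(3, "ru"))   # 3 голоса
```

An English-only “one/other” switch silently breaks for a language like Russian, where 1, 3, and 11 each take a different noun form; that is exactly the kind of redundant code that has to be ripped out later if plurals aren't modeled up front.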

It ensures that code is ready to connect to translation management tools when needed, and it dramatically reduces the cost and time spent getting things in place for translators and engineers. The effort required to go back, finish an incomplete structure, and remove redundant code when more challenging markets come aboard (e.g. multiple plural forms, different characters, right-to-left orientation) will inevitably delay your go-to-market timeline in those markets. When Reddit introduced Arabic, addressing these concerns was critical to how we shaped our approach and launch strategy.

Reddit in Modern Standard Arabic

By creating strings with localization in mind, the code also becomes cleaner and string drift is avoided (i.e. we don't have the same word being spelled in three different ways in three different files). Centralizing all the product's strings means normalizing the storage of site content, which is a core tenet of good database and software design. We decoupled the management of translations from logic and reduced complexity and overhead in our code.
Engineering with L10n in mind helps make the code cleaner, more readable, and robustly documented. It's easier to understand where any string gets inserted, it makes changes simpler and safer (ensuring no hardcoded strings ever reach production), and it paves the way for automated tests that can enforce best practices.
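The core pattern here is simple: every user-facing string lives in a central catalog and is looked up by key, never written inline. A minimal sketch (the keys and locale data are hypothetical, for illustration only):

```python
# All user-facing strings live in one catalog keyed by ID, never inline
# in application code. In practice these would be loaded from locale
# files managed by the TMS; the entries below are illustrative.
CATALOG = {
    "en": {"post.submit": "Post", "post.delete_confirm": "Delete this post?"},
    "de": {"post.submit": "Posten"},
}

def t(key: str, locale: str, fallback: str = "en") -> str:
    """Resolve a string by key, falling back to the source locale so a
    not-yet-translated string never crashes or blanks out the UI."""
    for loc in (locale, fallback):
        if key in CATALOG.get(loc, {}):
            return CATALOG[loc][key]
    raise KeyError(f"unknown string key: {key}")

print(t("post.submit", "de"))          # translated: Posten
print(t("post.delete_confirm", "de"))  # falls back to English
```

Because every string flows through one lookup function, a lint rule or CI test can flag any literal UI text that bypasses it, which is how “no hardcoded strings ever reach production” becomes enforceable rather than aspirational.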

After an online product is localized, bugs are squashed, and continuous rounds of testing are carried out to ensure nothing is broken and that translations follow the context, are easy to understand, and don't conflict with the UI. That’s a huge win for users and development teams.

Strings before descriptions have been added

Descriptions can also be helpful for engineers who might need to update a piece of code related to a specific string.

L10n should be woven into every aspect

Localization ties together linguistics with development, influences marketing strategy, provides data for a coordinated expansion, encourages best practices, and is intimately intertwined with product development, whether it has been activated or not. That is why it can't be seen as a “plug-in” you can add at a later stage, but as a foundational layer that must be taken into account at the ideation stage. It will most certainly save you a headache in the future.

I'm new to L10n. Where should I start?

Check out the Unicode CLDR Project and find out how implementing a repository that takes care of dates, currencies, patterns, and measurements can also help in preventing bugs related to date, time, and locale.

Read about the ICU Message Format to learn how your strings can contain logic, plurals, and gender variants (this can be used even in English to personalize the user experience with “Mr.” and “Mrs.,” for example).
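To show the flavor of what ICU's `select` argument enables, here is a stripped-down gender-variant lookup in plain Python (a real project would use an ICU implementation such as PyICU or the TMS's message formatter; the messages are made up for illustration):

```python
# Minimal stand-in for an ICU `select` argument with gender variants.
# ICU requires an `other` branch, which doubles as the fallback here.
GREETINGS = {
    "male": "Welcome back, Mr. {name}!",
    "female": "Welcome back, Mrs. {name}!",
    "other": "Welcome back, {name}!",
}

def greet(name: str, gender: str) -> str:
    template = GREETINGS.get(gender, GREETINGS["other"])
    return template.format(name=name)

print(greet("Solo", "male"))      # Welcome back, Mr. Solo!
print(greet("Organa", "none"))    # Welcome back, Organa!
```

Even an English-only product benefits: the same mechanism that picks “Mr.” versus “Mrs.” is what later lets translators handle grammatical gender correctly in languages where it affects far more than a title.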

See the first steps to create localization-ready code in Python and Go.
