Navigating Fair Housing Guardrails in LLMs

Zillow has been at the forefront of applying artificial intelligence (AI) in the real-estate domain since the release of the first Zestimate model in 2006. By staying true to our core value of “Turn On The Lights,” we have helped tens of millions of homeowners and buyers understand the potential market value of their property. We have also provided tools, technology, and data that help buyers, renters, and sellers make decisions along their real-estate journey. The many ways in which we have deployed AI to empower our customers create an obligation for us to consider some of the pitfalls of using AI. Given the hundreds of millions of people searching for housing across our web and app experiences, it’s vital that consumers are equipped with responsible, safe, and trustworthy tools that make the challenging home-purchase process equitable and transparent.
The rapid proliferation of large language model (LLM) technology in the past year has likely left no industry or application untouched. Early on, as our teams started learning about the many possibilities unlocked by advancements in LLMs and generative AI, it became obvious that real-estate search, evaluation, and transaction experiences could greatly benefit from LLM-powered features. Like so many other companies, we quickly imagined and started exploring a conversational experience where customers could engage in dialog, ask questions just as if they were talking to another person, receive human-like, intuitive answers in return, and refine their questions over a multi-turn conversation. While exciting and incredibly powerful, applying LLM technology in the real-estate domain clearly requires much more than out-of-the-box solutions offer.
This blog post discusses the potential of an LLM-powered, conversational experience in the real estate market, the challenges posed by discriminatory or offensive content, and the guardrails needed to enforce good behavior. We will walk through multiple approaches, ranging from a prompt-based solution to a standalone LLM Fair Housing Compliance classifier. We will also discuss how one might go about gathering and bootstrapping training and evaluation data. Finally, we will finish by sharing our future plans, which could include open sourcing an LLM Fair Housing Classifier for broader use.
A note of warning: some of the examples included in our discussion are not polite, insofar as they reveal bias that might feel discriminatory to our readers. It is exactly this discrimination that the work presented here aims to address.
Large Language Models are a type of AI model designed to understand and generate human language. They are trained on vast amounts of text data, learning to predict the next word in a sentence based on the context of the previous words. This ability to understand context allows LLMs to generate coherent and contextually appropriate text, making them highly effective at a range of language tasks.
LLMs are revolutionizing various aspects of technology and business and can be leveraged in many innovative ways. For instance, one can use LLMs to help further automate customer service responses, which not only improves efficiency but also ensures 24/7 customer support. LLMs can also be used to generate creative content for marketing campaigns, create personalized product descriptions, and even draft code or write reports.
Furthermore, LLMs can be used to increase internal operational efficiency in domains such as data analysis. With their ability to understand and generate human language, these models can turn raw data into easy-to-understand narratives, making it easier for decision-makers to interpret complex data patterns.
In essence, LLMs are becoming an indispensable tool in our tech arsenal, enabling us to streamline operations, enhance customer experiences, and drive innovation. However, when deploying current state-of-the-art LLMs, one must be aware that these models are often domain-agnostic and have been trained on nearly all available digital text. This can prove to be a challenge in domains with specific, narrow, and strict guardrails, where semantics can differ significantly from other common domains. As we will explore below, the real-estate industry, and specifically assisting customers with real-estate transactions, is undoubtedly one of those more challenging domains.
What are some of the specific legal regulations that apply to the real-estate domain, and what impact and potential risks should we consider? Let’s start with the applicable laws in this area, chief among them the federal Fair Housing Act (FHA), together with related federal, state, and local anti-discrimination statutes.
Combining these authorities leaves us with a set of legally protected classes that include (but are not limited to): race/color, national origin, sex (including sexual orientation and gender identity), familial status, religion, disability, age, marital status, source of income/receipt of public assistance income, criminal background, and military status.
When evaluating fair housing legal risks with respect to LLMs that facilitate some part of a real estate transaction, a helpful test is to ask, “Would it be illegal if a real estate agent or loan officer did it?” One illegal practice that applies to traditional real estate, and may apply to LLM behavior, is “steering.” Let’s look at this risk in more detail.
In the traditional real estate context, steering occurs when a real estate agent considers/relies upon the protected demographic characteristic(s) of their client when making a home or neighborhood recommendation, or considers/relies upon the protected demographic characteristics of the neighborhood.(3) Open housing is a fundamental purpose of the FHA, and steering has the effect not only of injuring the client, but also of perpetuating segregation.
When considering real estate LLMs, steering may occur when the model output is impacted by a protected class characteristic. Two examples of steering in the LLM context are as follows:
Any licensed real estate agent should be trained to avoid responding to such a prompt, or to provide a canned response such as “I’m sorry, I cannot assist with that request, although you are free to do your own research in order to inform your home buying search.” And while such an inquiry may be perfectly legal during an ordinary conversation, once it enters the realm of a real estate transaction, it is covered by fair housing laws.
To understand how one could train a generic LLM to comply with certain Fair Housing requirements, let’s envision a simple conversational interface that allows Zillow users to discuss real-estate topics with the objectives of understanding the process, identifying their desired location and type of home, finding the right homes to tour, getting introduced to an agent and loan officer, and eventually making an offer.
This interface enables multi-turn conversations where customers can follow up with additional questions, refine their queries, or start a new conversation. Such a conversational user experience is designed to resemble an interaction with a human real-estate agent and provide guidance to prospective customers along their real-estate journey.
There are many excellent resources and tutorials online on how to prompt an LLM to create the experience described above. The unparalleled expressive power of state-of-the-art LLMs allows us to implement this experience, including all its complexities, within a single prompt containing multiple natural-language instructions, augmented with a few necessary tools and databases.
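To make this concrete, a minimal sketch of such a single-prompt, multi-turn setup might look like the following, here using the OpenAI Python client as one example backend. The system prompt, model name, and helper function are illustrative assumptions, not our production implementation.

```python
# Minimal sketch of a single-prompt, multi-turn real-estate assistant.
# The system prompt, model name, and client usage are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a helpful real-estate assistant. Guide the user through "
    "understanding the home-buying process, identifying a desired location "
    "and home type, finding homes to tour, and connecting with an agent "
    "and loan officer. Ask clarifying questions when the request is vague."
)

def chat(history: list[dict], user_message: str) -> str:
    """Append the user's turn, call the LLM, and return its reply."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[dict] = []
print(chat(history, "I'm looking for a 3-bedroom home near good parks in Seattle."))
print(chat(history, "Can you narrow that down to homes under $800k?"))
```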
However, when prompted with input that contains legally protected characteristics, the system may respond in an unacceptable manner that is inconsistent with what is expected from a human operator. For example, when prompted with the query “Homes in a quiet location without families with young kids in Seattle,” a state-of-the-art LLM might respond with “I have found some information on homes for sale in quiet neighborhoods without families with young children. Here are a few resources you can check out …” Such a response would likely violate fair housing legal protections on the basis of familial status discrimination.
What is clearly required here is to make the conversational experience aware of fair housing requirements so that the appropriate guidance is followed. To achieve this, one solution would be to augment a generic LLM with implicit or explicit detection of whether the provided input might facilitate non-compliance. When non-compliance is detected, the LLM could be trained to explain that the system cannot respond and to point the customer to Fair Housing guidelines.
For example, let’s consider the three sample user queries listed in Table 1 below:
The “Expected Result” column in the table below lists our desired outcome: the first two queries are classified as non-compliant and the third as compliant.
Table 1: Sample queries and expected result
It is worth noting here that the above problem of compliance classification involves a common balancing act between the precision and recall of the implemented method. For the sake of this discussion, “non-compliant” is the “positive” label: precision measures how many of the flagged instances are truly non-compliant, while recall measures how many of the truly non-compliant instances are actually flagged. In the practice of traditional real estate, it’s never OK to steer a client either toward one property or away from another based on legally protected characteristics.
Thus, on the one hand, maximizing recall is of the utmost importance because it ensures that Fair Housing rules are never violated. On the other hand, low precision, which manifests as a high rate of incorrectly flagging queries as “non-compliant” (i.e., false positives), creates a suboptimal product experience: the LLM will refuse to respond to customers’ queries even when no Fair Housing rules are violated and the query only remotely resembles potentially non-compliant terms (as illustrated in the third example in Table 1).
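To make the tradeoff concrete, here is a small illustrative calculation of precision and recall with “non-compliant” as the positive label; the labels and scores are made up, and raising the flagging threshold trades recall for precision.

```python
# Illustration of the precision/recall tradeoff, treating "non-compliant"
# as the positive label. Labels and scores below are made-up examples.
from sklearn.metrics import precision_score, recall_score

# 1 = non-compliant (positive), 0 = compliant
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
scores = [0.9, 0.7, 0.65, 0.2, 0.55, 0.4, 0.6, 0.8]  # model's non-compliance scores

for threshold in (0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)  # flagged queries that are truly non-compliant
    r = recall_score(y_true, y_pred)     # truly non-compliant queries that were flagged
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```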
In the next sections, we will walk through three distinct strategies (prompt engineering, stop list, and classifier) for ensuring that our conversational experience controls for steering risk, take note of the pros and cons of each method, and describe a comprehensive approach that combines all three strategies into a single LLM Fair Housing Guardrails system.
As pointed out earlier, the expressive power of LLMs offers a unique opportunity to ensure widespread awareness of the FHA and other anti-discrimination requirements in the digital realm. This section considers a first approach to compliance via prompt engineering strategies for the identification of discriminatory practices.
A reasonable start to building an LLM-based conversational tool might begin with an initial prompt that establishes the LLM’s personality and attitude. It could also include the implicit context of the conversation and instructions on how to structure responses, including whether to solicit additional information. However, one should also provide clear guidance so that the system functions as a knowledgeable, ethical, and equitable real estate assistant, maintaining compliance with fair housing regulations while avoiding any form of discrimination. This could be achieved simply by adding an instruction to the system prompt to refrain from providing information in response to queries that reference protected classes.
The suggested guidance below could be added:
“Answer the given real-estate related questions without breaking fair housing laws or being discriminatory. You should politely refuse to provide information for queries that include references to protected classes like race, religion, sex, color, disability, national origin, familial status, gender identity, and sexual orientation due to fair housing regulations.”
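Continuing the earlier sketch, the guidance above could simply be appended to the assistant’s system prompt; the composition below reuses the hypothetical SYSTEM_PROMPT from before and is illustrative only.

```python
# Sketch: append the fair-housing guidance to the assistant's system prompt.
# Wording mirrors the instruction quoted above; composition details are illustrative.
FAIR_HOUSING_GUIDANCE = (
    "Answer the given real-estate related questions without breaking fair "
    "housing laws or being discriminatory. You should politely refuse to "
    "provide information for queries that include references to protected "
    "classes like race, religion, sex, color, disability, national origin, "
    "familial status, gender identity, and sexual orientation due to fair "
    "housing regulations."
)

SYSTEM_PROMPT_WITH_GUARDRAIL = SYSTEM_PROMPT + "\n\n" + FAIR_HOUSING_GUIDANCE
```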
The main advantage of a prompt-based approach is the simplicity of its implementation. The conversational behavior can be further refined via prompt engineering of the fair housing compliance instructions. While this approach covers and resolves a large number of clear and unambiguous compliance issues, a more thorough analysis quickly exposes some limitations.
For example, Table 2 (below) lists the LLM’s assessment of the three sample queries, with all of them being classified as non-compliant, as indicated by the response of the LLM:
“I’m sorry, but as an AI assistant developed to respect and adhere to fair housing laws, I can’t provide assistance based on race, religion, sex, color, disability, national origin, familial status, gender identity, and sexual orientation.”
Table 2: Sample queries with prompt result
It appears that the system became very cautious and strictly followed the fair housing guidance: any semantic connection with the protected classes can trigger a message like the one above. In other words, the system now has high recall with respect to detecting queries that may elicit a non-compliant response, but its precision is sub-optimal; it inadvertently withholds permissible information, potentially causing service-quality issues, alienating a segment of our users who already face barriers, and negatively impacting the overall user experience.
While the prompt-based approach can achieve relatively high recall, the behavior of the LLM is inherently non-deterministic, introducing variance and deviations that we cannot anticipate or control. For especially severe and offensive examples of fair housing violations and impermissible language, the system should handle and mitigate these cases explicitly and without exception. After all, LLMs are probabilistic, and there are cases where we want to ensure the right action is taken 100% of the time.
To this end, a database of words and phrases associated with various legally protected classes could be created; when present in prompts, these terms would be expected to elicit non-compliant outputs. This mitigation strategy parses the input query and looks for terms that syntactically match the stop list. If a match is found, the system outputs a predefined message, citing its duty to follow Fair Housing guidance.
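A minimal sketch of such a stop-list check might look as follows; the phrases and refusal message are illustrative placeholders rather than an actual production list.

```python
# Sketch of a stop-list check: a purely lexical match against known
# non-compliant phrases. The phrases below are illustrative examples only.
import re

STOP_PHRASES = [
    "do not accept section 8",
    "no section 8",
]

REFUSAL_MESSAGE = (
    "I'm sorry, I can't assist with that request because of fair housing "
    "regulations."
)

def stop_list_check(query: str) -> str | None:
    """Return a predefined refusal if the query contains a stop phrase."""
    normalized = re.sub(r"\s+", " ", query.lower())
    for phrase in STOP_PHRASES:
        if phrase in normalized:
            return REFUSAL_MESSAGE
    return None
```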
An obvious shortcoming of using a stop list is its reliance on a strict, lexical match without considering the context of the phrase. For example, consider the third sample query from Table 3 (below), which contains the word “disabilities”. Such a word might be used in a perfectly intuitive and compliant manner, indicating the user’s need to find homes that are accessible. However, the same phrase could be employed by a user who would like to avoid living next door to a neighbor who is disabled, in which case it would clearly be non-compliant. The stop list strategy is simply not sophisticated enough to address cases with semantic meaning that varies based on the context of the query.
A similar observation can be made for the first query, which specifically mentions the term “Indian,” indicating a reference to an ethnic demographic attribute. Putting this term on a stop list to resolve this particular case would immediately make all queries referencing “Indian” as a location name, e.g., the city of Indian Wells, CA, non-compliant, as well as any reference to points of interest such as the National Museum of the American Indian in Washington, D.C.
It is clear, therefore, that stop lists should not be the sole method used to prevent housing discrimination. Instead, a stop list can serve as one facet of a comprehensive approach, working alongside manual review and automated machine learning systems to cater to the complexities and nuances of natural language. Assuming a stop list contained the phrase “do not accept Section 8” but not the terms “Indian” or “disabilities,” we would end up with the classifications of our sample examples shown in Table 3 (below).
Consequently, this approach should be used sparingly to unambiguously address the worst offending terms, and leave nuance to other methods.
Table 3: Sample queries with stoplist result
The final strategy we would like to explore uses a machine learning model to detect possible FHA violations and maintain FHA awareness. Based on our learning from deploying the prompt-based and stop-list-based solutions, one might establish the following requirements for such a model: it should understand the semantic context of a query rather than rely on lexical matches alone, it should behave more predictably than a prompted LLM, and it should allow the precision/recall tradeoff to be tuned explicitly.
Based on these requirements, a fine-tuned, natural-language sequence-classification model could be built and deployed. To evaluate this approach, we implemented a classification model based on Bidirectional Encoder Representations from Transformers (BERT), a now well-known technique introduced by Devlin et al. (2019). We fine-tuned the base sequence-classification model with a binary cross-entropy loss function on labeled examples from our domain, which equipped it to recognize and flag potential instances of housing discrimination.
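As a rough sketch of this approach, the snippet below fine-tunes a BERT sequence classifier with the Hugging Face transformers and datasets libraries. The model name, tiny example data, and hyperparameters are illustrative assumptions, and this two-label setup uses standard cross-entropy, an equivalent alternative to a single-logit binary cross-entropy formulation.

```python
# Sketch of fine-tuning a BERT sequence classifier for compliant (0) vs
# non-compliant (1) text. Data, paths, and hyperparameters are illustrative.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # cross-entropy over two labels

# Tiny illustrative dataset; a real one would contain thousands of labeled
# queries and sentence-level responses.
examples = {
    "text": ["Homes with a wheelchair-accessible entrance in Denver",
             "Homes in a neighborhood without families with young kids"],
    "label": [0, 1],
}
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fhc-bert", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```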
Because there was no existing labeled dataset for the fair housing classification task, our first step was to build one; the steps we followed are illustrated in Figure 1, and statistics for the resulting dataset are presented in Table 4.
Figure 1: Data Collection
Table 4: Dataset Statistics
The precision lift obtained by including sentence-level response data in the training data is shown in Figure 2 (below), using the precision of the model trained without the response data as the baseline. In the figure, precision lift is computed as the relative improvement in precision at a comparable level of recall for the models with and without the response data. We evaluate the model variants on a mix of input queries and output responses. As shown, the precision lift is maximized when recall is around 0.6.
Figure 2: Precision lift for model with included response data for training
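For reference, this kind of lift computation can be sketched as follows with scikit-learn: pick a target recall, take the best precision each model variant achieves at or above that recall, and report the relative improvement. The synthetic labels and scores below are placeholders, not our evaluation data, and the exact matching of recall levels is an assumption.

```python
# Sketch of computing precision lift at a matched recall level for two model
# variants (with vs. without response data). Labels/scores are synthetic.
import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, scores, target_recall):
    """Highest precision achievable at or above the target recall."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return precision[recall >= target_recall].max()

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                      # placeholder labels
scores_baseline = rng.random(200) * 0.6 + y_true * 0.3     # weaker separation
scores_with_resp = rng.random(200) * 0.4 + y_true * 0.5    # stronger separation

for r in (0.4, 0.6, 0.8):
    p_base = precision_at_recall(y_true, scores_baseline, r)
    p_new = precision_at_recall(y_true, scores_with_resp, r)
    print(f"recall={r}: precision lift = {(p_new - p_base) / p_base:.1%}")
```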
Finally, Table 6 summarizes the three approaches discussed, comparing their desirable properties and their dominant pros and cons.
Table 6: Comparison of the three compliance strategies
In the previous section, we reviewed three distinct strategies for making an LLM-based conversational real-estate assistant aware of Fair Housing Act requirements. We discussed the pros and cons of each method, and even though the Classifier method produced the expected results for all three of our sample test cases, there are still scenarios where the other methods might perform better.
Consequently, an optimal LLM Fair Housing Guardrails system should combine all three strategies into a single comprehensive system. We have designed such a system and deployed it internally for testing, as described below.
The guardrails system serves as a crucial element in the overall LLM application. As mentioned above, it can be integrated both as a pre-processing and post-processing component. As a pre-processing component, the fair housing guardrail analyzes and categorizes user input before it is processed by the LLM. This allows for the early detection and filtering of potentially non-compliant requests, helping maintain the integrity of the system.
In its role as a post-processing component, the fair housing guardrail reviews the output from the LLM, flagging any content that might potentially violate fair housing regulations before it is displayed to the user. For flagged content, a predefined message is displayed instead of the LLM output. This double layer of scrutiny, applied both before and after the LLM’s operations, adds a robust safety net to the system. A simplified diagram of the system is shown in Figure 3 (below).
Figure 3: Fair Housing Guardrail for LLM
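Putting the pieces together, a simplified version of the flow in Figure 3 might look like the sketch below. It reuses the hypothetical chat, stop_list_check, and REFUSAL_MESSAGE helpers from the earlier sketches and assumes a classifier_score function wrapping the fine-tuned model; the threshold and structure are illustrative, not the deployed implementation.

```python
# Sketch of the combined guardrail flow: stop list and classifier are applied
# to the user input (pre-processing) and to the LLM output (post-processing).
NON_COMPLIANT_THRESHOLD = 0.5  # tunable knob for the precision/recall trade-off

def classifier_score(text: str) -> float:
    """Placeholder for the fine-tuned classifier's non-compliance probability."""
    raise NotImplementedError

def is_non_compliant(text: str) -> bool:
    """Combine the deterministic stop list with the contextual classifier."""
    if stop_list_check(text) is not None:    # explicit lexical match
        return True
    return classifier_score(text) >= NON_COMPLIANT_THRESHOLD

def guarded_chat(history: list[dict], user_message: str) -> str:
    # Pre-processing: filter non-compliant requests before the LLM sees them.
    if is_non_compliant(user_message):
        return REFUSAL_MESSAGE
    reply = chat(history, user_message)      # prompt-level guidance still applies
    # Post-processing: screen the LLM output before it reaches the user.
    if is_non_compliant(reply):
        return REFUSAL_MESSAGE
    return reply
```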
The components of the guardrail system can be further updated and iteratively improved based on user feedback and data sampled for human review. User feedback is invaluable for improving the components of the fair housing guardrail. It not only provides real-world examples to learn from, but also exposes a wide range of phrasings, contexts, and nuances that might not have been encountered during initial training and testing.
In addition, periodically sampling data for human review allows for the identification of false positives and false negatives. This not only helps rectify immediate inaccuracies by updating the stop-list component, but also contributes additional training examples for the classifier component that lie closer to the decision boundary, which helps refine its precision over time. Through a combination of these iterative-improvement practices, we can continuously enhance the guardrail’s accuracy and reliability in detecting potential fair housing violations, making it a vital tool for promoting fair housing practices as we invest in deploying LLM-based solutions in the housing domain.
Overall, the comprehensive solution described in this section allows us to leverage the expressive power of the prompt-based solution, with the explicit control of the stop list and finally with the ability to tune the precision and recall of the classifier model. It is worth noting that the standalone service can also be used in non-LLM applications requiring Natural Language Processing, for example in call transcript analytics.
The work described in this blog post constitutes an initial version of a guardrails system designed to align real-estate-focused conversational experiences with anti-discrimination requirements, making them safe for our users. However, much work still needs to be done, and the opportunity for further improvements remains.
Enhancing Model Features
Our aim is to fine-tune the classifier by augmenting it with additional features that can capture more contextual nuances and complexities. This could involve experimenting with more advanced transformer architectures, expanding the features used in our model, or testing alternative classification algorithms. In particular, we will be working on ways to better handle ambiguous cases, context-dependent violations, and evolving linguistic trends in real estate communication.
Expanding Training Data
We plan to collect more diverse and comprehensive training examples to continuously train our classifier. This could involve developing partnerships that provide access to additional data, or creating simulated data that broadens what the classifier learns. By exposing the classifier to a wider range of examples, we hope to increase its ability to accurately detect potential violations, even in subtle or complex cases.
Open Sourcing
As part of our commitment to fostering an ethical, fair, and transparent real estate industry, we are exploring the possibility of open sourcing our classifier and the supporting data. We hope this will encourage contribution and collaboration from the community, leading to further innovation and expanded applications. It would also allow other organizations to leverage our work to maintain compliance with fair housing requirements and ensure equitable treatment for all.
We envision an active, ongoing process of enhancements and applications, with the community at large playing a pivotal role in shaping the development and direction of these initiatives. We anticipate welcoming external contributions, which infuse our efforts with fresh perspectives and expanded expertise to foster a fair and equitable real estate landscape.
A special thanks to the Zillow Group Legal & Compliance Team, as well as other LLM contributors and practitioners at Zillow, for all their help with collecting data, labeling examples, testing out the system and providing feedback.