Buying and selling a home is a hassle, but it has surely gotten a lot less painful and perhaps much more of an orderly market thanks to the advent of online real estate sites like Trulia and Zillow. Trulia, founded in 2005, was one of the pioneers in playing matchmaker between sellers and buyers of homes on the web, but Zillow, founded a year later, has come out on top and a year and a half ago bought Trulia for $3.5 billion.
Zillow Group, which is headquartered in Seattle, is the result of the combination of the two companies, and it brought in $679.9 million in revenues last year from real estate agent memberships, featured listings fees, advertising, and mortgage lending quotes. This business relies on a constant churn of data as homes go on the market and come off, and increasingly real estate sites are gathering and maintaining information on properties that are not for sale so prospective buyers and sellers have a better sense of the economics of the surrounding areas where they are looking to buy.
Trulia and Zillow were founded just as the techies of the world were contemplating the big data revolution unleashed by Google’s MapReduce paper in December 2004, which was followed two years later by an open source implementation from Yahoo, the Hadoop distributed analytics framework. Yahoo open sourced the code, and the analytics world has been forever changed. The upshot is that, compared to the more traditional relational database and data warehousing techniques that prevailed at their founding and that were used for many years, it takes a lot less money and a lot less iron to do some of the sophisticated data gathering and analysis that businesses such as Trulia and Zillow have as their very purpose.
Trulia has been using Hadoop analytics and Spark in-memory overlays as its back-end systems for many years now, and both companies do a fair amount of storing of image data and various processing jobs on the Amazon Web Services public cloud. But for the moment, Zane Williamson, senior director of DevOps at Trulia, tells The Next Platform that the company runs its own Hadoop clusters, which come in various sizes and which do the daily number crunching on the more than 1 TB of data that comes in from public records, listings, and user activities on its own site relating to homes. This data is diced and sliced to deliver personalized recommendations through the web, email, and messaging to buyers, and it is also used to provide assistance to sellers as they set the price for their homes.
Because this data is largely textual in nature, Trulia does not need massive Hadoop clusters, and Williamson says that Trulia has six clusters, which range in size from a dozen nodes at the smallest to 40 nodes at the largest. This is the typical size of Hadoop clusters in many enterprises today, and in this case the active dataset needed to do the matchmaking between buyers and sellers is only several hundred terabytes of aggregate capacity. (Both Zillow and Trulia have long since moved the job of storing photos for properties being sold out onto the AWS cloud because supporting this in-house would be very expensive compared to what cloud providers charge.)
The Trulia Hadoop clusters run Cloudera’s CDH commercial distribution, which has a certain amount of monitoring in it, but the Trulia techies, like many other Hadoop shops, cobbled together their own monitoring and management system to try to peer into the clusters when things go wrong – and things do indeed go wrong – to try to troubleshoot performance degradation and other issues. The management stack included the typical open source tools, such as the Ganglia and Nagios monitoring systems, which each do slightly different things, coupled to a Graphite visualization front end and a homegrown Python web interface. Using such tools as a foundation, Trulia built its own monitoring tool to gather metrics and monitor workflows to help cluster administrators figure out the causes of slowdowns in the production of information that needs to be pushed out to Trulia subscribers every day.
“Building and maintaining this monitoring stack was quite a time sink in terms of engineering resources, and it was not up our alley of expertise necessarily,” Williamson tells us. “It wasn’t providing us a real intimate view into the cluster, and we didn’t have quality of service or resource management, either. It was a pretty difficult setup to administer and scale.”
So Trulia went on the hunt and found a set of Hadoop management and monitoring tools from Pepperdata to replace its homegrown, open source setup. The Pepperdata tools can track the workflow between the Hadoop clusters as that incremental 1 TB of data that comes in each day is processed in many ways and staged through the clusters.
Last October, Trulia acquired the Hadoop management tools from Pepperdata, added agents to its Hadoop nodes, and started gathering up performance metrics as its workloads were running, and it was a bit of an eye opener as to what was really going on inside of the machines. Having gathered up performance data as its jobs were running, Trulia then activated the QoS features of the Pepperdata tools, and more recently it has activated new alerting add-ons for the tools so problems can be identified in the Hadoop machines before they start causing performance issues on the distributed and interlocking applications that generate the daily updates for home buyers and sellers.
“In the past, when we had issues with a workflow or a job running on the cluster, it would take a day or more sometimes, depending on which job and the complexity of the issue, [to figure out] what could be causing a slowdown in our workflows,” explains Williamson. “Our workflows are serial in nature, and at any given point in that workflow, there could be a problem. Trying to drill down and figure issues out was time consuming. Once we had Pepperdata in place, we could find the weak points and fix problems orders of magnitude faster – within minutes to hours instead of days – and dial in on the issue and make the adjustments that we need.”
The Pepperdata tools keep track of the amount of CPU, memory, disk I/O, and network I/O in Hadoop cluster nodes and watch for contention as jobs of various priorities fight for those resources. If a low-priority job starts to gobble up resources, then it is throttled so the high-priority jobs can complete. Sean Suchter, who was one of the leaders of the search engine team at Yahoo a decade ago when Hadoop was invented and who is now CEO and co-founder of Pepperdata, says that the company does not disclose prices for its tools, which are based on an annual per-node subscription, but the fact that it can make fixing problems an order of magnitude faster or boost Hadoop throughput by 30 percent to 50 percent helps Hadoop shops get a return on their licensing investment very quickly.
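To make the throttling idea concrete, here is a minimal sketch in Python of priority-based resource allocation on a single node. It is not Pepperdata's algorithm, which the company has not described here. The `Job` class, the `allocate_cpu` function, the job names, and the numbers are all hypothetical, and the sketch only looks at CPU, where the real tools also watch memory, disk I/O, and network I/O.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int      # assumption: a higher number means a more important job
    cpu_demand: float  # cores the job would use if left unchecked


def allocate_cpu(jobs, capacity):
    """Grant CPU in priority order; lower-priority jobs get whatever is left."""
    grants = {}
    remaining = capacity
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        grant = min(job.cpu_demand, remaining)
        grants[job.name] = grant
        remaining -= grant
    return grants


# Hypothetical node with 16 cores and two competing jobs.
jobs = [
    Job("nightly-report", priority=1, cpu_demand=12.0),
    Job("daily-recommendations", priority=10, cpu_demand=10.0),
]
grants = allocate_cpu(jobs, capacity=16.0)
print(grants)
```

In this made-up case the high-priority job gets the 10 cores it asks for, and the low-priority job is held to the 6 cores that remain rather than the 12 it wants. A production scheduler would re-evaluate grants continuously as demand shifts.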