Problems with Building Scalable Systems

kapil sharma
3 min read · Sep 21, 2018


Concurrency:

Problem 1: How to avoid processing the same job on multiple nodes?

  • We can choose the processing node from the job data itself, for example by hashing the job's key and mapping the hash to a node, so the same job always lands on the same node.
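
The hash-based assignment above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `node_for_job` and `should_process`, the choice of SHA-256, and the simple modulo mapping are all assumptions made for the example.

```python
import hashlib


def node_for_job(job_key: str, num_nodes: int) -> int:
    """Map a job key to a node index using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash()
    because hash() of a str is randomized per process, so different
    nodes would disagree about who owns a job.
    """
    digest = hashlib.sha256(job_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def should_process(job_key: str, my_node: int, num_nodes: int) -> bool:
    """Return True only on the single node that owns this job."""
    return node_for_job(job_key, num_nodes) == my_node
```

Every node can run `should_process` on its own and reach the same answer without coordination. One trade-off of the modulo mapping is that changing `num_nodes` reassigns most jobs; consistent hashing is a common way to reduce that churn.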

Problem 2: How to generate a unique number or ID for each job (as in a tiny-URL service) when jobs are processed on many nodes?

  • We can keep dedicated nodes for generating unique IDs. The problem is that a processing node has to wait until the dedicated node returns the ID.
  • To solve this, the dedicated host can generate unique IDs in advance, and processing nodes cache them for faster performance.
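
A rough sketch of the pre-generated ID idea, assuming the dedicated node hands out disjoint numeric ranges and each processing node caches one range at a time. The class names `IdAllocator` and `ProcessingNode` and the `batch_size` parameter are hypothetical, and the allocator here is an in-process object standing in for a remote service.

```python
import threading


class IdAllocator:
    """Stands in for the dedicated ID node: hands out disjoint ranges."""

    def __init__(self, start: int = 0):
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count: int) -> range:
        # The lock guarantees that two callers never get overlapping ranges.
        with self._lock:
            first = self._next
            self._next += count
        return range(first, first + count)


class ProcessingNode:
    """Caches a batch of IDs so most calls need no round trip."""

    def __init__(self, allocator: IdAllocator, batch_size: int = 1000):
        self._allocator = allocator
        self._batch_size = batch_size
        self._cached = iter(())  # empty until the first request

    def next_id(self) -> int:
        try:
            return next(self._cached)
        except StopIteration:
            # Cache exhausted: fetch a fresh range from the allocator.
            self._cached = iter(self._allocator.reserve(self._batch_size))
            return next(self._cached)
```

With this approach a processing node contacts the allocator only once per `batch_size` IDs. IDs cached by a node that crashes are simply lost, which leaves gaps but never duplicates.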

Designing Web Crawler:

Problem 1: What is the problem with long polling?

  • How will clients maintain an open connection with the server?
  • How can the server keep track of all open connections to efficiently route messages to the users?
  • What will happen when the server receives a message for a user who has gone offline?
  • How many chat servers do we need?
  • How to know which server holds the connection to which user?

Problem 2: How to crawl (BFS/DFS)?

Problem 3: How to crawl heavy pages and frequently changing pages?

Problem 4: How will you avoid overloading the URL frontier and avoid multiple nodes processing the same URL?

Problem 5: How big would our URL frontier be?

Problem 6: Maintaining robots.txt rules?

Problem 7: How to handle duplicate documents?

Problem 8: DNS Problem?

Problem 9: URL dedup test and Bloom filter?
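
As a small illustration of how a Bloom filter could serve as the URL dedup test, here is a minimal sketch. The class name, the default sizes, and the use of SHA-256 with a per-hash prefix are assumptions for the example; a production crawler would size the filter from its expected URL count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set-membership test with no false negatives and rare false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by prefixing the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A crawler would call `might_contain(url)` before enqueueing and `add(url)` afterwards. Because a false positive makes the crawler skip a URL it has not actually seen, the filter trades a small amount of coverage for a large saving in memory.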

Problem 10: Crawler Traps?

Facebook Chat Design

What is HBase (used in Facebook chat)?

Both of our requirements can be easily met with a wide-column database solution like HBase. HBase is a column-oriented key-value NoSQL database that can store multiple values against one key in multiple columns. HBase is modeled after Google’s BigTable and runs on top of the Hadoop Distributed File System (HDFS). HBase buffers new data in memory and, once the buffer is full, dumps it to disk. This way of storing data helps not only with writing a lot of small data quickly but also with fetching rows by key or scanning ranges of rows. HBase is also an efficient database for storing variable-size data, which our service also requires.

Problem: How will you maintain the state of online users and broadcast when they go offline (Facebook Chat)?

Twitter Search:

Twitter search is for searching statuses about people, places, cities, etc.

  • To store a status, generate a hash that identifies the specific server to store it on.
  • When someone searches for a status, we again generate the hash of the searched status and go to that specific server.
  • The rest is the same as usual: sharding, building indexes, estimating how big the index would be, etc.
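
The store-then-search flow above can be sketched as a tiny sharded index. This sketch assumes the hash is taken over each word of the status rather than the whole status, so that a search term lands on the shard that holds its matches; that choice, along with `NUM_SHARDS`, `shard_for`, `index_status`, and `search`, is an assumption for illustration, and the shards are in-memory dictionaries standing in for separate servers.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed number of index servers

# Each shard maps a term to the set of status IDs that contain it.
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]


def shard_for(term: str) -> int:
    """Pick the shard responsible for a term using a stable hash."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def index_status(status_id: int, text: str) -> None:
    """Store the status ID under each of its words, on that word's shard."""
    for term in set(text.lower().split()):
        shards[shard_for(term)][term].add(status_id)


def search(term: str) -> set:
    """Hash the search term again and look only at the owning shard."""
    term = term.lower()
    return shards[shard_for(term)].get(term, set())
```

For example, after `index_status(1, "Rain in Paris")` and `index_status(2, "Paris is sunny")`, calling `search("paris")` consults a single shard and returns both status IDs.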

Problem 1: How to maintain the ranking?

  • The server can rank results based on the number of likes, views, popularity, etc.

Problem 2: What if a topic/status becomes hot?

  • This is where caching plays an important role.
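
One common way to cache hot topics is a least-recently-used (LRU) cache in front of the index servers. The sketch below is illustrative only: the `LRUCache` class and its `capacity` parameter are assumptions, and the eviction policy is a choice made for the example rather than something the design above prescribes.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps the most recently used search results in memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None  # cache miss: caller falls back to the index
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
```

A hot topic is requested often, so it keeps moving to the recently used end and stays cached, while rarely searched topics are evicted first.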
