Assume that a growing enterprise has outgrown its current computer system and is purchasing a new parallel computer. If the growth has resulted in many more transactions per unit time, but the length of individual transactions has not changed, what measure is most relevant: speedup, batch scale up, or transaction scale up? Why?
> Explain what application characteristics would help you decide which of TPC-C, TPC-H, or TPC-R best models the application.
> Why was the TPC-D benchmark replaced by the TPC-H and TPC-R benchmarks?
> List at least four features of the TPC benchmarks that help make them realistic and dependable measures.
> Suppose the price of memory falls by half, and the speed of disk access (number of accesses per second) doubles, while all other factors remain the same. What would be the effect of this change on the 5-minute and 1-minute rule?
> What is the motivation for splitting a long transaction into a series of small ones? What problems could arise as a result, and how can these problems be averted?
> Show that, in SQL, <> all is identical to not in.
> The Google search engine provides a feature whereby web sites can display advertisements supplied by Google. The advertisements supplied are based on the contents of the page. Suggest how Google might choose which advertisements to supply for a page, giv
> Suppose that your application has transactions that each access and update some that all internal nodes of the B+-tree are in memory, but only a very small fraction of the leaf pages can fit in memory. Explain how to calculate the minimum number of disks
> When carrying out performance tuning, should you try to tune your hardware (by adding disks or memory) first, or should you try to tune your transactions (by adding indices or materialized views) first? Explain your answer.
> Database tuning: a. What are the three broad levels at which a database system can be tuned to improve performance? b. Give two examples of how tuning can be done for each of the levels.
> Our description of static hashing assumes that a large contiguous stretch of disk blocks can be allocated to a static hash table. Suppose you can allocate only C contiguous blocks. Suggest how to implement the hash table, if it can be much larger than C
> Why is a hash structure not the best choice for a search key on which range queries are likely?
> What are the causes of bucket overflow in a hash file organization? What can be done to reduce the occurrence of bucket overflows?
> Explain the distinction between closed and open hashing. Discuss the relative merits of each technique in database applications.
> Suppose you want to use the idea of a quadtree for data in three dimensions. How would the resultant data structure (called an octree) divide up space?
> The stepped merge variant of the LSM tree allows multiple trees per level. What are the tradeoffs in having more trees per level?
> For correct execution of a replicated state machine, the actions must be deterministic. What could happen if an action is non-deterministic?
> Web sites that want to get some publicity can join a web ring, where they create links to other sites in the ring in exchange for other sites in the ring creating links to their site. What is the effect of such rings on popularity ranking techniques such
> Write the following queries in SQL, using the university schema. a. Find the ID and name of each student who has taken at least one Comp. Sci. course; make sure there are no duplicate names in the result. b. Find the ID and name of each student who has n
> Why is the notion of term important when an election is used to choose a coordinator? What are the analogies between elections with terms and elections used in a democracy?
> Merkle trees can be made short and fat (like B+-trees) or thin and tall (like binary search trees). Which option would be better if you are comparing data across two sites that are geographically separated, and why?
> Spanner provides read-only transactions a snapshot view of data, using multiversion two-phase locking. a. In the centralized multiversion 2PL scheme, read-only transactions never wait. But in Spanner, reads may have to wait. Explain why. b. Using an o
> Discuss the advantages and disadvantages of the two methods that we presented in Section 23.3.4 for generating globally unique timestamps.
> If we apply a distributed version of the multiple-granularity protocol of Chapter 18 to a distributed database, the site responsible for the root of the DAG may become a bottleneck. Suppose we modify that protocol as follows: • Only intention-mode locks
> In the majority protocol, what should the reader do if it finds different values from different copies, to (a) decide what is the correct value, and (b) to bring the copies back to consistency? If the reader does not bother to bring the copies back to consi
> Give an example where the read one, write all available approach leads to an erroneous state.
> What characteristics of an application make it easy to scale the application by using a key-value store, and what characteristics rule out deployment on key-value stores?
> Consider a system that is processing a stream of tuples for a relation r with attributes (A, B, C, timestamp). Suppose the goal of a parallel stream processing system is to compute the number of tuples for each A value in each 5-minute window (based on the
> Suppose you wish to perform keyword querying on a set of tuples in a database, where each tuple has only a few attributes, each containing only a few words. Does the concept of term frequency make sense in this context? And that of inverse document frequ
> The attribute on which a relation is partitioned can have a significant impact on the cost of a query. a. Given a workload of SQL queries on a single relation, what attributes would be candidates for partitioning? b. How would you choose between the alter
> Using the university schema, write an SQL query to find section(s) with maximum enrollment. The result columns should appear in the order “course_id, sec_id, year, semester, num”. (It may be convenient to use the with construct.)
> What is the motivation for work-stealing with virtual nodes in a shared-memory setting? Why might work-stealing not be as efficient in a shared-nothing setting?
> Suppose you wish to handle a workload consisting of a large number of small transactions by using shared-nothing parallelism. a. Is intraquery parallelism required in such a situation? If not, why, and what form of parallelism is appropriate? b. What fo
> Describe a good way to parallelize each of the following: a. The difference operation b. Aggregation by the count operation c. Aggregation by the count distinct operation d. Aggregation by the avg operation e. Left outer join, if the join condition involv
> Can partitioned join be used for r ⋈r? A
> Joins can be expensive in a key-value store, and difficult to express if the system does not support SQL or a similar declarative query language. What can an application developer do to efficiently get results of join or aggregate queries in such a setting?
> Why is it easier for a distributed file system such as GFS or HDFS to support replication than it is for a key-value store?
> What is the motivation for storing related records together in a key-value store? Explain the idea using the notion of an entity group.
> What factors could result in skew when a relation is partitioned on one of its attributes by: a. Hash partitioning? b. Range partitioning? In each case, what can be done to reduce the skew?
> Consider the E-R diagram in Figure 8.9, which contains specializations, using subtypes and subtables. a. Give an SQL schema definition of the E-R diagram. b. Give an SQL query to find the names of all people who are not secretaries. c. Give an SQL query t
> For each of the three partitioning techniques, namely, round-robin, hash partitioning, and range partitioning, give an example of a query for which that partitioning technique would provide the fastest response.
> Suppose that a major database vendor offers its database system (e.g., Oracle, SQL Server, DB2) as a cloud service. Where would this fit among the cloud-service models? Why?
> Using the university schema, write an SQL query to find the number of students in each section. The result columns should appear in the order “course_id, sec_id, year, semester, num”. You do not need to output sections with 0 students.
> In a shared-nothing system, data access from a remote node can be done by remote procedure calls, or by sending messages. But remote direct memory access (RDMA) provides a much faster mechanism for such data access. Explain why.
> Assume we have data items d1, d2, ..., dn, with each di protected by a lock stored in memory location Mi. a. Describe the implementation of lock-X(di) and unlock(di) via the use of the test-and-set instruction. b. Describe the implementation of lock-X(di)
> Memory systems today are divided into multiple modules, each of which can be serving a separate request at a given time, in contrast to earlier architectures where there was a single interface to memory. What impact does such a memory architecture have on
> What are the factors that can work against linear scale up in a transaction processing system? Which of the factors are likely to be the most important in each of the following architectures: shared-memory, shared-disk, and shared-nothing?
> Is it wise to allow a user process to access the shared-memory area of a database system? Explain your answer.
> Database systems are typically implemented as a set of processes (or threads) accessing shared memory. a. How is access to the shared-memory area controlled? b. Is two-phase locking appropriate for serializing access to the data structures in shared memo
> Consider the schemas for the table people, and the tables students and teachers, which were created under people, in Section 8.2.1.3. Give a relational schema in third normal form that represents the same information. Recall the constraints on subtable
> If an enterprise uses its own ERP application on a cloud service under the platform-as-a-service model, what restrictions would there be on when that enterprise may upgrade the ERP system to a new version?
> Consider a bank that has a collection of sites, each running a database system. Suppose the only way the databases interact is by electronic transfer of money between themselves, using persistent messaging. Would such a system qualify as a distributed da
> Suppose there is a transaction that has been running for a very long time but has performed very few updates. a. What effect would the transaction have on recovery time with the recovery algorithm of Section 19.4, and with the ARIES recovery algorithm? b.
> Using the university schema, write an SQL query to find the ID and title of each course in Comp. Sci. that has had at least one section with afternoon hours (i.e., ends at or after 12:00). (You should eliminate duplicates if any.)
> Consider the log in Figure 19.5. Suppose there is a crash just before the log
> Explain why logical undo logging is used widely, whereas logical redo logging (other than physiological redo logging) is rarely used.
> Physiological redo logging can reduce logging overheads significantly, especially with a slotted page record organization. Explain why.
> Suppose two-phase locking is used, but exclusive locks are released early, that is, locking is not done in a strict two-phase manner. Give an example to show why transaction rollback can result in a wrong final state, when using the log-based recovery al
> Outline the drawbacks of the no-steal and force buffer management policies.
> Explain how the database may become inconsistent if some log records pertaining to a block are not output to stable storage before the block is output to disk.
> Redesign the database of Exercise 8.4 into first normal form and fourth normal form. List any functional or multivalued dependencies that you assume. Also list all referential-integrity constraints that should be present in the first and fourth normal form
> Stable storage cannot be implemented. a. Explain why it cannot be. b. Explain how database systems deal with this problem.
> For each of the following requirements, identify the best choice of degree of durability in a remote backup system: a. Data loss must be avoided, but some loss of availability may be tolerated. b. Transaction commit must be accomplished quickly, even at
> Explain the difference between a system crash and a “disaster.”
> In the ARIES recovery algorithm: a. If at the beginning of the analysis pass, a page is not in the checkpoint dirty page table, will we need to apply any redo records to it? Why? b. What is RecLSN, and how is it used to minimize unnecessary redos?
> Rewrite the preceding query, but also ensure that you include only instructors who have given at least one other non-null grade in some course.
> Compare log-based recovery with the shadow-copy scheme in terms of their overheads for the case when data are being added to newly allocated disk pages (in other words, there is no old value to be restored in case the transaction aborts).
> Consider the log in Figure 19.7. Suppose there is a crash during recovery, just before the operation abort log record is written for operation O1. Explain what will happen when the system recovers again.
> Explain the difference between the three storage types (volatile, nonvolatile, and stable) in terms of I/O cost.
> Suppose the lock hierarchy for a database consists of database, relations, and tuples. a. If a transaction needs to read a lot of tuples from a relation r, what locks should it acquire? b. Now suppose the transaction wants to update a few of the tuples i
> The multiple-granularity protocol rules specify that a transaction Ti can lock a node Q in S or IS mode only if Ti currently has the parent of Q locked in either IX or IS mode. Given that SIX and S locks are stronger than IX or IS locks, why does the pro
> Describe the differences in meaning between the terms relation and relation schema.
> Although SIX mode is useful in multiple-granularity locking, an exclusive and intention-shared (XIS) mode is of no use. Why is it useless?
> In multiple-granularity locking, what is the difference between implicit and explicit locking?
> If deadlock is avoided by deadlock-avoidance schemes, is starvation still possible? Explain your answer.
> Under what conditions is it less expensive to avoid deadlock than to allow deadlocks to occur and then to detect them?
> Consider a variant of the tree protocol called the forest protocol. The database is organized as a forest of rooted trees. Each transaction Ti must follow the following rules: • The first lock in each tree may be on any data item. • The second, and all su
> Using the university schema, write an SQL query to find the ID and name of each instructor who has never given an A grade in any course she or he has taught. (Instructors who have never taught a course trivially satisfy this condition.)
> Consider the following locking protocol: All items are numbered, and once an item is unlocked, only higher-numbered items may be locked. Locks may be released at any time. Only X-locks are used. Show by an example that this protocol does not guarantee se
> Most implementations of database systems use strict two-phase locking. Suggest three reasons for the popularity of this protocol.
> Many transactions update a common item (e.g., the cash balance at a branch) and private items (e.g., individual account balances). Explain how you can increase concurrency (and throughput) by ordering the operations of the transaction.
> Give example schedules to show that with key-value locking, if lookup, insert, or delete does not lock the next-key value, the phantom phenomenon could go undetected.
> Show that the following decomposition of the schema R of Exercise 7.1 is not a lossless decomposition:
> Explain the reason for the use of degree-two consistency. What disadvantages does this approach have?
> Explain the phantom phenomenon. Why may this phenomenon lead to an incorrect concurrent execution despite the use of the two-phase locking protocol?
> Consider a relation r(A, B, C) and a transaction T that does the following: find maximum A value. Assume that an index is used to find the maximum A value. a. Suppose that the transaction locks each tuple it reads in S mode, and the tuple it creates in X
> Outline the key similarities and differences between the timestamp-based implementation of the first-committer-wins version of snapshot isolation, described in Exercise 18.15, and the optimistic-concurrency control-without-read-validation scheme, descri
> As discussed in Exercise 18.15, snapshot isolation can be implemented using a form of timestamp validation. However, unlike the multiversion timestamp-ordering scheme, which guarantees serializability, snapshot isolation does not guarantee serializability
> Under a modified version of the timestamp protocol, we require that a commit bit be tested to see whether a read request must wait. Explain how the commit bit can prevent cascading abort. Why is this test not necessary for write requests?
> Consider the following SQL query on the university schema: select avg(salary) - (sum(salary) / count(*)) from instructor. We might expect that the result of this query is zero since the average of a set of numbers is defined to be the sum of the numbers d
> List four significant differences between a file-processing system and a DBMS.
> Show that there are schedules that are possible under the two-phase locking protocol but not possible under the timestamp protocol, and vice versa.
> When a transaction is rolled back under timestamp ordering, it is assigned a new timestamp. Why can it not simply keep its old timestamp?
> Using the functional dependencies of Exercise 7.6, compute B+.
> What benefit does strict two-phase locking provide? What disadvantages result?
> For each of the following isolation levels, give an example of a schedule that respects the specified level of isolation but is not serializable: a. Read uncommitted b. Read committed c. Repeatable read
> Explain why the read-committed isolation level ensures that schedules are cascade-free.
> Why do database systems support concurrent execution of transactions, despite the extra effort needed to ensure that concurrent execution does not cause any problems?
> What is a recoverable schedule? Why is recoverability of schedules desirable? Are there any circumstances under which it would be desirable to allow nonrecoverable schedules? Explain your answer.
> Give an example of a serializable schedule with two transactions such that the order in which the transactions commit is different from the serialization order.
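
Several of the exercises above ask for SQL queries on the university schema. As a convenience for trying out answers, the following is a minimal sketch of a scratch database using Python's standard sqlite3 module. The table definitions are a trimmed-down assumption inferred from the column names used in the exercises (course_id, sec_id, semester, year, ID, name, grade), the sample rows are made up, and the helper names make_university_db and run are hypothetical. It does not answer any exercise, and SQLite's dialect may differ in places from the SQL assumed by the exercises.

```python
import sqlite3

# Hypothetical, trimmed-down university schema. Table and column names are
# assumptions inferred from the exercises; the real schema has more tables,
# columns, and constraints.
SCHEMA = """
create table course  (course_id text primary key, title text, dept_name text);
create table section (course_id text, sec_id text, semester text, year integer,
                      primary key (course_id, sec_id, semester, year));
create table student (ID text primary key, name text, dept_name text);
create table takes   (ID text, course_id text, sec_id text, semester text,
                      year integer, grade text,
                      primary key (ID, course_id, sec_id, semester, year));
"""

# Made-up sample rows, present only so that queries return something.
COURSES = [("CS-101", "Intro. to Computer Science", "Comp. Sci."),
           ("BIO-101", "Intro. to Biology", "Biology")]
SECTIONS = [("CS-101", "1", "Fall", 2017), ("BIO-101", "1", "Summer", 2017)]
STUDENTS = [("1", "Ann", "Comp. Sci."), ("2", "Bob", "Biology")]
TAKES = [("1", "CS-101", "1", "Fall", 2017, "A"),
         ("2", "CS-101", "1", "Fall", 2017, None),   # grade not yet assigned
         ("2", "BIO-101", "1", "Summer", 2017, "B")]


def make_university_db():
    """Return an in-memory SQLite connection loaded with the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into course values (?, ?, ?)", COURSES)
    conn.executemany("insert into section values (?, ?, ?, ?)", SECTIONS)
    conn.executemany("insert into student values (?, ?, ?)", STUDENTS)
    conn.executemany("insert into takes values (?, ?, ?, ?, ?, ?)", TAKES)
    conn.commit()
    return conn


def run(conn, sql):
    """Execute one query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()
```

A candidate answer can then be checked interactively, for example with run(make_university_db(), "select name from student order by ID"). The sample data is deliberately tiny, so it helps catch syntax errors and obvious logic slips rather than prove a query correct.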