
Question: The map-reduce framework is quite useful


The map-reduce framework is quite useful for creating inverted indices on a set of documents. An inverted index stores for each word a list of all document IDs that it appears in (offsets in the documents are also normally stored, but we shall ignore them in this question).
For example, if the input document IDs and contents are as follows:
1: data clean
2: data base
3: clean base
Then the inverted lists would be:
data: 1, 2
clean: 1, 3
base: 2, 3
Give pseudocode for map and reduce functions to create inverted indices on a given set of files (each file is a document). Assume the document ID is available using a function context.getDocumentID(), and the map function is invoked once per line of the document. The output inverted list for each word should be a list of document IDs separated by commas. The document IDs are normally sorted, but for the purpose of this question you do not need to sort them.
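
One possible shape for the two functions is sketched below in Python rather than pseudocode. It is an illustration under assumptions, not a reference solution: the `Context` class, its `get_document_id()` method (standing in for context.getDocumentID()), and the single-process `run_job` driver are all hypothetical stand-ins for what a real map-reduce framework would provide. The map function emits one (word, document ID) pair per word on the line; the reduce function joins the IDs collected for a word, dropping repeats that arise when a word occurs more than once in a document.

```python
from collections import defaultdict


class Context:
    """Hypothetical stand-in for the framework-provided context object."""

    def __init__(self, document_id):
        self._document_id = document_id

    def get_document_id(self):
        return self._document_id


def map_line(line, context):
    """Invoked once per line; emits a (word, document ID) pair per word."""
    doc_id = context.get_document_id()
    for word in line.split():
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Receives every ID emitted for one word; builds its inverted list."""
    unique_ids = list(dict.fromkeys(doc_ids))  # drop repeats, keep arrival order
    return word, ", ".join(str(d) for d in unique_ids)


def run_job(documents):
    """Toy single-process driver (assumption): map, group by key, reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        context = Context(doc_id)
        for line in text.splitlines():
            for word, emitted_id in map_line(line, context):
                groups[word].append(emitted_id)
    return dict(reduce_word(word, ids) for word, ids in groups.items())


index = run_job({1: "data clean", 2: "data base", 3: "clean base"})
for word, postings in index.items():
    print(f"{word}: {postings}")
```

On the three example documents this reproduces the inverted lists given in the question. In a real framework the grouping step in `run_job` is done by the shuffle phase, so only `map_line` and `reduce_word` would need to be written.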


> Explain how to split the hybrid hash-join operator into sub-operators to model pipelining. Also explain how this split is different from the split for a hash-join operator.

> Write pseudocode for an iterator that implements a version of the sort-merge algorithm where the result of the final merge is pipelined to its consumers. Your pseudocode must define the standard iterator functions open(), next(), and close(). Show w

> Suppose you have to compute A γ sum(C) (r) as well as A,B γ sum(C) (r). Describe how to compute these together using a single sorting of r.

> The hash-join algorithm as described in Section 15.5.5 computes the natural join of two relations. Describe how to extend the hash-join algorithm to compute the natural left outer join, the natural right outer join, and the natural full outer join. (Hint

> Use the definition of functional dependency to argue that each of Armstrong’s axioms (reflexivity, augmentation, and transitivity) is sound.

> List two reasons why null values might be introduced into a database.

> Estimate the number of block transfers and seeks required by your solution to Exercise 15.19 for r1 ⋈ r2, where r1 and r2 are as defined in Exercise 15.3.

> Design a variant of the hybrid merge-join algorithm for the case where both relations are not physically sorted, but both have a sorted secondary index on the join attributes.

> Why is it not desirable to force users to make an explicit choice of a query-processing strategy? Are there cases in which it is desirable for users to be aware of the costs of competing query-processing strategies? Explain your answer.

> Suppose you need to sort a relation of 40 gigabytes, with 4-kilobyte blocks, using a memory size of 40 megabytes. Suppose the cost of a seek is 5 milliseconds, while the disk transfer rate is 40 megabytes per second. a. Find the cost of sorting the relat

> An existence bitmap has a bit for each record position, with the bit set to 1 if the record exists, and 0 if there is no record at that position (for example, if the record were deleted). Show how to compute the existence bitmap from other bitmaps. Make

> What trade-offs do write-optimized indices pose as compared to B+-tree indices?

> Suppose a relation is stored in a B+-tree file organization. Suppose secondary indices store record identifiers that are pointers to records on disk. a. What would be the effect on the secondary indices if a node split happened in the file organization? b. W

> Suppose you have to create B+-tree index on a large number of names, where the maximum size of a name may be quite large (say 40 characters) and the average name is itself large (say 10 characters). Explain how prefix compression can be used to maximize t

> Suppose there is a relation r (A, B, C), with a B+-tree index with search key (A, B). a. What is the worst-case cost of finding records satisfying 10 < A < 50 using this index, in terms of the number of records retrieved n1 and the height h of the tree? b

> Why are certain functional dependencies called trivial functional dependencies?

> The solution presented in Section 14.3.5 to deal with non-unique search keys added an extra attribute to the search key. What effect could this change have on the height of the B+-tree?

> Consider the bank database of Figure 2.18. Give an expression in the relational algebra for each of the following queries: a. Find each loan number with a loan amount greater than $10000. b. Find the ID of each depositor who has an account with a balance

> For each B+-tree of Exercise 14.3, show the steps involved in the following queries: a. Find records with a search-key value of 11. b. Find records with a search-key value between 7 and 17, inclusive.

> What is the difference between a clustering index and a secondary index?

> Some attributes of relations may contain sensitive data, and may be required to be stored in an encrypted fashion. How does data encryption affect index schemes? In particular, how might it affect schemes that attempt to store data in sorted order?

> Spatial indices that can index spatial intervals can conceptually be used to index temporal data by treating valid time as a time interval. What is the problem with doing so, and how is the problem solved?

> When is it preferable to use a dense index rather than a sparse index? Explain your answer.

> Standard buffer managers assume each block is of the same size and costs the same to read. Consider a buffer manager that, instead of LRU, uses the rate of reference to objects, that is, how often an object has been accessed in the last n seconds. Suppose

> Give a normalized version of the Index metadata relation, and explain why using the normalized version would result in worse performance.

> In the sequential file organization, why is an overflow block used even if there is, at the moment, only one overflow record?

> Explain what is meant by repetition of information and inability to represent information. Explain why each of these properties may indicate a bad relational-database design.

> List two advantages and two disadvantages of each of the following strategies for storing a relational database: a. Store each relation in one file. b. Store multiple relations (perhaps even the entire database) in one file.

> Explain why the allocation of records to blocks affects database-system performance significantly.

> Consider the employee database of Figure 2.17. Give an expression in the relational algebra to express each of the following queries: a. Find the ID and name of each employee who works for “Big Bank”. b. Find the ID, name, and city of residence of each e

> In the variable-length record representation, a null bitmap is used to indicate if an attribute has the null value. a. For variable-length fields, if the value is null, what would be stored in the offset and length fields? b. In some applications, tuples ha

> Suppose you have data that should not be lost on disk failure, and the application is write-intensive. How would you store the data?

> What is scrubbing, in the context of RAID systems, and why is scrubbing important?

> RAID systems typically allow you to replace failed disks without stopping access to the system. Thus, the data in the failed disk must be rebuilt and written to the replacement disk while the system is in operation. Which of the RAID levels yields the le

> Operating systems try to ensure that consecutive blocks of a file are stored on consecutive disk blocks. Why is doing so very important with magnetic disks? If SSDs were used instead, is doing so still important, or is it irrelevant? Explain why.

> How does the remapping of bad sectors by disk controllers affect data-retrieval rates?

> List the physical storage media available on the computers you use routinely. Give the speed with which data can be accessed on each medium.

> Given two relations r(A, B, valid time) and s(B, C, valid time), where valid time denotes the valid time interval, write an SQL query to compute the temporal natural join of ... intervals overlap, and the ∗ operator to compute the intersection of two intervals.

> Suggest how predictive mining techniques can be used by a sports team, using your favorite sport as an example.

> The organization of parts, chapters, sections, and subsections in a book is related to clustering. Explain why, and to what form of clustering.

> Suppose half of all the transactions in a clothes shop purchase jeans, and one-third of all transactions in the shop purchase T-shirts. Suppose also that half of the transactions that purchase jeans also purchase T-shirts. Write down all the (nontrivial

> Construct a schema diagram for the bank database of Figure 2.18.

> Consider the star schema from Figure 11.2. Suppose an analyst finds that monthly total sales (sum of the price values of all sales tuples) have decreased, instead of growing, from April 2018 to May 2018. The analyst wishes to check if there are specific it

> Consider each of the takes and teaches relations as a fact table; they do not have an explicit measure attribute, but assume each table has a measure attribute reg_count whose value is always 1. What would the dimension attributes and dimension tables be

> Why is column-oriented storage potentially advantageous in a database system that supports a data warehouse?

> Explain how multiple operations can be executed on a stream using a publish-subscribe system such as Apache Kafka.

> Suppose a stream can deliver tuples out of order with respect to tuple timestamps. What extra information should the stream provide, so a stream query processing system can decide when all tuples in a window have been seen?

> Fill in the blanks below to complete the following Apache Spark program which computes the number of occurrences of each word in a file. For simplicity we assume that words only occur in lowercase, and there are no punctuation marks. JavaRDD textFile =

> Although SQL does not support functional dependency constraints, if the database system supports constraints on materialized views, and materialized views are maintained immediately, it is possible to enforce functional dependency constraints in SQL. Giv

> Suppose your company has built a database application that runs on a centralized database, but even with a high-end computer and appropriate indices created on the data, the system is not able to handle the transaction load, leading to slow processing of

> One of the characteristics of Big Data is the variety of data. Explain why this characteristic has resulted in the need for languages other than SQL for processing Big Data.

> Give four ways in which information in web logs pertaining to the web pages visited by a user can be used by the web site.

> Consider the bank database of Figure 2.18. Assume that branch names and customer names uniquely identify branches and customers, but loans and accounts can be associated with more than one customer. a. What are the appropriate primary keys? b. Given your

> What is multifactor authentication? How does it help safeguard against stolen passwords?

> a. What is an XSS attack? b. How can the referrer field be used to detect some XSS attacks?

> Many web sites today provide rich user interfaces using Ajax. List two features each of which reveals if a site uses Ajax, without having to look at the source code. Using the above features, find three sites which use Ajax; you can view the HTML source o

> Explain the terms CRUD and REST.

> Write pseudocode to manage a connection pool. Your pseudocode must include a function to create a pool (providing a database connection string, database user name, and password as parameters), a function to request a connection from the pool, a connect

> Normalize the following schema, with given constraints, to 4NF.

> What is an SQL injection attack? Explain how it works and what precautions must be taken to prevent SQL injection attacks.

> Explain why 4NF is a normal form more desirable than BCNF.

> Given a relational schema r (A, B, C, D), does A →→ BC logically imply A →→ B and A →→ C? If yes prove it, or else give a counter example.

> Give a lossless, dependency-preserving decomposition into 3NF of schema R of Exercise 7.1.

> Given the three goals of relational database design, is there any reason to design a database schema that is in 2NF, but is in no higher-order normal form? (See Exercise 7.19 for the definition of 2NF.)

> Write a servlet that authenticates a user (based on user names and passwords stored in a database relation) and sets a session variable called userid after authentication.

> In designing a relational database, why might we choose a non-BCNF design?

> List the three design goals for relational databases, and explain why each is desirable.

> Show that every schema consisting of exactly two attributes must be in BCNF regardless of the given set F of functional dependencies.

> Although the BCNF algorithm ensures that the resulting decomposition is lossless, it is possible to have a schema and a decomposition that was not generated by the algorithm that is in BCNF, and is not lossless. Give an example of such a schema and its

> Consider the schema R = (A, B, C, D, E, G, H) and the set F of functional dependencies: Use the 3NF decomposition algorithm to generate a 3NF decomposition of R, and show your work. This means: a. A list of all candidate keys b. A canonical cover for

> Consider the schema R = (A, B, C, D, E, G) and the set F of functional dependencies: Use the 3NF decomposition algorithm to generate a 3NF decomposition of R, and show your work. This means: a. A list of all candidate keys b. A canonical cover for F,

> Explain why NoSQL systems emerged in the 2000s, and briefly contrast their features with traditional database systems.

> Consider the schema R = (A, B, C, D, E, G) and the set F of functional dependencies: a. Find a nontrivial functional dependency containing no extraneous attributes that is logically implied by the above three dependencies and explain how you foun

> Consider the schema R = (A, B, C, D, E, G) and the set F of functional dependencies: R is not in BCNF for many reasons, one of which arises from the functional dependency AB → CD. Explain why AB → CD shows that R is no

> Consider the following set F of functional dependencies on the relation schema (A, B, C, D, E, G): a. Compute B+. b. Prove (using Armstrong’s axioms) that AG is a superkey. c. Compute a canonical cover for this set of functional de

> Write a servlet and associated HTML code for the following simple application: A user is allowed to submit a form containing a number, say n, and should get a response saying how many times the value n has been submitted previously. The number of times e

> Give a lossless decomposition into BCNF of schema R of Exercise 7.1.

> Design a database for an automobile company to provide to its dealers to assist them in maintaining customer records and dealer inventory and to assist sales staff in ordering cars. Each vehicle is identified by a vehicle identification number (VIN). Each i

> Consider the E-R diagram in Figure 6.30, which models an online bookstore. a. Suppose the bookstore adds Blu-ray discs and downloadable video to its collection. The same item may be present in one or both formats, with differing prices. Draw the part of

> Construct appropriate relation schemas for each of the E-R diagrams in: a. Exercise 6.1. b. Exercise 6.2. c. Exercise 6.3. d. Exercise 6.15.

> We can convert any weak entity set to a strong entity set by simply adding appropriate attributes. Why, then, do we have weak entity sets?

> Consider two entity sets A and B that both have the attribute X (among others whose names are not relevant to this question). a. If the two Xs are completely unrelated, how should the design be improved? b. If the two Xs represent the same property and

> Explain the difference between a weak and a strong entity set.

> List two features that were developed in the 2000s and that help database systems handle data-analytics workloads.

> Extend the E-R diagram of Exercise 6.3 to track the same information for all teams in a league.

> Construct an E-R diagram for a hospital with a set of patients and a set of medical doctors. Associate with each patient a log of the various tests and examinations conducted.

> Explain what a challenge–response system for authentication is. Why is it more secure than a traditional password-based system?

> Explain the distinction between total and partial constraints.

> Explain the distinction between disjoint and overlapping constraints.

> Design a generalization–specialization hierarchy for a motor vehicle sales company. The company sells motorcycles, passenger cars, vans, and buses. Justify your placement of attributes at each level of the hierarchy. Explain why they s

> In Section 6.9.4, we represented a ternary relationship (repeated in Figure 6.29a) using binary relationships, as shown in Figure 6.29b. Consider the alternative shown in Figure 6.29c. Discuss the relative merits of these two alternative representations

> Design a database for an airline. The database must keep track of customers and their reservations, flights and their status, seat assignments on individual flights, and the schedule and routing of future flights. Your design should include an E-R diagram,

> Design a database for a worldwide package delivery company (e.g., DHL or FedEx). The database must be able to keep track of customers who ship items and customers who receive items; some customers may do both. Each package must be identifi

> Explain the distinctions among the terms primary key, candidate key, and superkey.

> The execution of a trigger can cause another action to be triggered. Most database systems place a limit on how deep the nesting can be. Explain why they might place such a limit.

> Explain the difference between two-tier and three-tier application architectures. Which is better suited for web applications? Why?

> Suppose there are two relations r and s, such that the foreign key B of r references the primary key A of s. Describe how the trigger mechanism can be used to implement the on delete cascade option when a tuple is deleted from s.

> Hackers may be able to fool you into believing that their web site is actually a web site (such as a bank or credit card web site) that you trust. This may be done by misleading email, or even by breaking into the network infrastructure and rerouting net

> Redo Exercise 5.12 using the language of your database system for coding stored procedures and functions. Note that you are likely to have to consult the online documentation for your system as a reference, since most systems use syntax di

> Consider the relational schema from Exercise 5.16. Write a JDBC function using nonrecursive SQL to find the total cost of part “P-100”, including the costs of all its subparts. Be sure to take into account the fact that a part may have multiple occurrenc
