Fitchwood Insurance Company, which is involved primarily in the sale of annuity products, would like to design a data mart for its sales and marketing organization. Presently, the OLTP system is a legacy system residing on a shared network drive consisting of approximately 600 different flat files. For the purposes of our case study, you can assume that 30 different flat files are going to be used for the data mart. Some of these flat files are transaction files that change constantly. The OLTP system is shut down overnight on Friday evening beginning at 6 p.m. for backup. During that time, the flat files are copied to another server, an extraction process is run, and the extracts are sent via FTP to a UNIX server. A process is run on the UNIX server to load the extracts into Oracle and rebuild the star schema. For the initial loading of the data mart, all information from the 30 files was extracted and loaded. On a weekly basis, only additions and updates will be included in the extracts.
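The weekly step, in which only additions and updates are included in the extracts, can be illustrated with a minimal sketch. It assumes each flat file can be read as delimited text with a unique key column; the column names (PolicyID, AgentID, FaceValue) and the sample rows are hypothetical and do not come from the case. The case does not say how the extraction process actually detects changes, so comparing last week's copy with this week's copy is only one possible approach.

```python
import csv
import io


def read_snapshot(text, key_field):
    """Parse a delimited flat-file snapshot into {key: row_dict}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key_field]: row for row in reader}


def weekly_delta(previous, current):
    """Return (additions, updates) between two snapshots keyed by the same field."""
    # Rows whose key did not exist last week are additions.
    additions = [row for key, row in current.items() if key not in previous]
    # Rows whose key existed but whose contents changed are updates.
    updates = [row for key, row in current.items()
               if key in previous and row != previous[key]]
    return additions, updates


# Hypothetical copies of one flat file taken on two consecutive Fridays.
last_week = "PolicyID,AgentID,FaceValue\nP1,A1,100000\nP2,A2,50000\n"
this_week = ("PolicyID,AgentID,FaceValue\n"
             "P1,A1,100000\nP2,A2,75000\nP3,A1,20000\n")

additions, updates = weekly_delta(read_snapshot(last_week, "PolicyID"),
                                  read_snapshot(this_week, "PolicyID"))
```

Deleted rows are deliberately ignored here because the case mentions only additions and updates; a real extract would also have to decide how deletions are handled.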
Although the data contained in the OLTP system are broad, the sales and marketing organization would like to focus on the sales data only. Sales and marketing is interested in viewing all sales data by territory, effective date, type of policy, and face value. In addition, the data mart should be able to provide reporting by individual agent on sales as well as commissions earned. Occasionally, the sales territories are revised (i.e., zip codes are added or deleted). The Last Redistrict attribute of the Territory table is used to store the date of the last revision. Some sample queries and reports are listed here:
• Total sales per month by territory, by type of policy.
• Total sales per quarter by territory, by type of policy.
• Total sales per month by agent, by type of policy.
• Total sales per month by agent, by zip code.
• Total face value of policies by month of effective date.
• Total face value of policies by month of effective date, by agent.
• Total face value of policies by quarter of effective date.
• Total number of policies in force, by agent.
• Total number of policies not in force, by agent.
• Total face value of all policies sold by an individual agent.
• Total initial commission paid on all policies to an agent.
• Total initial commission paid on policies sold in a given month by agent.
• Total commissions earned by month, by agent.
• Top-selling agent by territory, by month.
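As an illustration of how the first sample report might be answered from a star schema, here is a small sketch using Python's built-in sqlite3 module. The table and column names (sales_fact, territory_dim, date_dim, policy_type, face_value) and the sample rows are assumptions made for this example, not the design the case calls for. "Total sales" is interpreted here as both a count of policies and the sum of face value, since the case does not define it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table with two dimension tables.
conn.executescript("""
CREATE TABLE territory_dim (
    territory_key  INTEGER PRIMARY KEY,
    territory_name TEXT
);
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE sales_fact (
    policy_no     TEXT,
    territory_key INTEGER REFERENCES territory_dim(territory_key),
    date_key      INTEGER REFERENCES date_dim(date_key),
    policy_type   TEXT,
    face_value    REAL
);
INSERT INTO territory_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO date_dim VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO sales_fact VALUES
    ('P1', 1, 20240101, 'Fixed',    100000),
    ('P2', 1, 20240101, 'Fixed',     50000),
    ('P3', 1, 20240201, 'Variable',  20000),
    ('P4', 2, 20240101, 'Variable',  75000);
""")

# Total sales per month by territory, by type of policy.
rows = conn.execute("""
SELECT d.year, d.month, t.territory_name, f.policy_type,
       COUNT(*)          AS policies_sold,
       SUM(f.face_value) AS total_face_value
FROM sales_fact AS f
JOIN date_dim      AS d ON d.date_key = f.date_key
JOIN territory_dim AS t ON t.territory_key = f.territory_key
GROUP BY d.year, d.month, t.territory_name, f.policy_type
ORDER BY d.year, d.month, t.territory_name, f.policy_type
""").fetchall()
```

The quarterly variants of the report would need only a quarter column in the date dimension and a change to the GROUP BY list.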
Commissions are paid to an agent on the initial sale of a policy. The InitComm field of the Policy table contains the percentage of the face value paid as an initial commission. The Commission field contains the percentage that is paid each month for as long as a policy remains active (in force). Each month, an agent's commission is calculated by summing the commission on each of that agent's policies that is in force.
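The commission rules can be sketched in a few lines of Python. The Policy class, its attribute names, and the sample figures are hypothetical. The sketch also assumes that the monthly Commission percentage, like InitComm, is applied to the policy's face value, which the case does not state explicitly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Policy:
    policy_no: str
    agent_id: str
    face_value: float
    init_comm_pct: float   # InitComm: percent of face value paid at initial sale
    commission_pct: float  # Commission: percent paid each month while in force
    in_force: bool


def initial_commission(policy):
    """One-time commission paid on the initial sale of a policy."""
    return policy.face_value * policy.init_comm_pct / 100


def monthly_commissions(policies):
    """Sum this month's commission per agent over policies that are in force."""
    totals = defaultdict(float)
    for p in policies:
        if p.in_force:
            # Assumption: the monthly percentage applies to face value.
            totals[p.agent_id] += p.face_value * p.commission_pct / 100
    return dict(totals)


# Hypothetical sample data.
policies = [
    Policy("P1", "A1", 100000, 5.0, 0.5, True),
    Policy("P2", "A1", 50000, 4.0, 0.5, False),  # lapsed: earns nothing monthly
    Policy("P3", "A2", 20000, 5.0, 1.0, True),
]

monthly = monthly_commissions(policies)
first_sale = initial_commission(policies[0])
```

Keeping the initial and monthly calculations separate mirrors the two fields in the source data and makes it straightforward to report "total initial commission" and "total commissions earned by month" independently.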
Required:
What types of data pollution/cleansing problems might occur with the Fitchwood OLTP system data?
> Answer the following questions concerning Figure 2-22: a. Where is a unary relationship, what does it mean, and for what reasons might the cardinalities on it be different in other organizations? b. Why is Includes a one-to-many relationship, and why mig
> Based on the table above as well as additional research, write a memo in support of or against the following statement: “Cloud databases will increasingly eliminate the need for data administrators/DBAs in corporations.”
> Visit the Web sites of one or more popular cloud service providers that provide cloud database services. Use the table below to map the features listed on the Web site to the major concepts covered in this chapter. If you are not sure where to start, try
> The average annual revenue per customer for the mail order firm described in Problems and Exercises 12-33 and 12-35 is $100. The organization is planning a data quality improvement program that it hopes will increase the average revenue per customer by 5
> The mail order firm described in Problem and Exercise 12-33 has about 1 million customers. The firm is planning a mass mailing of its spring sales catalog to all of its customers. The unit cost of the mailing (postage and catalog) is $6.00. The error rat
> Black Friday is one of the busiest and most profitable times for online retailers due to the traffic generated by price reductions online. On November 24, 2017, a number of Web sites belonging to major online retailers experienced a disruption of service
> You are now ready to create a proof of concept system for FAME. Create your proof of concept using your technological recommendations (or using the environment that your instructor asks you to use).
> An e-business operates a high-volume catalog sales center. Through the use of clustered servers and mirrored disk drives, the data center has been able to achieve data availability of 99.5 percent. Although this exceeds industry norms, the organization s
> You have been asked to write a brief report on how TQM can be adopted by your organization to improve data quality. Produce a list of reasons why TQM should and should not be adopted, and recommend, with an explanation, an alternative approach to data qu
> Design an interface that would enable the capture of high-quality and error-free data.
> Referring to Problem and Exercise 12-28, rank the four candidates for the position of DBA at Metro Marketing. Again, support your rankings. Data from Problem and Exercise 12-28: Metro Marketers, Inc., wants to build a data warehouse for storing customer
> Referring to Problem and Exercise 12-28, rank the four candidates for the position of data warehouse administrator at Metro Marketing. Again, support your rankings. Data from Problem and Exercise 12-28: Metro Marketers, Inc., wants to build a data wareh
> Metro Marketers, Inc., wants to build a data warehouse for storing customer information that will be used for data marketing purposes. Building the data warehouse will require much more capacity and processing power than it has previously needed, and it
> In light of increasing legislation dictating how an organization is to store data, what would be your requirements for the role of chief data officer?
> The Pine Valley databases for this textbook (one small version illustrated in queries throughout the text and a larger version) are available to your instructor to download from the text’s Web site. Your instructor can make those databases available to y
> Examine the set of activities in Table 12-2 and categorize them as belonging to one of the following categories: people (“who”), process (“how”), and technology (“w
> Any successful data governance program needs to address the people (“who”), process (“how”), and technology (“what”) aspects. Based on your reading this chapter, provide some examples for each of these categories.
> You are now ready to create a proof of concept system for FAME. Provide a document that provides your recommendation on the set of technologies (DBMS, programming language, Web server [if appropriate]) that you believe are best suited for FAME. Ensure
> Read an SAS white paper (www.sas.com/resources/whitepaper/wp_56343.pdf) on the use of telematics in car insurance. If Fitchwood started to use one of these technologies, what consequences would it have for its IT infrastructure needs?
> Fitchwood is a relatively small company (annual premium revenues less than $1 billion per year) that insures slightly more than 500,000 automobiles and about 200,000 homes. For what types of purposes might Fitchwood want to use big data technologies (i.e
> Text mining is an increasingly important subcategory of data mining. Can you identify potential uses of text mining in the context of an insurance company?
> Do you see any opportunities for data mining using the Fitchwood data mart? Research data mining tools and recommend one or two for use with the data mart.
> Suggest some visualization options that Fitchwood managers might want to use to support their decision making.
> Fitchwood management would like to use the data mart for drill-down online reporting. For example, a sales manager might want to view a report of total sales for an agent by month and then drill down into the individual types of policies to see how sales
> Review the white paper that has been used as a source for Figure 10-33. Which of the following tasks is the responsibility of data platform, integrated data warehouse, and integrated discovery platform, respectively? a. Finding new, previously unknown re
> For each scenario listed below, identify the following: the type of business analytics, the era of BI&A, the goal of data mining (if applicable), and whether and how big data and analytics have the potential to bring about change in the listed scenario.
> Consider the customer table created in Figure 10-24 and populated with data as shown in Figure 10-27. Write the Hive script that will display the age-groups that exist in the data set and their average incomes. Data from Figure 10-24: Data from Figure
> Use the Internet to browse the features and offerings of Big Data platforms such as HAVEn and Aster. Prepare a report of your findings.
> You are now ready to create a proof of concept system for FAME. Revisit your deliverable for question 1-52, Chapter 1, and reread the case descriptions in Chapters 1 through 3 with an eye toward identifying the functionality you want to provide to the
> Write two HIVE queries, the first to create a PRODUCT table with fields ProdID, Name, Seller, Price; the second to load data into the table from file ProductInfo.csv. Make all necessary assumptions.
> For each situation presented below, illustrate a document as depicted in Figures 10-4 and 10-5 and specify whether it contains an array, an embedded subdocument, relationships, or collections. Use hypothetical data and make necessary assumptions. a. A do
> Review Figure 10-15 and answer the following questions based on it. a. What has happened between Input and Input’? b. Assume that the values associated with each of the keys (k1, k2, and so forth) are counts. What is the purpose of the
> Figure 10-14 describes a simple Hadoop architecture. If a real-world system is implemented using this approach, it will suffer from a specific weakness. Identify what this weakness is and find out what the latest versions of Hadoop have done to address i
> Assume that the following data regarding Students need to be stored—Name: First Name and Last Name, Roll Number, and Mobile Number. Illustrate with figures how it would be stored in different NoSQL database models.
> Review Figure 10-5 (a). Write a MongoDB query to display all products with review ratings greater than 3 stars and suppress the fields “height” and “width” in the output using the su
> Review Figure 10-3. For each of the formats, identify the elements that are data values and those that are labels describing the data. Data from Figure 10-3:
> Compare the JSON and XML representations of a record in Figure 10-1. What is the primary difference between these? Can you identify any advantages of one compared to the other? Data from Figure 10-1:
> GROUP BY by itself creates subtotals by category, and the ROLLUP extension to GROUP BY creates even more categories for subtotals. Using all the orders, do a rollup to get total order amounts by product, sales region, and month and all combinations, incl
> Because data warehouses and even data marts can become very large, it may be sufficient to work with a subset of data for some analyses. Create a sample of orders from 2004 using the SAMPLE SQL command (which is standard SQL); put a randomized allocation
> Consider the data needs of a small accounting department at a tax services firm. What would some of the data entities be in this setting? List and explain their relevance. Develop a project data model for this firm applying the database design concepts y
> Using the MDIFF “ordered analytical function” in Teradata SQL (see the Functions and Operators manual), show the differences (label the difference CHANGE) in TOTAL from quarter to quarter. Hint: You will likely create a derived table based on your query
> Take the query you scrapped from Problem and Exercise 9-58 and modify it to show only the U.S. region grouped by each quarter, not just for 2005 but for all years available, in order by quarter. Label the total orders by quarter with the heading TOTAL an
> The database you are using was developed by MicroStrategy, a leading business intelligence software vendor. The MicroStrategy software is also available on TUN. Most business intelligence tools generate SQL to retrieve the data they need to produce the r
> Review the metadata file for the db_samwh database and the definitions of the database tables. (You can use SHOW TABLE commands to display the DDL for tables.) Are dimension tables conformed in this data mart? Explain.
> Review the metadata file for the db_samwh database and the definitions of the database tables. (You can use SHOW TABLE commands to display the DDL for tables.) Explain what dimension data, if any, are maintained to support slowly changing dimensions. If
> Review the metadata file for the db_samwh database and the definitions of the database tables. (You can use SHOW TABLE commands to display the DDL for tables.) Explain the methods used in this database for modeling hierarchies. Are hierarchies modeled as
> In the section “Disadvantages of File Processing Systems,” the statement is made that the disadvantages of file processing systems can also be limitations of databases, depending on how an organization manages its databases. First, why do organizations c
> Contrast the following terms: a. data dependence; data independence b. structured data; unstructured data c. metadata; data d. repository; database e. entity; enterprise data model f. data warehouse; data lake g. personal databases; multi-tiered database
> Pine Valley Furniture wants you to help design a data mart for analysis of sales. The subjects of the data mart are as follows: Salesperson: Attributes: SalespersonID, Years with PVFC, SalespersonName, and SupervisorRating. Product: Attributes: ProductID
> A firm wants to reduce fluid drilling costs substantially by increasing drilling fluid efficiency. Research finds that both fluid drilling speed and cost are significantly influenced by Time, Geography, Drilling fluid type, Formation, and Well type. Geog
> A pharmaceutical retail store manages its current sales, procurement and materials availability at the store through Excel sheets. Owing to the increase in the number of branches in the city, the store manager is now finding this process of data maintena
> A university gathers student admission data from three different sources: through forms filled manually at university desks, by registering at the university Web site, or by registering on the department’s Web site. All the three sources have disparate f
> Employees working in IT organizations are assigned different projects for a specific duration, such as a few months or years. The duration is specified by the project start date and end date in the database. The project location is different for each pro
> Table 1-1 shows example metadata for a set of data items. Identify three other columns for these data (i.e., three other metadata characteristics for the listed attributes) and complete the entries of the table in Table 1-1 for these three additional col
> Simplified Automobile Insurance Company would like to add a Claims dimension to its star schema. Attributes of Claim are ClaimID, ClaimDescription, and ClaimType. Attributes of the fact table are now PolicyPremium, Deductible, and MonthlyClaimTotal. a. E
> You are to construct a star schema for Simplified Automobile Insurance Company (for a more realistic example, Kimball, 1996b). The relevant dimensions, dimension attributes, and dimension sizes are as follows: InsuredParty: Attributes: InsuredPartyID and
> A table Student stores the values StudentID, name, date of result, and total marks obtained. A student’s information is: StudentID: S876, Name: Sabcd, Date of result: 22/12/14, and Total marks obtained: 650. An update transaction has changed the date and
> Drilling often accounts for one-third to two-thirds of the total cost in the search for fluid. Advances in drilling technology can reduce these costs substantially. The key point is redesigning the scheme of drilling fluid. A research study identifies th
> The following table shows some simple album and price data as of the date 07/18/2015: The following transactions occur on 07/19/2015: • Album K3 price discounted to $7. • Album K5 is deleted from the file.
> Examine the three tables with student data shown in Figure 9-1. Design a single-table format that will hold all of the data (non-redundantly) that are contained in these three tables. Choose column names that you believe are most appropriate for these da
> Based on the table above as well as additional research, write a memo in support of or against the following statement: “Cloud databases will increasingly eliminate the need for data/database administrators in corporations.”
> Assume that a bank operates multinationally and has millions of financial records of customers in its database. The bank also offers e-banking services to its clients. Based on what you have learned from the book, suggest how they can take regular backups
> Revisit the six issues identified in Problem and Exercise 8-72. What risk, if any, do each of them pose to the firm? Data from Problem and Exercise 8-72: During the Sarbanes-Oxley audit of a financial services company, you note the following issues. Cat
> During the Sarbanes-Oxley audit of a financial services company, you note the following issues. Categorize each of them into the area to which they belong: IT change management, logical access to data, and IT operations. a. Five DBAs have access to the S
> You are the manager of a department in a small logistics company. The current database system being used is hierarchical, and you have been tasked to formulate a team that can create a plan to develop a more efficient database system that is consistent w
> A number of situations have been listed below. For each one, identify the need, if any, to create an index. Justify your answer. If there is indeed a need, suggest a way for the index to be created. a. Banking applications that involve frequent retrieval
> For each of the situations described, decide which technique for data field design listed below would be most appropriate and how it could be applied. • Code lookup table • Default value • Range control • Referential integrity • Handling missing data a
> Fill in the two authorization tables for Pine Valley Furniture Company below based on the following assumptions (enter Y for yes or N for no): • Salespersons, managers, and carpenters may read inventory records but may not perform any o
> Refer to Figure 4-5. For each of the following reports (with sample data), indicate any indexes that you feel would help the report run faster as well as the type of index: a. State, by products (user-specified period) State, by Products Report, January
> Consider the EER diagram for Pine Valley Furniture shown in Figure 3-12. Figure 8-15 looks at a portion of that EER diagram. Let’s make a few assumptions about the average usage of the system: • There are 60,000 custom
> Consider the composite usage map in Figure 8-1. After a period of time, the assumptions for this usage map have changed, as follows: • There is an average of 60 supplies (rather than 40) for each supplier. • Manufactur
> Create an index on the CustomerID column of the Customer_T and Order_T table in Figure 4-4. Data from Figure 4-4:
> Consider the following assumptions: • A music company offers three types of music genres: Jazz, Hip-hop, and Metal (subtypes of the Genre supertype). An “Artist” instances “Records” of these Genres. • There are total of 8,000 songs in company’s database,
> Parallel query processing, as described in this chapter, means that the same query is run on multiple processors and that each processor accesses in parallel a different subset of the database. Another form of parallel query processing, not discussed in
> Suggest an application for each type of file organization. Explain your answer.
> Visit the PHP website (php.net) and investigate how a failure to sanitize database inputs can leave a database exposed to online attack.
> Assume that the most important reports that the organization needs are as follows: • A list of the current developer’s project assignments. • A list of the total costs for all projects. • For each team, a list of its membership history. • For each countr
> Consider Figure 4-35 and your answer to Problem and Exercise 4-44 in Chapter 4. Assume that it is essential that customers who had rented from Vacation Property Rentals earlier can be identified quickly based on their last name–first na
> Specify the format for the Oracle date data type. How does it account for change in century? What is the purpose of ‘TIMESTAMP WITH LOCAL TIMEZONE’? Suppose the system time zone in database in City A = –9:00 and City B = –4:00. A client in City B inserts
> Consider the relations specified in Problem and Exercise 8-53. Assume that the database has been implemented without denormalization. Further assume that the database is global in scope and covers thousands of leagues, tens of thousands of teams, and hun
> Assume that the table BOOKS in a database with the primary key on BookID has more than 25,000 records. A query is frequently executed in which the Publisher of the book appears in the WHERE clause. The Publisher field has more than 100 different values,
> A company offering music services provides a search feature to its users and allows them to mix music (a key feature for disc jockeys), which is supported through parallel processing. All music information is stored in a database management system. a. Wh
> Search the Internet for at least three examples where parallel processing is applied. How was the underlying database prepared for this? What were the advantages of this implementation?
> Use the Internet to search for examples of each type of horizontal partitioning provided by Oracle. Explain your answer.
> Consider the following normalized relations for a sports league: TEAM(TeamID, TeamName, TeamLocation, TeamManager) PLAYER(PlayerID, PlayerFirstName, PlayerLastName, PlayerDateOfBirth, PlayerSpecialtyCode) SPECIALTY(SpecialtyCode, SpecialtyDescription, Sa
> Consider the relations in Problem and Exercise 8-51. Identify possible opportunities for denormalizing these relations as part of the physical design of the database. Which ones would you be most likely to implement? Data from Problem and Exercise 8-51:
> On a smaller scale than in Field Exercise 7-25, investigate the computing architecture of a department within your university. Try to find out how well the current system is meeting the department’s information-processing needs. Data from Exercise 7-25:
> Consider the following set of normalized relations from a database used by a mobile service provider to keep track of its users and advertiser customers. USER(UserID, UserLName, UserFName, UserEmail, UserYearOfBirth, UserCategoryID, UserZip) ADVERTISERCLI
> Consider the following normalized relations from a database in a large retail chain: STORE (StoreID, Region, ManagerID, SquareFeet) EMPLOYEE (EmployeeID, WhereWork, EmployeeName, EmployeeAddress) DEPARTMENT (DepartmentID, ManagerID, SalesGoal) SCHEDULE (
> When students fill out forms for admission to various courses or to write their exams, they leave many missing values. This may lead to issues while compiling data. Can this be handled at the data capture stage? What are the alternate approaches to handl
> In a normalized database, all customer information is stored in a Customer table, invoices are stored in an Invoice table, and related account manager information in an Employee table. Suppose a customer changes their address and then demands old invoice
> Say that you are interested in storing the numeric value 3,456,349.2334. What will be stored with each of the following Oracle data types? a. NUMBER(11) b. NUMBER(11,1) c. NUMBER(11,-2) d. NUMBER(11,6) e. NUMBER(6) f. NUMBER