GUJARAT TECHNOLOGICAL UNIVERSITY
BE – SEMESTER - VI EXAMINATION – WINTER 2022
Subject Code : 3163209
Date : 19/12/2022
Subject Name : Data Mining and Business Intelligence
Total Marks : 70

Q : 1

Q.1 (a) Define: Information Gain, Gain Ratio & Gini Index. [03]

Information Gain, Gain Ratio, and Gini Index are attribute selection measures used in decision tree algorithms to evaluate the quality of a split at a node. Each one measures how well a feature separates the data into distinct classes.

1. Information Gain (IG): Information gain measures the reduction in uncertainty (entropy) about the target class after splitting the data on a particular feature. Entropy, a concept borrowed from information theory, is the average amount of information needed to identify the class of a data point. A higher information gain means the feature produces purer partitions and therefore a more informative split.

2. Gain Ratio (GR): Gain ratio is a normalized form of information gain that corrects its bias towards features with many possible values. It divides the information gain by a "split information" term, which grows with the number of branches a split produces. This penalty discourages the algorithm from favoring features with numerous categories that do not necessarily provide the most relevant separation. Gain ratio is generally preferred when features have a high number of possible values.

3. Gini Index (GI): The Gini index measures the impurity (lack of uniformity) of a set of data points with respect to their class labels. It is the probability that a randomly picked data point would be classified incorrectly if it were labeled at random according to the class distribution of the set. A lower Gini index after a split indicates more homogeneous partitions, meaning the chosen feature separates the classes well.
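The three measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not a full decision tree learner: the function names and the toy `outlook`, `record_id`, and `play` lists are hypothetical and chosen only to show how a many-valued feature is penalized by gain ratio.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini = 1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the class labels by the feature value of each data point."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - after

def gain_ratio(values, labels):
    """Information gain divided by split information.

    Split information is the entropy of the feature's own value distribution.
    """
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_split(values, labels):
    """Weighted Gini index of the partitions produced by the split."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))

# Hypothetical toy data (assumed for illustration only).
play = ["no", "no", "yes", "yes"]            # target class
outlook = ["sunny", "sunny", "rain", "rain"]  # two-valued feature
record_id = ["a", "b", "c", "d"]              # unique value per row

for name, feature in [("outlook", outlook), ("record_id", record_id)]:
    print(name,
          round(information_gain(feature, play), 3),
          round(gain_ratio(feature, play), 3),
          round(gini_split(feature, play), 3))
```

On this toy data both features separate the classes perfectly and have the same information gain, but the ID-like feature has a split information of 2.0 instead of 1.0, so its gain ratio is halved.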
Q.1 (b) Explain Rule-based Classification in brief. [04]

Rule-based classification is a data mining method in which the classification model consists of a set of "if-then" rules. These rules assign a class label to a data instance based on its attribute values. The rules are typically derived from the training data and are designed to capture patterns or relationships between the attributes and the target class.

Key Characteristics of Rule-based Classification

1. If-Then Rules: The classification model is made up of simple "if-then" rules. For example: "If (Age > 30) and (Income > 50K) then (Class = High-Spender)."

2. Condition and Result: Each rule has two parts:
   Antecedent (Condition): The "if" part, which specifies a condition or a conjunction of conditions on the attributes.
   Consequent (Result): The "then" part, which specifies the class label assigned to instances that satisfy the antecedent.

3. Rule Construction: Rules can be constructed in different ways:
   Directly from Data: Using sequential covering algorithms such as RIPPER and CN2.
   From Decision Trees: Converting each path from the root to a leaf of a decision tree into a rule.

4. Coverage and Accuracy: Important metrics for evaluating a rule:
   Coverage: The proportion of instances in the dataset that satisfy the antecedent of the rule.
   Accuracy: The proportion of instances covered by the rule that are correctly classified.

5. Rule Pruning: To prevent overfitting and improve generalization, rules can be pruned by removing conditions that do not contribute significantly to the rule's accuracy.

6. Conflict Resolution: When multiple rules apply to an instance, the final class label is determined by a strategy such as rule ordering (rules are prioritized and the highest-priority matching rule is used) or voting (the outputs of all matching rules are combined).
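The characteristics above can be illustrated with a minimal Python sketch of an ordered rule list. The toy `records`, the second rule, and the `DEFAULT` class used when no rule fires are assumptions made for illustration; only the first rule comes from the example in the text.

```python
# Hypothetical toy dataset (income in thousands); values are assumed.
records = [
    {"age": 45, "income": 80, "label": "High-Spender"},
    {"age": 35, "income": 60, "label": "High-Spender"},
    {"age": 50, "income": 55, "label": "Low-Spender"},
    {"age": 25, "income": 70, "label": "Low-Spender"},
    {"age": 22, "income": 20, "label": "Low-Spender"},
]

# Each rule is (antecedent, consequent). List order encodes rule priority.
rules = [
    (lambda r: r["age"] > 30 and r["income"] > 50, "High-Spender"),
    (lambda r: r["income"] <= 50, "Low-Spender"),  # assumed second rule
]
DEFAULT = "Low-Spender"  # assumed fallback class when no rule applies

def coverage(rule, data):
    """Fraction of all instances that satisfy the rule's antecedent."""
    condition, _ = rule
    return sum(1 for r in data if condition(r)) / len(data)

def accuracy(rule, data):
    """Fraction of covered instances whose label matches the consequent."""
    condition, label = rule
    covered = [r for r in data if condition(r)]
    if not covered:
        return 0.0
    return sum(1 for r in covered if r["label"] == label) / len(covered)

def classify(record, rules, default):
    """Rule ordering: the first rule whose antecedent matches decides."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

print(coverage(rules[0], records))   # share of records covered by rule 1
print(accuracy(rules[0], records))   # share of those labeled correctly
print(classify({"age": 40, "income": 90}, rules, DEFAULT))
```

Here the first rule covers three of the five records and labels two of them correctly, which shows that a rule can have reasonable coverage while still misclassifying some of the instances it covers.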
Advantages of Rule-based Classification

1. Interpretability: The if-then rules are easy to understand and interpret, making the model transparent to users.
2. Flexibility: Rules can be easily modified, added, or removed without affecting the entire model.
3. Incremental Learning: New rules can be incorporated without re-training the entire model from scratch.

Disadvantages of Rule-based Classification

1. Scalability: For large datasets with many attributes, the number of potential rules can grow exponentially, making the model complex.
2. Overfitting: There is a risk of overfitting, especially if too many rules are generated to fit the training data perfectly.
3. Conflicts: Managing conflicts between rules can be challenging, particularly when multiple rules apply to the same instance with different class labels.
Q : 2

Q.2 (a) What is meta data repository? Explain. [03]

A metadata repository is a centralized database, or a collection of databases, that stores metadata, which is data about data. Metadata repositories are critical in data management, data governance, and business intelligence because they provide a way to manage, organize, and retrieve metadata efficiently.

Key Functions of a Metadata Repository

1. Centralized Storage: It acts as a single point of storage for all metadata related to the organization's data assets, including metadata for databases, data warehouses, data lakes, and ETL processes.
2. Metadata Management: It provides tools and frameworks for creating, updating, deleting, and versioning metadata.
3. Data Governance: It supports data governance initiatives by providing metadata that helps ensure data quality, consistency, compliance, and security. It enables tracking of data lineage, data ownership, and usage policies.
4. Data Discovery: It allows users to search for and understand data assets based on their metadata, such as schema definitions, data types, data sources, and relationships between data elements.
5. Interoperability and Integration: It enhances interoperability and integration across data systems by providing a common understanding and description of data assets.
6. Impact Analysis: It shows how changes to data elements affect other parts of the data ecosystem. This is crucial for maintaining system integrity and planning changes.

Types of Metadata Stored

A metadata repository typically stores different types of metadata:

1. Technical Metadata: Describes the technical aspects of data, such as data types, structures, constraints, schemas, and data models. Examples include table definitions, column data types, indexes, and relationships between tables.
2. Business Metadata: Provides a business context for the data, including definitions, business rules, and classifications. Examples include data definitions, glossaries, data quality rules, and data usage policies.
3. Operational Metadata: Contains information about the operations and processes involving data, such as ETL processes, data lineage, and data transformations. Examples include process workflows, job schedules, and execution logs.
4. Usage Metadata: Tracks how data is used within the organization, including user access, query logs, and data consumption patterns. Examples include user activity logs, data access reports, and performance metrics.

Benefits of a Metadata Repository

1. Improved Data Quality and Consistency: By providing a centralized view of metadata, organizations can ensure that data definitions and standards are consistently applied across all systems.