Download the Latest Professional-Data-Engineer Dumps - 2022 Professional-Data-Engineer Exam Questions
Latest Google Professional-Data-Engineer Certification Practice Test Questions
Data Engineering on Google Cloud course
This 4-day course gives candidates hands-on experience building data processing systems on Google Cloud. It also shows you how to design data processing systems, analyze data, and build end-to-end data pipelines and machine learning models. To get the most out of the course, you should first complete the big data and machine learning course or have equivalent experience. The course also helps you develop applications in a programming language such as Python and covers the following objectives:
- Designing and building data processing systems on the Google Cloud Platform
- Processing batch and streaming data by using autoscaling data pipelines on Cloud Dataflow
- Enabling insights from streaming data
- Leveraging unstructured data using ML APIs on Cloud Dataproc
- Making predictions with machine learning models using TensorFlow and Cloud ML
NEW QUESTION 15
You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.
You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)
- A. HDFS with Hive
- B. Redis
- C. HBase
- D. MySQL
- E. MongoDB
- F. Cassandra
Answer: C,E,F
NEW QUESTION 16
You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?
- A. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
- B. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
- C. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
- D. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
Answer: B
NEW QUESTION 17
Which of the following are feature engineering techniques? (Select 2 answers)
- A. Feature prioritization
- B. Bucketization of a continuous feature
- C. Crossed feature columns
- D. Hidden feature layers
Answer: B,C
Explanation:
Selecting and crafting the right set of feature columns is key to learning an effective model.
Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into.
Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
Reference:
https://www.tensorflow.org/tutorials/wide#selecting_and_engineering_features_for_the_model
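The two techniques named in the explanation above can be illustrated in plain Python. This is a minimal sketch, not the TensorFlow feature-column API: the helper names `bucketize` and `cross`, the bucket edges, and the sample values are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical bucket edges for an "age" feature (an assumption for illustration).
AGE_BOUNDARIES = [18, 25, 30, 40, 65]


def bucketize(value, boundaries):
    """Return the ID of the bucket that `value` falls into.

    With sorted boundaries [b0, b1, ...], bucket 0 covers values below b0,
    bucket 1 covers [b0, b1), and the last bucket covers values >= the
    final boundary.
    """
    return bisect_right(boundaries, value)


def cross(*feature_values):
    """Combine several categorical values into one crossed feature value."""
    return "_x_".join(str(v) for v in feature_values)


# Turn a continuous age into a categorical bucket ID, then cross it with
# another categorical feature so a model can learn about the combination.
age_bucket = bucketize(27, AGE_BOUNDARIES)
crossed = cross(age_bucket, "teacher")
```

Here the continuous age becomes a bucket ID, and the crossed value gives a model one feature per combination of age bucket and occupation instead of treating the two features independently.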
NEW QUESTION 19
You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run, you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?
- A. Use the LIMIT keyword to reduce the number of rows returned.
- B. Create a separate table for each ID.
- C. Recreate the table with a partitioning column and clustering column.
- D. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
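Answer: C
Explanation:
Partitioning the table on the timestamp column and clustering it on the ID column lets BigQuery skip data that the WHERE clause excludes, so the existing queries scan less data without being rewritten. On a table that is not clustered, LIMIT restricts only the rows returned, not the bytes scanned, and --maximum_bytes_billed makes a query fail when it would exceed the limit rather than making it scan less.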
NEW QUESTION 20
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- B. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
- C. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
- D. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
Answer: B
NEW QUESTION 21
Which of these are examples of a value in a sparse vector? (Select 2 answers.)
- A. [0, 0, 0, 1, 0, 0, 1]
- B. [0, 1]
- C. [1, 0, 0, 0, 0, 0, 0]
- D. [0, 5, 0, 0, 0, 0]
Answer: B,C
Explanation:
Categorical features in linear models are typically translated into a sparse vector in which each possible value has a corresponding index or id. For example, if there are only three possible eye colors you can represent 'eye_color' as a length 3 vector: 'brown' would become [1, 0, 0], 'blue' would become [0, 1, 0] and 'green' would become [0, 0, 1]. These vectors are called "sparse" because they may be very long, with many zeros, when the set of possible values is very large (such as all English words).
[0, 0, 0, 1, 0, 0, 1] is not a sparse vector because it has two 1s in it. A sparse vector contains only a single 1.
[0, 5, 0, 0, 0, 0] is not a sparse vector because it has a 5 in it. Sparse vectors only contain 0s and 1s.
Reference: https://www.tensorflow.org/tutorials/linear#feature_columns_and_transformations
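The encoding described in the explanation above is commonly called one-hot encoding. The following is a minimal sketch of it in plain Python. The helper names `one_hot` and `is_one_hot` are hypothetical, and the check follows the definition used in this explanation (only 0s and 1s, with exactly one 1), not a general definition of sparse data.

```python
EYE_COLORS = ["brown", "blue", "green"]  # vocabulary taken from the explanation


def one_hot(value, vocabulary):
    """Encode `value` as a vector with a single 1 at its vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector


def is_one_hot(vector):
    """Check the definition used above: only 0s and 1s, and exactly one 1."""
    return all(x in (0, 1) for x in vector) and sum(vector) == 1


# Map every eye color to its vector, for example 'blue' to [0, 1, 0].
encoded = {color: one_hot(color, EYE_COLORS) for color in EYE_COLORS}
```

Applying `is_one_hot` to the four options in the question reproduces the reasoning in the explanation: the vectors with two 1s or with a 5 fail the check, and the other two pass.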
NEW QUESTION 22
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?
- A. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- B. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
- C. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
- D. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
Answer: A
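Option A describes a sliding window of 1 hour that advances every 5 minutes. The sketch below shows that windowing logic in plain Python rather than in Cloud Dataflow, to make the moving-average idea concrete. The function names and the assumption that timestamps are seconds counted from 0 are hypothetical.

```python
from bisect import bisect_left


def sliding_window_rates(timestamps, window_s=3600, period_s=300):
    """Return (window_end, average messages per second) for each window.

    `timestamps` holds message arrival times in seconds, counted from 0
    (an assumption of this sketch). Each window covers
    [window_end - window_s, window_end) and a new window ends every
    `period_s` seconds.
    """
    timestamps = sorted(timestamps)
    if not timestamps:
        return []
    results = []
    end = window_s
    last = timestamps[-1]
    # Keep emitting windows until a window end passes the last message.
    while end <= last + period_s:
        lo = bisect_left(timestamps, end - window_s)
        hi = bisect_left(timestamps, end)
        results.append((end, (hi - lo) / window_s))
        end += period_s
    return results


def alerts(rates, threshold=4000):
    """Return the window ends whose average rate is below the threshold."""
    return [end for end, rate in rates if rate < threshold]
```

Because the windows overlap, a drop in traffic shows up in the average within one 5-minute period, whereas a fixed 1-hour window would report it only once per hour.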
NEW QUESTION 23
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics.
Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded.
The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?
- A. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
- B. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
- C. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
- D. Add capacity (memory and disk space) to the database server by the order of 200.
Answer: A
NEW QUESTION 24
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
* No interaction by the user on the site for 1 hour
* Has added more than $30 worth of products to the basket
* Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a sliding time window with a duration of 60 minutes.
- B. Use a session window with a gap time duration of 60 minutes.
- C. Use a global window with a time based trigger with a delay of 60 minutes.
- D. Use a fixed-time window with a duration of 60 minutes.
Answer: B
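The first rule, no interaction for 1 hour, is what a session window with a 60-minute gap captures. The sketch below is plain Python, not the Cloud Dataflow API, and shows how a gap duration splits one user's events into sessions. The function names and the per-session fields are hypothetical.

```python
def sessions(event_times, gap_s=3600):
    """Group one user's event timestamps (in seconds) into sessions.

    An event joins the current session when it arrives less than `gap_s`
    seconds after the previous event. Otherwise it starts a new session.
    """
    result = []
    for t in sorted(event_times):
        if result and t - result[-1][-1] < gap_s:
            result[-1].append(t)
        else:
            result.append([t])
    return result


def should_send_message(basket_value, completed_transaction):
    """Apply the remaining two rules once a session has closed."""
    return basket_value > 30 and not completed_transaction
```

A session closes only after the user has been inactive for the full gap, so evaluating the basket rules when a session ends matches the condition in the question.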
NEW QUESTION 25
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems.
Which solution should you choose?
- A. Cloud Speech-to-Text API
- B. Cloud Natural Language API
- C. Cloud AutoML Natural Language
- D. Dialogflow Enterprise Edition
Answer: D
NEW QUESTION 26
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)
- A. You expect future mutations to have different features from the mutated samples in the database.
- B. You already have labels for which samples are mutated and which are normal in the database.
- C. There are roughly equal occurrences of both normal and mutated samples in the database.
- D. You expect future mutations to have similar features to the mutated samples in the database.
- E. There are very few occurrences of mutations relative to normal samples.
Answer: D,E
Explanation:
Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set.
Reference: https://en.wikipedia.org/wiki/Anomaly_detection
NEW QUESTION 27
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?
- A. Dropout Methods
- B. Dimensionality Reduction
- C. Serialization
- D. Threading
Answer: A
Explanation:
Reference: https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505541d877
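A model that fits the training data well but performs poorly on new data is overfitting, and dropout is a regularization technique aimed at that problem. The sketch below shows the common "inverted dropout" variant in plain Python. It is an illustration under stated assumptions, not TensorFlow's implementation, and the function name and parameters are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout over a list of activations.

    During training, each activation is zeroed with probability `rate`
    and the survivors are scaled by 1 / (1 - rate) so the expected sum
    stays the same. At inference time the input passes through unchanged.
    `rate` must be smaller than 1.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Fixed seed so the sketch is repeatable (an arbitrary choice).
rng = random.Random(42)
sample = dropout([1.0] * 8, rate=0.5, rng=rng)
```

Randomly silencing units during training keeps the network from relying on any single neuron, which tends to reduce the gap between training and test performance.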
NEW QUESTION 28
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
- A. Trigger based on time
- B. Trigger that is a combination of other triggers
- C. Trigger based on element count
- D. Trigger based on element size in bytes
Answer: D
Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
NEW QUESTION 29
An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?
- A. Cloud Datastore
- B. BigQuery
- C. Cloud SQL
- D. Cloud Bigtable
Answer: C
Explanation:
Reference: https://cloud.google.com/solutions/business-intelligence/
NEW QUESTION 30
A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions.
You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL 'dataset.model', table user_features). How should you create the ML pipeline?
- A. Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.
- B. Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.
- C. Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account.
- D. Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.
Answer: D
NEW QUESTION 32
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solution should you choose?
- A. Cloud Speech-to-Text API
- B. Cloud Natural Language API
- C. Dialogflow Enterprise Edition
- D. Cloud AutoML Natural Language
Answer: C
NEW QUESTION 33
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data.
What should you do?
- A. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
- B. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
- C. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
- D. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
Answer: C
NEW QUESTION 34
Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
- A. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
- B. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
- C. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
- D. Introduce data compression for each file to increase the rate of file transfer.
- E. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
Answer: A,B
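The figures in the question can be checked with a short calculation that shows why the design struggles even though bandwidth utilization is low. The sketch below assumes that each file costs at least one 200 ms round trip when files are sent one after another. That per-file cost is an assumption for illustration (a real SFTP transfer may need more round trips), and the function names are hypothetical.

```python
FILES_PER_HOUR = 20_000
FILE_SIZE_BYTES = 4 * 1024     # upper bound: each file is less than 4 KB
BANDWIDTH_BPS = 50_000_000     # 50 Mbps
ROUND_TRIP_S = 0.2             # 200 ms latency to Google Cloud


def bandwidth_utilization(files_per_hour):
    """Fraction of the link used if only payload bytes mattered."""
    bits_per_hour = files_per_hour * FILE_SIZE_BYTES * 8
    return bits_per_hour / (BANDWIDTH_BPS * 3600)


def sequential_latency_seconds(transfers_per_hour, round_trips_per_transfer=1):
    """Seconds spent waiting on latency when transfers run one at a time."""
    return transfers_per_hour * round_trips_per_transfer * ROUND_TRIP_S


utilization = bandwidth_utilization(FILES_PER_HOUR)             # well under 1%
waiting = sequential_latency_seconds(FILES_PER_HOUR)            # about 4000 s per hour of files
batched = sequential_latency_seconds(FILES_PER_HOUR / 1_000)    # about 4 s with 1,000 files per transfer
```

Under these assumptions the payload uses a fraction of a percent of the link, while sequential per-file latency alone exceeds the 3600 seconds available in an hour. Latency, not bandwidth, is therefore the constraint, and it is reduced by sending files in parallel or by sending fewer, larger transfers.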
