An application can easily create a DataFrame from a SparkSession. All data that is sent over the network, written to disk, or kept in memory must be serialized.
Using Accumulators: Accumulators help update the values of variables in parallel while a job is executing.
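To make the idea concrete, here is a minimal pure-Python stand-in for Spark's accumulator semantics (the `Accumulator` class and the `blank_lines` counter are illustrative, not PySpark's API): tasks only add to the accumulator as a side effect, and only the driver reads the final value.

```python
# A minimal pure-Python stand-in for Spark's accumulator semantics:
# worker tasks only *add* to the accumulator; only the driver reads it.
class Accumulator:
    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        # called from tasks as a side effect of processing records
        self._value += amount

    @property
    def value(self):
        # in Spark, only the driver should read the accumulated value
        return self._value

blank_lines = Accumulator(0)

def process(line):
    if not line.strip():
        blank_lines.add(1)   # count blank lines seen while transforming
    return line.upper()

data = ["spark", "", "accumulators", ""]
result = [process(line) for line in data]
print(blank_lines.value)  # 2
```

In real PySpark the counter would be created with `sc.accumulator(0)` and worker tasks would call `.add()` inside transformations; reading `.value` on a worker is not supported.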
A transformation is an operation that takes an RDD as input and produces one or more new RDDs as output.
Spark need not be separately installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster. GraphX's built-in algorithms simply have to be called as methods on the Graph class, so they can be reused rather than reimplemented.
55) What makes Apache Spark good at low-latency workloads like graph processing and machine learning? Spark keeps data in memory, so iterative algorithms can reuse cached intermediate results instead of re-reading them from disk. MLlib is a distributed machine learning framework on top of Spark. A SchemaRDD is an RDD of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column. The best way to compute an average over an RDD is to first sum the values and then divide by the count, as shown below.
Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
The coalesce method can only be used to decrease the number of partitions.

def sum(x, y): return x + y
total = ProjectPrordd.reduce(sum)
avg = total / ProjectPrordd.count()

However, the code above could overflow if the total becomes very big. Apache Spark automatically persists the intermediary data from various shuffle operations; even so, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark on Mesos can allocate resources in either fine-grained or coarse-grained mode. PySpark supports custom serializers, among them MarshalSerializer, which is faster than the PickleSerializer but supports fewer datatypes.
The three cluster managers supported in Apache Spark are Standalone, Apache Mesos, and Hadoop YARN.

11) How can Spark be connected to Apache Mesos? By pointing the Spark driver at the Mesos master URL when configuring the cluster manager.
51) What are the disadvantages of using Apache Spark over Hadoop MapReduce? Spark's heavy reliance on memory makes cost-efficient processing of big data harder, and Spark has no file management system of its own.

Coalesce in Spark is a method used to reduce the number of partitions in a DataFrame.
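As a sketch of the idea, the following plain-Python function models coalesce with lists standing in for partitions (this mirrors the behaviour, not Spark's actual partition-assignment algorithm): whole existing partitions are merged into fewer ones without a full shuffle, and the partition count can only decrease.

```python
# Pure-Python sketch of coalesce: a dataset is modelled as a list of
# partitions (each partition is itself a list of records).
def coalesce(partitions, num_partitions):
    if num_partitions >= len(partitions):
        return partitions  # coalesce never increases the partition count
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # merge whole partitions into target buckets; no record-level shuffle
        merged[i % num_partitions].extend(part)
    return merged

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))  # [[1, 2, 4, 5], [3, 6]]
```

In PySpark the equivalent calls are `rdd.coalesce(n)` or `df.coalesce(n)`; use `repartition(n)` when the partition count must grow, at the cost of a full shuffle.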
Parameters set explicitly on a SparkConf object take precedence over the corresponding system properties.

56) Is it necessary to start Hadoop to run an Apache Spark application? No, it is not necessary: Spark can run on its standalone cluster manager and read data from the local file system, so Hadoop is not required.

Parquet is a columnar file format; its layout helps Spark read only the columns a query needs, reducing I/O.
61) Suppose there is an RDD named ProjectPrordd that contains a huge list of numbers, and the following Spark code is written to calculate the average:

def ProjectProAvg(x, y): return (x + y) / 2.0
avg = ProjectPrordd.reduce(ProjectProAvg)

What is wrong with the above code, and how will you correct it? Averaging is not an associative operation, so reducing with a pairwise average does not yield the mean of all the values; the correct approach is to sum the values first and then divide by the count.

Coalesce is ideally used where one wants to store the same data in a smaller number of files. A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represent a stream of data. Resilient: if a node holding a partition fails, another node can recompute the data from lineage. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data.
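The bug can be demonstrated with plain Python, using functools.reduce in place of RDD.reduce over a small list:

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Buggy version: reduce applies the function pairwise, so repeatedly
# averaging partial results does not produce the mean of all numbers.
def project_pro_avg(x, y):
    return (x + y) / 2.0

wrong = reduce(project_pro_avg, numbers)
# ((1 + 2)/2 + 3)/2 -> 2.25, then (2.25 + 4)/2 -> 3.125

# Correct version: sum first (an associative operation), divide once.
total = reduce(lambda x, y: x + y, numbers)
avg = total / len(numbers)  # 10 / 4 = 2.5

print(wrong, avg)  # 3.125 2.5
```

The same fix applies on an RDD: `rdd.reduce(lambda x, y: x + y) / rdd.count()`, keeping in mind the overflow caveat mentioned above for very large totals.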
Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
Spark uses Akka for messaging between the workers and masters. flatMap() can give a result which contains redundant data in some columns.

33) Which one will you choose for a project: Hadoop MapReduce or Apache Spark? It depends on the project: Spark suits iterative, interactive, and low-latency workloads, while MapReduce remains adequate for simple batch processing.

Global temp views are tied to a system database and can only be created and accessed using the qualified name global_temp. An action in a Spark RDD is the way results are sent from the executors back to the driver. Persistence improves the performance of iterative algorithms drastically. YARN is Hadoop's cluster resource management system.
RDDs are fault-tolerant and immutable.

43) How can you launch Spark jobs inside Hadoop MapReduce? Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring admin rights.

The transform function in Spark Streaming allows developers to apply Apache Spark transformations to the underlying RDDs of the stream.
Spark performs shuffling to repartition the data across different executors or across different machines in a cluster.
The log output for each job is written to the work directory of the slave nodes.
____ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes.
Receivers are usually created by streaming contexts as long-running tasks on various executors, scheduled to operate in a round-robin manner with each receiver taking a single core. Broadcast variables are read-only variables, present in an in-memory cache on every machine. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream. Learning Pig and Hive syntax takes time. Shuffling, by default, does not change the number of partitions, only the content within the partitions.
Documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data; in MongoDB this is known as a dynamic schema.
YARN is a technology, part of the Hadoop framework, that handles resource management and scheduling of jobs. Hadoop MapReduce supported the need to process big data fast, but developers always wanted more flexible tools to keep up with the market for midsize big data sets and real-time data processing within seconds. MongoDB is written in C++, and NoSQL databases allow storing non-structured data. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Running Spark on Mesos renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks. Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task. Spark SQL makes Spark comparatively easier to use than Hadoop for SQL users. All transformations are followed by actions. RDDs help achieve fault tolerance through lineage. Output operations are those that write data to an external system.
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights. Spark SQL automatically infers the schema, whereas in Hive the schema needs to be explicitly declared. Caching in Spark Streaming can be controlled through the persistence settings on DStreams. Users can also create their own scalar and aggregate functions. Temp views in Spark SQL are tied to the Spark session that created the view, and will no longer be available upon termination of that Spark session.
map() is an element-wise transformation, whereas transform() in Spark Streaming is an RDD-to-RDD transformation that operates on the whole underlying RDD of a DStream.
3) List some use cases where Spark outperforms Hadoop in processing. Typical examples are real-time stream processing, iterative machine learning, and interactive analytics. In Spark, the map() transformation is applied to each row in a dataset to return a new dataset.
Sharding a database across many server instances can be achieved with _. Simplicity, flexibility, and performance are the major advantages of using Spark over Hadoop. Spark was initially started by ____ at UC Berkeley AMPLab in 2009.
Spark's standalone mode is designed so that it has masters and workers, which are configured with a certain amount of allocated memory and CPU cores. An example is to find the mean of all values in a column. Spark is engineered from the bottom up for performance, running up to 100x faster than Hadoop by exploiting in-memory computing and other optimizations. Spark provides advanced analytic options like graph algorithms, machine learning, and streaming data; it has built-in APIs in multiple languages like Java, Scala, Python and R; and it offers good performance gains, running an application in a Hadoop cluster ten times faster on disk and 100 times faster in memory. Users can easily run Spark on top of Amazon's ____.

1) Name some sources from where the Spark Streaming component can process real-time data. Common sources include Apache Kafka, Apache Flume, Amazon Kinesis, Twitter, and plain TCP sockets.

3) What is the bottom layer of abstraction in the Spark Streaming API? The DStream.

Checkpoints are useful when the lineage graphs are long and have wide dependencies. i) The operation is an action if the return type is something other than an RDD. Spark is easier to program, as it comes with an interactive mode. Apache Mesos has rich resource scheduling capabilities and is well suited to run Spark along with other applications.
With the increasing demand from the industry to process big data at a faster pace, Apache Spark is gaining huge momentum when it comes to enterprise adoption.

1) What are the various kinds of operators provided by Spark GraphX? Property operators, structural operators, and join operators.

18) What are the benefits of using Spark with Apache Mesos? Mesos provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
Most data users know only SQL and are not good at programming.

2) Name some companies that are already using Spark Streaming. Companies such as Uber, Netflix, and Pinterest use Spark Streaming.

Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
Global temp views in Spark SQL are not tied to a particular Spark session, but can be shared across multiple Spark sessions. ii) The operation is a transformation if the return type is the same as the RDD's.
Pair RDDs allow users to access each key in parallel. map() returns the same number of records as were present in the input DataFrame, while the flatMap() transformation is also applied to each row of the dataset but returns a new, flattened dataset. Spark caches data in-memory and ensures low latency. The number of nodes can be decided by benchmarking the hardware and considering multiple factors such as optimal throughput (network speed), memory usage, the execution framework being used (YARN, Standalone or Mesos), and the other jobs running within that execution framework alongside Spark. The mask operator is used to construct a subgraph of the vertices and edges found in the input graph. Launching RDD actions like first() or count() begins parallel computation, which is then optimized and executed by Spark.

36) Is Apache Spark a good fit for reinforcement learning? No; Spark works well for simple machine learning algorithms like clustering, regression, and classification, but not for reinforcement learning.

In local mode, everything runs on the local node instead of being distributed. We provide Apache Spark objective questions (PySpark MCQs) along with Apache Spark interview questions and quizzes.
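The map()/flatMap() distinction can be sketched in plain Python, with a list of lines standing in for an RDD:

```python
# Pure-Python sketch of map() vs flatMap() on a dataset of lines.
from itertools import chain

lines = ["hello world", "hi"]

# map: exactly one output record per input record (same count as input)
mapped = [line.split(" ") for line in lines]
# -> [['hello', 'world'], ['hi']]

# flatMap: each record may yield zero or more records, then flattened
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
# -> ['hello', 'world', 'hi']

print(len(mapped), len(flat_mapped))  # 2 3
```

In PySpark the equivalents are `rdd.map(lambda l: l.split(" "))` and `rdd.flatMap(lambda l: l.split(" "))`; only flatMap changes the record count.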
Driver: the process that runs the main() method of the program to create RDDs and perform transformations and actions on them. Graph stores are used to store information about networks, such as social connections. Spark is intelligent in the manner in which it operates on data. The reverse method is used to return a new graph with the edge directions reversed. The data can be stored in the local file system, and can be loaded from the local file system and processed. Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data; it renders query results marked with meaningful error bars.
BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than on the original data, in order to reduce the time taken for query execution. The NoSQL database Apache Cassandra was developed at Facebook.

8) Can you use Spark to access and analyse data stored in Cassandra databases? Yes, it is possible by using the Spark Cassandra Connector.
You can trigger the clean-ups by setting the parameter spark.cleaner.ttl, or by dividing the long-running jobs into different batches and writing the intermediary results to disk. In MapReduce 1, tasktrackers run tasks and send progress reports. MongoDB is classified as a NoSQL database. Broadcast variables are immutable and lazily replicated across all nodes in the cluster when an action is triggered. Shark is a tool developed for people from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. Stateless transformations: processing of a batch does not depend on the output of the previous batch. By using the persist() method on a DStream, every RDD of that DStream is kept persistent in memory, and can be reused if the data in the DStream has to be computed on multiple times.
Cluster Manager: a pluggable component in Spark used to launch executors and drivers.
A framework in turn comprises the scheduler, which acts as a controller, and the executor, which carries out the work to be done.
Spark Core has all the basic functionalities of Spark: memory management, fault recovery, interacting with storage systems, scheduling tasks, and so on.
5) What are some key differences in the Python API (PySpark) compared to the original Apache Spark? Initially, a SparkConf object can be created with SparkConf(), which will load values from the spark.* Java system properties.
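A typical construction looks like the following configuration sketch (the application name, master URL, and memory value are illustrative placeholders; running it requires a PySpark installation):

```python
# Configuration sketch: building a SparkConf in PySpark.
# Values set explicitly here take precedence over spark.* Java system
# properties loaded when the SparkConf object is constructed.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExampleApp")            # illustrative app name
        .setMaster("local[2]")               # run locally with 2 cores
        .set("spark.executor.memory", "1g")) # illustrative memory setting

sc = SparkContext(conf=conf)
```

Once the SparkContext exists, its configuration is fixed; changing settings requires stopping the context and creating a new one.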
Actions are the means of sending results from the executors back to the driver. Transformations in Spark are not evaluated until you perform an action.
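This laziness can be illustrated with plain Python generators, which similarly defer all work until a terminal step (the "action") forces evaluation:

```python
# Lazy evaluation sketched with a Python generator: like a Spark
# transformation, nothing runs until a terminal step consumes it.
log = []

def tag(x):
    log.append(x)   # records when the work actually happens
    return x * 2

data = [1, 2, 3]
transformed = (tag(x) for x in data)  # "transformation": no work yet
assert log == []                      # nothing has been evaluated

result = list(transformed)            # "action": forces evaluation
print(result, log)  # [2, 4, 6] [1, 2, 3]
```

In Spark the analogous pair is `rdd.map(f)` (lazy) followed by an action such as `collect()` or `count()`, which triggers the actual computation.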
Apache Spark works well only for simple machine learning algorithms like clustering, regression, and classification.
The Capacity Scheduler provides queue elasticity in YARN. An accumulator provides a mutable variable that the Spark cluster can safely update on a per-row basis.
Transformations are functions executed on demand, to produce a new RDD.
@2014-2022 Crackyourinterview (All rights reserved).
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark; they represent the data coming into the system in object format. In collaboration with big data industry experts, we have curated a list of the top 50 Apache Spark interview questions and answers that will help students and professionals nail a big data developer interview and bridge the talent supply for Spark developers across various industry segments.