I am using numPartitions, lowerBound, upperBound in a Spark DataFrame to fetch large tables from Oracle to Hive, but unable to ingest the complete data


I am using numPartitions, lowerBound, and upperBound in a Spark DataFrame to fetch a large table from Oracle, but it gives me data equivalent to only one partition.
Suppose the number of partitions is 10 and the total number of records is 100; then I am able to ingest only 10 records (total records / numPartitions) into Hive.
Below is my code snippet:


import java.util.Properties

val hiveContext = SparkApp.getHiveContext("AppName")

// JDBC connection details
val jdbcUsername = "MYUSERNAME"
val jdbcPassword = "MYPASSWORD"
val jdbcDatabase = "DBNAME"
val jdbcUrl = "jdbc:oracle:thin:@//hostname:1522/servicename"

// Partitioning parameters
val lowerBound = 1
val totalRecords = 100
val partitions = 10
val orclTableName = "MYTEST_TABLE"
val columnName = "rownum"

val connectionProperties = new Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
connectionProperties.put("driver", "oracle.jdbc.driver.OracleDriver")

val orclTableDF = hiveContext.read.jdbc(
  url = jdbcUrl,
  table = orclTableName,
  columnName = columnName,
  lowerBound = lowerBound,
  upperBound = totalRecords,
  numPartitions = partitions,
  connectionProperties = connectionProperties)

orclTableDF.write.saveAsTable("MYTEST_NEW_TABLE")



Could you please let me know what I am missing?




1 Answer

Without the partitionColumn parameter the read won't be parallelized. Please provide the column name for the partition key, and make sure your data is evenly distributed across that key; otherwise you may run into data skew. If your data is not evenly partitioned, you can derive an evenly distributed partition key from rownum using the mod operator.
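For illustration, here is a minimal sketch of the mod-based approach; the SCHEMANAME.TABLENAME reference, the part_key alias, and the bounds are placeholders rather than values from the original post:

// Sketch: derive an evenly distributed partition key with MOD(ROWNUM, n)
// inside the query pushed down to Oracle, then partition on that key.
val numPartitions = 10
val partitionedQuery =
  s"(SELECT MOD(ROWNUM, $numPartitions) AS part_key, t.* FROM SCHEMANAME.TABLENAME t) tmp"

val df = hiveContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//hostname:1522/servicename")
  .option("dbtable", partitionedQuery)
  .option("user", "MYUSERNAME")
  .option("password", "MYPASSWORD")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("partitionColumn", "part_key")               // evenly distributed key derived via MOD
  .option("lowerBound", "0")                           // MOD values range from 0 ...
  .option("upperBound", numPartitions.toString)        // ... up to numPartitions - 1
  .option("numPartitions", numPartitions.toString)
  .load()

Because part_key takes each value 0 to numPartitions - 1 roughly the same number of times, every partition query pulls a similar share of the rows.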





In hiveContext.read.jdbc(), the columnName parameter is the same as the partitionColumn option for hiveContext.read.format("jdbc").option(). I have used hiveContext.read.format("jdbc").option() as well with the same parameters you mentioned, but I am facing the same problem.
– Akhil Kakkar
Jul 3 at 8:06






Below is the code snippet I used: val orclTableDF = hiveContext.read.format("jdbc").option("url", "jdbc:oracle:thin:@//MyHost:1522/servicename").option("dbtable", "orclTableName").option("user", "MYUSERNAME").option("password", "MYPASSWORD").option("partitionColumn", "rownum").option("lowerBound", "1").option("upperBound", "100").option("numPartitions", "10").option("driver", "oracle.jdbc.driver.OracleDriver").load()
– Akhil Kakkar
Jul 3 at 8:10






Does the rownum column already exist in your DB? I doubt it, because in Oracle rownum is a pseudocolumn used to generate a sequence number. So in your dbtable parameter, use the query below instead of the table name: "select rownum as rownseq, col1, col2 from schemaname.tablename". This will parallelise your query on rownseq; with numPartitions set to 10, Spark will create 10 parallel sessions.
– Chandan
Jul 3 at 9:20
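
A minimal sketch of that suggestion (the schema, table, and column names are placeholders, not from the original post):

// Sketch: alias ROWNUM inside the pushed-down subquery so each row gets a stable
// sequence value, then tell Spark to partition on that aliased column.
val query = "(SELECT ROWNUM AS rownseq, t.* FROM SCHEMANAME.TABLENAME t) tmp"

val orclTableDF = hiveContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//MyHost:1522/servicename")
  .option("dbtable", query)
  .option("user", "MYUSERNAME")
  .option("password", "MYPASSWORD")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("partitionColumn", "rownseq")   // partition on the aliased ROWNUM
  .option("lowerBound", "1")
  .option("upperBound", "100")            // total number of rows in the table
  .option("numPartitions", "10")
  .load()

orclTableDF.write.saveAsTable("MYTEST_NEW_TABLE")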





Did it work for you?
– Chandan
Jul 3 at 17:35







